polars.scan_parquet#
polars.scan_parquet(
    source: str | Path | list[str] | list[Path],
    *,
    n_rows: int | None = None,
    cache: bool = True,
    parallel: ParallelStrategy = 'auto',
    rechunk: bool = True,
    row_count_name: str | None = None,
    row_count_offset: int = 0,
    storage_options: dict[str, Any] | None = None,
    low_memory: bool = False,
    use_statistics: bool = True,
    hive_partitioning: bool = True,
    retries: int = 0,
) → LazyFrame
Lazily read from a local or cloud-hosted parquet file (or files).
This function allows the query optimizer to push down predicates and projections to the scan level, typically increasing performance and reducing memory overhead.
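As a minimal sketch (the file path and column names are only illustrative), a filter and column selection applied to the returned LazyFrame are pushed down to the scan when the query is collected:
>>> import polars as pl
>>> lf = pl.scan_parquet("path/to/file.parquet")  # hypothetical path
>>> lf.filter(pl.col("a") > 10).select(["a", "b"]).collect()  # predicate and projection are applied at the scan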
- Parameters:
- source
Path(s) to a file. If a single path is given, it can be a globbing pattern.
- n_rows
Stop reading from the parquet file after reading n_rows.
- cache
Cache the result after reading.
- parallel : {‘auto’, ‘columns’, ‘row_groups’, ‘none’}
This determines the direction of parallelism. ‘auto’ will try to determine the optimal direction.
- rechunk
In case of reading multiple files via a glob pattern, rechunk the final DataFrame into contiguous memory chunks.
- row_count_name
If not None, this will insert a row count column with the given name into the DataFrame.
- row_count_offset
Offset to start the row_count column (only used if the name is set).
- storage_options
Options that inform us how to connect to the cloud provider. If the cloud provider is not supported by us, the storage options are passed to fsspec.open(). Currently supported providers are: {‘aws’, ‘gcp’, ‘azure’}. See the documentation of the respective provider for the supported keys. If storage_options are not provided, we will try to infer them from the environment variables.
- low_memory
Reduce memory pressure at the expense of performance.
- use_statistics
Use statistics in the parquet file to determine if pages can be skipped from reading.
- hive_partitioning
Infer statistics and schema from a hive-partitioned URL and use them to prune reads.
- retries
Number of retries if accessing a cloud instance fails.
See also
Examples
Scan a local Parquet file.
>>> pl.scan_parquet("path/to/file.parquet")
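As a further sketch (the path and values are assumed for illustration), stop after the first 100 rows and add a row count column named "row_nr":
>>> pl.scan_parquet(
...     "path/to/file.parquet",
...     n_rows=100,
...     row_count_name="row_nr",
... ).collect()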
Scan a file on AWS S3.
>>> source = "s3://bucket/*.parquet"
>>> pl.scan_parquet(source)
>>> storage_options = {
...     "aws_access_key_id": "<secret>",
...     "aws_secret_access_key": "<secret>",
...     "aws_region": "us-east-1",
... }
>>> pl.scan_parquet(source, storage_options=storage_options)
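As an assumed example of scanning a hive-partitioned dataset (the directory layout and the year column are hypothetical), partition values inferred from the paths are used to prune the files that are read:
>>> lf = pl.scan_parquet("path/to/dataset/**/*.parquet", hive_partitioning=True)
>>> lf.filter(pl.col("year") == 2023).collect()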