polars.scan_parquet
- polars.scan_parquet(
- source: str | Path | list[str] | list[Path],
- *,
- n_rows: int | None = None,
- cache: bool = True,
- parallel: ParallelStrategy = 'auto',
- rechunk: bool = True,
- row_count_name: str | None = None,
- row_count_offset: int = 0,
- storage_options: dict[str, Any] | None = None,
- low_memory: bool = False,
- use_statistics: bool = True,
- hive_partitioning: bool = True,
- retries: int = 0,
- ) → LazyFrame
Lazily read from a local or cloud-hosted parquet file (or files).
This function allows the query optimizer to push down predicates and projections to the scan level, typically increasing performance and reducing memory overhead.
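For example, a filter and a column selection applied to the returned LazyFrame are pushed down into the scan, so only the matching row groups and columns are read. A minimal sketch, assuming a hypothetical file with columns id and value:
>>> import polars as pl
>>> lf = pl.scan_parquet("path/to/file.parquet")
>>> (
...     lf.filter(pl.col("id") > 100)  # predicate pushed down to the scan
...     .select(["id", "value"])  # projection pushed down to the scan
...     .collect()
... )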
- Parameters:
- source
Path(s) to a file. If a single path is given, it can be a globbing pattern.
- n_rows
Stop reading from parquet file after reading n_rows.
- cache
Cache the result after reading.
- parallel : {'auto', 'columns', 'row_groups', 'none'}
This determines the direction of parallelism. ‘auto’ will try to determine the optimal direction.
- rechunk
In case of reading multiple files via a glob pattern rechunk the final DataFrame into contiguous memory chunks.
- row_count_name
If not None, this will insert a row count column with the given name into the DataFrame (see the sketch after this parameter list).
- row_count_offset
Offset to start the row_count column (only used if the name is set).
- storage_options
Options that inform us how to connect to the cloud provider. If the cloud provider is not supported by us, the storage options are passed to fsspec.open(). Currently supported providers are: {'aws', 'gcp', 'azure'}. If storage_options is not provided, we will try to infer it from the environment variables.
- low_memory
Reduce memory pressure at the expense of performance.
- use_statistics
Use statistics in the parquet to determine if pages can be skipped from reading.
- hive_partitioning
Infer statistics and schema from a hive partitioned URL and use them to prune reads (a sketch follows this parameter list).
- retries
Number of retries if accessing a cloud instance fails.
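As a sketch of the row count parameters (the path, column name, and offset here are only illustrative), a row count column can be added at scan time:
>>> pl.scan_parquet(
...     "path/to/file.parquet",
...     row_count_name="row_nr",
...     row_count_offset=1,
... )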
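Likewise for hive partitioning: assuming a hypothetical directory layout such as path/to/dataset/year=2023/month=1/data.parquet, a filter on the partition columns lets the scan prune whole directories:
>>> lf = pl.scan_parquet(
...     "path/to/dataset/**/*.parquet", hive_partitioning=True
... )
>>> lf.filter(pl.col("year") == 2023).collect()  # reads only year=2023 files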
Examples
Scan a local Parquet file.
>>> pl.scan_parquet("path/to/file.parquet")
Scan a file on AWS S3.
>>> source = "s3://bucket/*.parquet"
>>> pl.scan_parquet(source)
>>> storage_options = {
...     "aws_access_key_id": "<secret>",
...     "aws_secret_access_key": "<secret>",
...     "aws_region": "us-east-1",
... }
>>> pl.scan_parquet(source, storage_options=storage_options)
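If storage_options is omitted, credentials are inferred from environment variables, as noted above; a sketch using the conventional AWS variable names (the values and the retries count are placeholders):
>>> import os
>>> os.environ["AWS_ACCESS_KEY_ID"] = "<secret>"
>>> os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret>"
>>> os.environ["AWS_REGION"] = "us-east-1"
>>> pl.scan_parquet("s3://bucket/*.parquet", retries=2)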