polars.scan_parquet
polars.scan_parquet(
    source: str | Path | list[str] | list[Path],
    *,
    n_rows: int | None = None,
    cache: bool = True,
    parallel: ParallelStrategy = 'auto',
    rechunk: bool = True,
    row_count_name: str | None = None,
    row_count_offset: int = 0,
    storage_options: dict[str, Any] | None = None,
    low_memory: bool = False,
    use_statistics: bool = True,
    hive_partitioning: bool = True,
    retries: int = 0,
) -> LazyFrame
 Lazily read from a local or cloud-hosted parquet file (or files).
This function allows the query optimizer to push down predicates and projections to the scan level, typically increasing performance and reducing memory overhead.
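For example, a filter and a column selection applied to the returned LazyFrame are pushed down into the scan, so only the matching row groups and the selected columns are read. A minimal sketch (the file path and column names are hypothetical):

>>> import polars as pl
>>> lf = pl.scan_parquet("path/to/file.parquet")
>>> lf.filter(pl.col("year") > 2020).select(["year", "value"]).collect()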
- Parameters:
 - source
 Path(s) to a file. If a single path is given, it can be a globbing pattern.
- n_rows
 Stop reading from parquet file after reading n_rows.
- cache
 Cache the result after reading.
- parallel : {‘auto’, ‘columns’, ‘row_groups’, ‘none’}
 This determines the direction of parallelism. ‘auto’ will try to determine the optimal direction.
- rechunk
 In case of reading multiple files via a glob pattern, rechunk the final DataFrame into contiguous memory chunks.
- row_count_name
 If not None, this will insert a row count column with the given name into the DataFrame (see the sketch after this parameter list).
- row_count_offset
 Offset to start the row_count column (only used if the name is set).
- storage_options
 Options that inform us how to connect to the cloud provider. If the cloud provider is not supported by us, the storage options are passed to fsspec.open(). Currently supported providers are: {‘aws’, ‘gcp’, ‘azure’}. If storage_options is not provided, we will try to infer them from the environment variables.
- low_memory
 Reduce memory pressure at the expense of performance.
- use_statistics
 Use statistics in the parquet to determine if pages can be skipped from reading.
- hive_partitioning
 Infer statistics and schema from a hive partitioned URL and use them to prune reads.
- retries
 Number of retries if accessing a cloud instance fails.
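For instance, the row count parameters combine with n_rows as in the following sketch (the file path is hypothetical):

>>> pl.scan_parquet(
...     "path/to/file.parquet",
...     n_rows=100,
...     row_count_name="row_nr",
...     row_count_offset=10,
... ).collect()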
Examples
Scan a local Parquet file.
>>> pl.scan_parquet("path/to/file.parquet")
Scan a file on AWS S3.
>>> source = "s3://bucket/*.parquet"
>>> pl.scan_parquet(source)
>>> storage_options = {
...     "aws_access_key_id": "<secret>",
...     "aws_secret_access_key": "<secret>",
...     "aws_region": "us-east-1",
... }
>>> pl.scan_parquet(source, storage_options=storage_options)
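Scan a hive-partitioned dataset and let the partition columns prune reads. The directory layout and column name here are assumptions for illustration:

>>> source = "dataset/**/*.parquet"
>>> pl.scan_parquet(source, hive_partitioning=True).filter(
...     pl.col("year") == 2023
... ).collect()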