polars.scan_parquet

polars.scan_parquet(
source: str | Path | list[str] | list[Path],
*,
n_rows: int | None = None,
cache: bool = True,
parallel: ParallelStrategy = 'auto',
rechunk: bool = True,
row_count_name: str | None = None,
row_count_offset: int = 0,
storage_options: dict[str, Any] | None = None,
low_memory: bool = False,
use_statistics: bool = True,
hive_partitioning: bool = True,
retries: int = 0,
) → LazyFrame

Lazily read from a local or cloud-hosted parquet file (or files).

This function allows the query optimizer to push down predicates and projections to the scan level, typically increasing performance and reducing memory overhead.
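
For instance, a minimal sketch (the file path and column names are hypothetical): the filter and column selection below are pushed down into the scan when the query is collected.

>>> import polars as pl
>>> lf = pl.scan_parquet("path/to/file.parquet")
>>> lf.filter(pl.col("id") > 100).select(["id", "value"]).collect()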

Parameters:
source

Path(s) to a file. If a single path is given, it can be a globbing pattern (a glob sketch appears in the Examples below).

n_rows

Stop reading from the parquet file after reading n_rows.

cache

Cache the result after reading.

parallel : {'auto', 'columns', 'row_groups', 'none'}

This determines the direction of parallelism. ‘auto’ will try to determine the optimal direction.

rechunk

In case of reading multiple files via a glob pattern, rechunk the final DataFrame into contiguous memory chunks.

row_count_name

If not None, this will insert a row count column with the given name into the DataFrame (a short sketch follows this parameter list).

row_count_offset

Offset at which to start the row count column (only used if the name is set).

storage_options

Options that inform us how to connect to the cloud provider. If the cloud provider is not supported by us, the storage options are passed to fsspec.open(). Currently supported providers are: {'aws', 'gcp', 'azure'}.

If storage_options is not provided, we will try to infer the options from environment variables.

low_memory

Reduce memory pressure at the expense of performance.

use_statistics

Use statistics in the parquet file to determine if pages can be skipped from reading.

hive_partitioning

Infer statistics and schema from a hive-partitioned URL and use them to prune reads (see the hive-partitioning sketch in the Examples below).

retries

Number of retries if accessing a cloud instance fails.
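
As a sketch of the row count options (the path is hypothetical), the following inserts a row count column named "row_nr" that starts counting at 10:

>>> pl.scan_parquet(
...     "path/to/file.parquet",
...     row_count_name="row_nr",
...     row_count_offset=10,
... )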

Examples

Scan a local Parquet file.

>>> pl.scan_parquet("path/to/file.parquet")  
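
Scan all Parquet files matching a glob pattern, limiting the number of rows read with n_rows (a sketch; the directory layout is hypothetical).

>>> pl.scan_parquet("path/to/directory/*.parquet", n_rows=100)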

Scan a file on AWS S3.

>>> source = "s3://bucket/*.parquet"
>>> pl.scan_parquet(source)  
>>> storage_options = {
...     "aws_access_key_id": "<secret>",
...     "aws_secret_access_key": "<secret>",
...     "aws_region": "us-east-1",
... }
>>> pl.scan_parquet(source, storage_options=storage_options)
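
Scan a hive-partitioned dataset and filter on a partition column so that pruned partitions are never read. This is a hedged sketch; the partition layout (year=.../month=...) and column name are hypothetical.

>>> lf = pl.scan_parquet("path/to/dataset/year=*/month=*/*.parquet")
>>> lf.filter(pl.col("year") == 2023).collect()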