polars.scan_parquet

polars.scan_parquet(
source: str | Path | list[str] | list[Path],
*,
n_rows: int | None = None,
row_index_name: str | None = None,
row_index_offset: int = 0,
parallel: ParallelStrategy = 'auto',
use_statistics: bool = True,
hive_partitioning: bool = True,
glob: bool = True,
hive_schema: SchemaDict | None = None,
rechunk: bool = False,
low_memory: bool = False,
cache: bool = True,
storage_options: dict[str, Any] | None = None,
retries: int = 0,
) -> LazyFrame

Lazily read from a local or cloud-hosted parquet file (or files).

This function allows the query optimizer to push down predicates and projections to the scan level, typically increasing performance and reducing memory overhead.

Parameters:
source

Path(s) to a file. If a single path is given, it can be a globbing pattern.

n_rows

Stop reading from the Parquet file after n_rows rows have been read.

row_index_name

If not None, insert a row index column with the given name into the DataFrame.

row_index_offset

Offset at which to start the row index column (only used if row_index_name is set).

parallel {‘auto’, ‘columns’, ‘row_groups’, ‘none’}

This determines the direction of parallelism. ‘auto’ will try to determine the optimal direction.

use_statistics

Use statistics in the Parquet file to determine whether pages can be skipped when reading.

hive_partitioning

Infer statistics and schema from a Hive-partitioned URL and use them to prune reads.

glob

Expand path given via globbing rules.

hive_schema

The column names and data types of the columns by which the data is partitioned. If set to None (default), the schema of the Hive partitions is inferred.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

rechunk

When reading multiple files via a glob pattern, rechunk the final DataFrame into contiguous memory chunks.

low_memory

Reduce memory pressure at the expense of performance.

cache

Cache the result after reading.

storage_options

Options that indicate how to connect to a cloud provider.

The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

If storage_options is not provided, Polars will try to infer the information from environment variables.

retries

Number of retries if accessing a cloud instance fails.

Examples

Scan a local Parquet file.

>>> pl.scan_parquet("path/to/file.parquet")  

Scan files on AWS S3 using a glob pattern.

>>> source = "s3://bucket/*.parquet"
>>> pl.scan_parquet(source)  
>>> storage_options = {
...     "aws_access_key_id": "<secret>",
...     "aws_secret_access_key": "<secret>",
...     "aws_region": "us-east-1",
... }
>>> pl.scan_parquet(source, storage_options=storage_options)