polars.scan_parquet
polars.scan_parquet(
    source: ScanSource,
    *,
    n_rows: int | None = None,
    row_index_name: str | None = None,
    row_index_offset: int = 0,
    parallel: ParallelStrategy = 'auto',
    use_statistics: bool = True,
    hive_partitioning: bool | None = None,
    glob: bool = True,
    schema: SchemaDict | None = None,
    hive_schema: SchemaDict | None = None,
    try_parse_hive_dates: bool = True,
    rechunk: bool = False,
    low_memory: bool = False,
    cache: bool = True,
    storage_options: dict[str, Any] | None = None,
    credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto',
    retries: int = 2,
    include_file_paths: str | None = None,
    allow_missing_columns: bool = False,
) -> LazyFrame
Lazily read from a local or cloud-hosted parquet file (or files).
This function allows the query optimizer to push down predicates and projections to the scan level, typically increasing performance and reducing memory overhead.
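For example, in the query sketched below (the file path and column names are hypothetical), the filter and column selection are applied during the scan rather than after the file is loaded:
>>> (
...     pl.scan_parquet("path/to/file.parquet")
...     .filter(pl.col("id") > 100)
...     .select("id", "value")
...     .collect()
... )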
Parameters:
- source
Path(s) to a file or directory. When needing to authenticate for scanning cloud locations, see the storage_options parameter.
- n_rows
Stop reading from the parquet file after reading n_rows.
- row_index_name
If not None, this will insert a row index column with the given name into the DataFrame.
- row_index_offset
Offset to start the row index column (only used if the name is set)
- parallel : {‘auto’, ‘columns’, ‘row_groups’, ‘prefiltered’, ‘none’}
This determines the direction and strategy of parallelism. ‘auto’ will try to determine the optimal direction.
The prefiltered strategy first evaluates the pushed-down predicates in parallel and determines a mask of which rows to read. Then, it parallelizes over both the columns and the row groups while filtering out rows that do not need to be read. This can provide significant speedups for large files (i.e. many row-groups) with a predicate that filters clustered rows or filters heavily. In other cases, prefiltered may slow down the scan compared to other strategies.
The prefiltered strategy falls back to auto if no predicate is given.
Warning
The prefiltered strategy is considered unstable. It may be changed at any point without it being considered a breaking change.
- use_statistics
Use statistics in the parquet to determine if pages can be skipped from reading.
- hive_partitioning
Infer statistics and schema from hive partitioned URL and use them to prune reads.
- glob
Expand path given via globbing rules.
- schema
Specify the datatypes of the columns. The datatypes must match the datatypes in the file(s). If there are extra columns that are not in the file(s), consider also enabling allow_missing_columns.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- hive_schema
The column names and data types of the columns by which the data is partitioned. If set to None (default), the schema of the Hive partitions is inferred.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- try_parse_hive_dates
Whether to try parsing hive values as date/datetime types.
- rechunk
In case of reading multiple files via a glob pattern, rechunk the final DataFrame into contiguous memory chunks.
- low_memory
Reduce memory pressure at the expense of performance.
- cache
Cache the result after reading.
- storage_options
Options that indicate how to connect to a cloud provider.
The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables.
- credential_provider
Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- retries
Number of retries if accessing a cloud instance fails.
- include_file_paths
Include the path of the source file(s) as a column with this name.
- allow_missing_columns
When reading a list of parquet files, if a column existing in the first file cannot be found in subsequent files, the default behavior is to raise an error. However, if allow_missing_columns is set to True, a full-NULL column is returned instead of erroring for the files that do not contain the column.
Examples
Scan a local Parquet file.
>>> pl.scan_parquet("path/to/file.parquet")
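Add a row index while scanning (a sketch; the file path and index name are hypothetical).
>>> pl.scan_parquet(
...     "path/to/file.parquet",
...     row_index_name="row_nr",
...     row_index_offset=1,
... )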
Scan a file on AWS S3.
>>> source = "s3://bucket/*.parquet"
>>> pl.scan_parquet(source)
>>> storage_options = {
...     "aws_access_key_id": "<secret>",
...     "aws_secret_access_key": "<secret>",
...     "aws_region": "us-east-1",
... }
>>> pl.scan_parquet(source, storage_options=storage_options)
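Scan a hive-partitioned directory and record the originating file for each row (a sketch; the directory layout and the source_file column name are assumptions).
>>> pl.scan_parquet(
...     "path/to/dataset/",
...     hive_partitioning=True,
...     include_file_paths="source_file",
... )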
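Authenticate with a custom credential_provider function. This is a minimal sketch based on the parameter description above (the function returns credential keys plus an optional expiry time); the function name and AWS-style keys shown are placeholders.
>>> def my_credentials():
...     # Return (credential keys, optional expiry time); values are placeholders.
...     return {
...         "aws_access_key_id": "<secret>",
...         "aws_secret_access_key": "<secret>",
...     }, None
>>> pl.scan_parquet("s3://bucket/*.parquet", credential_provider=my_credentials)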