polars.read_parquet

polars.read_parquet(source: FileSource, *, columns: list[int] | list[str] | None = None, n_rows: int | None = None, row_index_name: str | None = None, row_index_offset: int = 0, parallel: ParallelStrategy = 'auto', use_statistics: bool = True, hive_partitioning: bool | None = None, glob: bool = True, schema: SchemaDict | None = None, hive_schema: SchemaDict | None = None, try_parse_hive_dates: bool = True, rechunk: bool = False, low_memory: bool = False, storage_options: StorageOptionsDict | None = None, credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto', retries: int | None = None, use_pyarrow: bool = False, pyarrow_options: dict[str, Any] | None = None, memory_map: bool = True, include_file_paths: str | None = None, missing_columns: Literal['insert', 'raise'] = 'raise', allow_missing_columns: bool | None = None) → DataFrame

Read into a DataFrame from a parquet file.

Changed in version 0.20.4: The row_count_name parameter was renamed row_index_name, and the row_count_offset parameter was renamed row_index_offset.

Parameters:

source

Path(s) to a file or directory. To authenticate when reading from cloud locations, see the storage_options parameter.

File-like objects are supported (by “file-like object” we refer to objects that have a read() method, such as a file handle returned by the builtin open function, or a BytesIO instance). For file-like objects, the stream position may not be updated after reading.

columns

Columns to select. Accepts a list of column indices (starting at zero) or a list of column names.

n_rows

Stop reading from the Parquet file after n_rows rows have been read. Only valid when use_pyarrow=False.

row_index_name

Insert a row index column with the given name into the DataFrame as the first column. If set to None (default), no row index column is created.

row_index_offset

Start the row index at this offset. Cannot be negative. Only used if row_index_name is set.

parallel {‘auto’, ‘columns’, ‘row_groups’, ‘none’}

The direction of parallelism: parallelize over columns, over row groups, or not at all (‘none’). ‘auto’ will try to determine the optimal direction.

use_statistics

Use statistics in the Parquet file to determine whether pages can be skipped when reading.

hive_partitioning

Infer statistics and schema from a Hive-partitioned URL and use them to prune reads. This is unset by default (i.e. None), meaning it is automatically enabled when a single directory is passed, and otherwise disabled.

glob

Expand the given path using globbing rules.

schema

Specify the datatypes of the columns. The datatypes must match the datatypes in the file(s). If there are extra columns that are not in the file(s), consider also passing missing_columns='insert'.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

hive_schema

The column names and data types of the columns by which the data is partitioned. If set to None (default), the schema of the Hive partitions is inferred.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

try_parse_hive_dates

Whether to try parsing hive values as date/datetime types.

rechunk

Make sure that all columns are contiguous in memory by aggregating the chunks into a single array.

low_memory

Reduce memory pressure at the expense of performance.

storage_options

Options that indicate how to connect to a cloud provider.

The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

aws
gcp
azure
Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.

If storage_options is not provided, Polars will try to infer the information from environment variables.

credential_provider

Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

retries

Number of retries if accessing a cloud instance fails.

Deprecated since version 1.37.1: Pass {"max_retries": n} via storage_options instead.

use_pyarrow

Use PyArrow instead of the Rust-native Parquet reader. The PyArrow reader is more stable.

pyarrow_options

Keyword arguments for pyarrow.parquet.read_table.

memory_map

Memory-map the underlying file. This will likely increase performance. Only used when use_pyarrow=True.

include_file_paths

Include the path of the source file(s) as a column with this name. Only valid when use_pyarrow=False.

missing_columns

Configuration for behavior when columns defined in the schema are missing from the data:

insert: Inserts the missing columns using NULLs as the row values.
raise: Raises an error.

allow_missing_columns

When reading a list of parquet files, if a column existing in the first file cannot be found in subsequent files, the default behavior is to raise an error. However, if allow_missing_columns is set to True, a full-NULL column is returned instead of erroring for the files that do not contain the column.

Deprecated since version 1.30.0: Use the parameter missing_columns instead and pass one of ('insert', 'raise').

Returns:

DataFrame

Warning

Calling read_parquet().lazy() is an antipattern: it forces Polars to materialize the full Parquet file, so no optimizations can be pushed into the reader. Always prefer scan_parquet if you want to work with LazyFrames.