polars.read_parquet

polars.read_parquet(source: str | Path | list[str] | list[Path] | IO[bytes] | bytes, *, columns: list[int] | list[str] | None = None, n_rows: int | None = None, row_index_name: str | None = None, row_index_offset: int = 0, parallel: ParallelStrategy = 'auto', use_statistics: bool = True, hive_partitioning: bool | None = None, glob: bool = True, hive_schema: SchemaDict | None = None, try_parse_hive_dates: bool = True, rechunk: bool = False, low_memory: bool = False, storage_options: dict[str, Any] | None = None, retries: int = 2, use_pyarrow: bool = False, pyarrow_options: dict[str, Any] | None = None, memory_map: bool = True) → DataFrame

Read a Parquet file into a DataFrame.

Parameters:

source

Path to a file or a file-like object (by “file-like object” we refer to objects that have a read() method, such as a file handle returned by the builtin open function, or a BytesIO instance). If the path is a directory, all files in that directory will be read. For file-like objects, the stream position may not be updated after reading.

columns

Columns to select. Accepts a list of column indices (starting at zero) or a list of column names.

n_rows

Stop reading from the Parquet file after n_rows rows have been read. Only valid when use_pyarrow=False.

row_index_name

Insert a row index column with the given name into the DataFrame as the first column. If set to None (default), no row index column is created.

row_index_offset

Start the row index at this offset. Cannot be negative. Only used if row_index_name is set.

parallel{‘auto’, ‘columns’, ‘row_groups’, ‘none’}

Determines the direction of parallelism. ‘auto’ will try to determine the optimal direction.

use_statistics

Use statistics in the Parquet file to determine whether pages can be skipped when reading.

hive_partitioning

Infer statistics and schema from a Hive-partitioned URL and use them to prune reads. This is unset by default (i.e. None), meaning it is automatically enabled when a single directory is passed, and otherwise disabled.

glob

Expand the given path using globbing rules.

hive_schema

The column names and data types of the columns by which the data is partitioned. If set to None (default), the schema of the Hive partitions is inferred.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

try_parse_hive_dates

Whether to try parsing hive values as date/datetime types.

rechunk

Make sure that all columns are contiguous in memory by aggregating the chunks into a single array.

low_memory

Reduce memory pressure at the expense of performance.

storage_options

Options that indicate how to connect to a cloud provider. If the cloud provider is not supported by Polars, the storage options are passed to fsspec.open().

The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

If storage_options is not provided, Polars will try to infer the information from environment variables.

retries

Number of retries if accessing a cloud instance fails.

use_pyarrow

Use PyArrow instead of the Rust-native Parquet reader. The PyArrow reader is more stable.

pyarrow_options

Keyword arguments for pyarrow.parquet.read_table.

memory_map

Memory-map the underlying file. This will likely increase performance. Only used when use_pyarrow=True.

Returns:

DataFrame