polars.read_parquet(
    source: str | Path | list[str] | list[Path] | IO[bytes] | bytes,
    columns: list[int] | list[str] | None = None,
    n_rows: int | None = None,
    row_index_name: str | None = None,
    row_index_offset: int = 0,
    parallel: ParallelStrategy = 'auto',
    use_statistics: bool = True,
    hive_partitioning: bool = True,
    rechunk: bool = True,
    low_memory: bool = False,
    storage_options: dict[str, Any] | None = None,
    retries: int = 0,
    use_pyarrow: bool = False,
    pyarrow_options: dict[str, Any] | None = None,
    memory_map: bool = True,
) → DataFrame

Read into a DataFrame from a parquet file.


Parameters:

source

Path to a file, or a file-like object (by "file-like object" we mean an object that has a read() method, such as a file handle opened via the builtin open function, or a BytesIO object). If the path is a directory, all files in that directory will be read.
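As a quick illustration, each accepted source form (the path data/example.parquet is hypothetical):

    import io
    from pathlib import Path

    import polars as pl

    # From a str or pathlib.Path; a directory path reads every file in it.
    df = pl.read_parquet("data/example.parquet")
    df = pl.read_parquet(Path("data/example.parquet"))

    # From a file-like object (anything with a read() method) or raw bytes.
    with open("data/example.parquet", "rb") as f:
        df = pl.read_parquet(f)
    df = pl.read_parquet(io.BytesIO(Path("data/example.parquet").read_bytes()))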


columns

Columns to select. Accepts a list of column indices (starting at zero) or a list of column names.
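For example, selecting by name or by zero-based index (the column names are illustrative):

    import polars as pl

    # Select two columns by name ...
    df = pl.read_parquet("data/example.parquet", columns=["id", "value"])

    # ... or the first and third columns by index.
    df = pl.read_parquet("data/example.parquet", columns=[0, 2])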


n_rows

Stop reading from the parquet file after reading n_rows rows. Only valid when use_pyarrow=False.
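A sketch of previewing a large file without loading it fully (path as above):

    import polars as pl

    # Read only the first 100 rows; requires the native reader (use_pyarrow=False).
    preview = pl.read_parquet("data/example.parquet", n_rows=100)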


row_index_name

Insert a row index column with the given name into the DataFrame as the first column. If set to None (default), no row index column is created.


row_index_offset

Start the row index at this offset. Cannot be negative. Only used if row_index_name is set.
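A sketch combining the two row-index options (the column name row_nr is arbitrary):

    import polars as pl

    # Prepends a column "row_nr" numbered 1, 2, 3, ...
    df = pl.read_parquet(
        "data/example.parquet",
        row_index_name="row_nr",
        row_index_offset=1,
    )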

parallel : {'auto', 'columns', 'row_groups', 'none'}

This determines the direction of parallelism. ‘auto’ will try to determine the optimal direction.
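If profiling shows that 'auto' picks a poor direction for a given file, the strategy can be pinned explicitly, e.g.:

    import polars as pl

    # Parallelize over columns rather than letting 'auto' decide.
    df = pl.read_parquet("data/example.parquet", parallel="columns")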


use_statistics

Use statistics in the parquet file to determine whether pages can be skipped during reading.


hive_partitioning

Infer statistics and schema from a Hive-partitioned URL and use them to prune reads.


rechunk

Make sure that all columns are contiguous in memory by aggregating the chunks into a single array.


low_memory

Reduce memory pressure at the expense of performance.


storage_options

Options that indicate how to connect to a cloud provider. If the cloud provider is not supported by Polars, the storage options are passed to fsspec.open().

The cloud providers currently supported are AWS, GCP, and Azure. See each provider's documentation for the supported configuration keys.

If storage_options is not provided, Polars will try to infer the information from environment variables.
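A sketch of reading from S3 with explicit credentials; the bucket, object key, and credential values are placeholders, and combining this with retries guards against transient failures:

    import polars as pl

    storage_options = {
        "aws_access_key_id": "<access-key-id>",
        "aws_secret_access_key": "<secret-access-key>",
        "aws_region": "us-east-1",
    }

    df = pl.read_parquet(
        "s3://my-bucket/data/example.parquet",
        storage_options=storage_options,
        retries=3,  # retry transient cloud errors up to 3 times
    )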


retries

Number of retries if accessing a cloud instance fails.


use_pyarrow

Use pyarrow instead of the Rust-native parquet reader. The pyarrow reader is more stable.


pyarrow_options

Keyword arguments for pyarrow.parquet.read_table.
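For example, forwarding a row-group filter to pyarrow.parquet.read_table (the column year is hypothetical):

    import polars as pl

    # Requires pyarrow to be installed; `filters` is a read_table keyword argument.
    df = pl.read_parquet(
        "data/example.parquet",
        use_pyarrow=True,
        pyarrow_options={"filters": [("year", "=", 2023)]},
    )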


memory_map

Memory map the underlying file. This will likely increase performance. Only used when use_pyarrow=True.



Notes

  • Partitioned files:

    If you have a directory-nested (hive-style) partitioned dataset, you should use the scan_pyarrow_dataset() method instead (see the first sketch after these notes).

  • When benchmarking:

    This operation defaults to a rechunk operation at the end, meaning that all data will be stored continuously in memory. Set rechunk=False if you are benchmarking the parquet-reader as rechunk can be an expensive operation that should not contribute to the timings.
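For the partitioned case, a minimal sketch, assuming a hive-style layout such as data/year=2023/month=1/part-0.parquet (the paths and column names are hypothetical):

    import polars as pl
    import pyarrow.dataset as ds

    # Let pyarrow discover the hive partition columns (here: year, month).
    dataset = ds.dataset("data/", partitioning="hive")
    df = pl.scan_pyarrow_dataset(dataset).filter(pl.col("year") == 2023).collect()

And for benchmarking, a sketch that keeps the rechunk cost out of the measured time:

    import time

    import polars as pl

    start = time.perf_counter()
    df = pl.read_parquet("data/example.parquet", rechunk=False)
    elapsed = time.perf_counter() - start
    print(f"read took {elapsed:.3f}s across {df.n_chunks()} chunks")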