polars.read_parquet
- polars.read_parquet(
 - source: str | Path | BinaryIO | BytesIO | bytes,
 - *,
 - columns: list[int] | list[str] | None = None,
 - n_rows: int | None = None,
 - use_pyarrow: bool = False,
 - memory_map: bool = True,
 - storage_options: dict[str, Any] | None = None,
 - parallel: ParallelStrategy = 'auto',
 - row_count_name: str | None = None,
 - row_count_offset: int = 0,
 - low_memory: bool = False,
 - pyarrow_options: dict[str, Any] | None = None,
 - use_statistics: bool = True,
 - rechunk: bool = True,
- ) → DataFrame
 Read into a DataFrame from a parquet file.
- Parameters:
 - source
 Path to a file, or a file-like object. If the path is a directory, files in that directory will all be read. If fsspec is installed, it will be used to open remote files.
- columns
 Columns to select. Accepts a list of column indices (starting at zero) or a list of column names.
- n_rows
 Stop reading from the parquet file after reading n_rows. Only valid when use_pyarrow=False.
- use_pyarrow
 Use pyarrow instead of the Rust native parquet reader. The pyarrow reader is more stable.
- memory_map
 Memory map the underlying file. This will likely increase performance. Only used when use_pyarrow=True.
- storage_options
 Extra options that make sense for fsspec.open() or a particular storage connection, e.g. host, port, username, password, etc. (see the remote-read sketch below).
- parallel : {‘auto’, ‘columns’, ‘row_groups’, ‘none’}
 This determines the direction of parallelism. ‘auto’ will try to determine the optimal direction.
- row_count_name
 If not None, this will insert a row count column with the given name into the DataFrame.
- row_count_offset
 Offset to start the row_count column (only used if the name is set).
- low_memory
 Reduce memory pressure at the expense of performance.
- pyarrow_options
 Keyword arguments for pyarrow.parquet.read_table.
- use_statistics
 Use statistics in the parquet to determine if pages can be skipped from reading.
- rechunk
 Make sure that all columns are contiguous in memory by aggregating the chunks into a single array.
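A minimal usage sketch combining several of the parameters above; the file path and column names are hypothetical placeholders, not values defined by polars:

```python
import polars as pl

# Hypothetical local file and column names.
df = pl.read_parquet(
    "data/trades.parquet",
    columns=["ticker", "price"],  # select by name (or by zero-based index)
    n_rows=1_000,                 # stop after 1_000 rows (native reader only)
    row_count_name="row_nr",      # prepend a row-count column
)
print(df.shape)  # at most (1000, 3): "row_nr" plus the two selected columns
```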
- Returns:
 - DataFrame
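If source points at an object store and fsspec is installed, storage_options is forwarded to the underlying fsspec filesystem. A hedged remote-read sketch, assuming an s3fs-backed bucket; the bucket, path, and credential keys are placeholders and the accepted keys depend on the fsspec implementation, not on polars:

```python
import polars as pl

# Hypothetical bucket and path; the keys below are the ones understood by
# s3fs and are placeholders, not options read_parquet itself defines.
df = pl.read_parquet(
    "s3://my-bucket/events/2023.parquet",
    storage_options={
        "key": "<AWS_ACCESS_KEY_ID>",
        "secret": "<AWS_SECRET_ACCESS_KEY>",
    },
)
```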
 
Notes
- Partitioned files:
 If you have a directory-nested (hive-style) partitioned dataset, you should use the scan_pyarrow_dataset() method instead (see the first sketch below).
- When benchmarking:
 This operation defaults to a rechunk operation at the end, meaning that all data will be stored contiguously in memory. Set rechunk=False if you are benchmarking the parquet reader, as rechunk can be an expensive operation that should not contribute to the timings (see the second sketch below).
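A minimal sketch of the partitioned case, assuming a hypothetical hive-style directory layout such as data/year=2023/month=01/part-0.parquet:

```python
import pyarrow.dataset as ds

import polars as pl

# Hypothetical hive-partitioned directory; partition keys become columns.
dset = ds.dataset("data/", format="parquet", partitioning="hive")
lf = pl.scan_pyarrow_dataset(dset)  # lazy scan over the whole dataset
df = lf.filter(pl.col("year") == 2023).collect()
```

And a sketch of the benchmarking note, timing only the reader by disabling the final rechunk; the file path is again a placeholder:

```python
import time

import polars as pl

start = time.perf_counter()
df = pl.read_parquet("data/trades.parquet", rechunk=False)  # skip the final rechunk
elapsed = time.perf_counter() - start
print(f"read {df.height} rows in {elapsed:.3f}s (rechunk excluded)")
```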