polars.read_parquet#
- polars.read_parquet(
- source: str | Path | IO[bytes] | bytes | list[str] | list[Path] | list[IO[bytes]] | list[bytes],
- *,
- columns: list[int] | list[str] | None = None,
- n_rows: int | None = None,
- row_index_name: str | None = None,
- row_index_offset: int = 0,
- parallel: ParallelStrategy = 'auto',
- use_statistics: bool = True,
- hive_partitioning: bool | None = None,
- glob: bool = True,
- schema: SchemaDict | None = None,
- hive_schema: SchemaDict | None = None,
- try_parse_hive_dates: bool = True,
- rechunk: bool = False,
- low_memory: bool = False,
- storage_options: dict[str, Any] | None = None,
- retries: int = 2,
- use_pyarrow: bool = False,
- pyarrow_options: dict[str, Any] | None = None,
- memory_map: bool = True,
- include_file_paths: str | None = None,
- allow_missing_columns: bool = False,
- ) → DataFrame
Read into a DataFrame from a parquet file.
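A minimal sketch of the common case, reading a single local file eagerly into memory (the file name is hypothetical):

```python
import polars as pl

# Hypothetical local file; read it eagerly into a DataFrame.
df = pl.read_parquet("data.parquet")
print(df.schema)
```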
- Parameters:
- source
Path(s) to a file or directory. When needing to authenticate for scanning cloud locations, see the storage_options parameter. File-like objects are supported (by "file-like object" we refer to objects that have a read() method, such as a file handle returned by the builtin open function, or a BytesIO instance). For file-like objects, the stream position may not be updated accordingly after reading.
- columns
Columns to select. Accepts a list of column indices (starting at zero) or a list of column names.
- n_rows
Stop reading from the parquet file after reading n_rows rows. Only valid when use_pyarrow=False; see the sketch after this parameter list for an example.
- row_index_name
Insert a row index column with the given name into the DataFrame as the first column. If set to None (default), no row index column is created.
- row_index_offset
Start the row index at this offset. Cannot be negative. Only used if row_index_name is set.
- parallel {'auto', 'columns', 'row_groups', 'none'}
This determines the direction of parallelism. ‘auto’ will try to determine the optimal direction.
- use_statistics
Use statistics in the parquet to determine if pages can be skipped from reading.
- hive_partitioning
Infer statistics and schema from Hive partitioned URLs and use them to prune reads. This is unset by default (i.e. None), meaning it is automatically enabled when a single directory is passed, and otherwise disabled.
- glob
Expand path(s) given via globbing rules.
- schema
Specify the datatypes of the columns. The datatypes must match the datatypes in the file(s). If there are extra columns that are not in the file(s), consider also enabling allow_missing_columns.
Warning: This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- hive_schema
The column names and data types of the columns by which the data is partitioned. If set to None (default), the schema of the Hive partitions is inferred.
Warning: This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- try_parse_hive_dates
Whether to try parsing Hive partition values as date/datetime types.
- rechunk
Make sure that all columns are contiguous in memory by aggregating the chunks into a single array.
- low_memory
Reduce memory pressure at the expense of performance.
- storage_options
Options that indicate how to connect to a cloud provider.
The cloud providers currently supported are AWS, GCP, and Azure. See the respective provider documentation for the supported keys.
Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables; see the cloud-storage sketch below for an example of passing credentials explicitly.
- retries
Number of retries if accessing a cloud instance fails.
- use_pyarrow
Use PyArrow instead of the Rust-native Parquet reader. The PyArrow reader is more stable.
- pyarrow_options
Keyword arguments for pyarrow.parquet.read_table.
- memory_map
Memory map the underlying file. This will likely increase performance. Only used when use_pyarrow=True.
- include_file_paths
Include the path of the source file(s) as a column with this name. Only valid when use_pyarrow=False.
- allow_missing_columns
When reading a list of parquet files, if a column existing in the first file cannot be found in subsequent files, the default behavior is to raise an error. However, if allow_missing_columns is set to True, a full-NULL column is returned instead of erroring for the files that do not contain the column.
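As referenced in the n_rows entry, the sketch below combines several of the parameters described above; the file pattern and column names are hypothetical and chosen only for illustration:

```python
import polars as pl

# Hypothetical files and column names, used only to illustrate the parameters above.
df = pl.read_parquet(
    "events-*.parquet",                # glob=True (default) expands the pattern
    columns=["user_id", "ts"],         # select by name; a list of indices such as [0, 2] also works
    n_rows=1_000,                      # stop after reading 1,000 rows
    row_index_name="row_nr",           # prepend a row index column starting at...
    row_index_offset=10,               # ...this offset
    include_file_paths="source_file",  # add a column with the originating file path
    allow_missing_columns=True,        # files missing a selected column yield a full-NULL column
)
```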
- Returns:
- DataFrame
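As referenced in the storage_options entry, a sketch of reading a Hive-partitioned dataset from cloud storage; the bucket path, partition layout, and credential keys are assumptions (the accepted keys depend on the provider):

```python
import polars as pl

# Hypothetical S3 location and credentials; passing a directory enables
# hive_partitioning by default, so the explicit flag below is only for clarity.
df = pl.read_parquet(
    "s3://my-bucket/warehouse/events/",
    storage_options={
        # These keys are illustrative; consult your provider's documentation.
        "aws_access_key_id": "...",
        "aws_secret_access_key": "...",
        "aws_region": "eu-west-1",
    },
    hive_partitioning=True,
    try_parse_hive_dates=True,  # e.g. parse a date=2024-01-01 partition into a Date column
    retries=5,
)
```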
See also