polars.DataFrame.write_parquet

DataFrame.write_parquet(file: str | Path | IO[bytes], *, compression: ParquetCompression = 'zstd', compression_level: int | None = None, statistics: bool | str | dict[str, bool] = True, row_group_size: int | None = None, data_page_size: int | None = None, use_pyarrow: bool = False, pyarrow_options: dict[str, Any] | None = None, partition_by: str | Sequence[str] | None = None, partition_chunk_size_bytes: int = 4294967296, storage_options: dict[str, Any] | None = None, credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto', retries: int = 2, metadata: ParquetMetadata | None = None, mkdir: bool = False) → None

Write the DataFrame to an Apache Parquet file.

Parameters:

file

File path or writable file-like object to which the result will be written. This should be a path to a directory if writing a partitioned dataset.

compression{‘lz4’, ‘uncompressed’, ‘snappy’, ‘gzip’, ‘brotli’, ‘zstd’}

Choose “zstd” for good compression performance. Choose “lz4” for fast compression/decompression. Choose “snappy” for better backward compatibility with older Parquet readers.

compression_level

The compression level to use. Higher levels generally produce smaller files on disk. If None, the default level of the chosen codec is used:

“gzip” : min-level: 0, max-level: 9, default: 6.
“brotli” : min-level: 0, max-level: 11, default: 1.
“zstd” : min-level: 1, max-level: 22, default: 3.

statistics

Write statistics to the parquet headers. This is the default behavior.

Possible values:

True: enable default set of statistics (default). Some statistics may be disabled.
False: disable all statistics
“full”: calculate and write all available statistics. Cannot be combined with use_pyarrow.
{ "statistic-key": True / False, ... }. Cannot be combined with use_pyarrow. Available keys:
- “min”: column minimum value (default: True)
- “max”: column maximum value (default: True)
- “distinct_count”: number of unique column values (default: False)
- “null_count”: number of null values in column (default: True)

row_group_size

Size of the row groups in number of rows. Defaults to 512^2 (262,144) rows.

data_page_size

Size of the data page in bytes. Defaults to 1024^2 (1,048,576) bytes.

use_pyarrow

Use the C++ Parquet implementation (via pyarrow) instead of the native Rust implementation. At the moment C++ supports more features.

pyarrow_options

Arguments passed to pyarrow.parquet.write_table.

If you pass partition_cols here, the dataset will be written using pyarrow.parquet.write_to_dataset. In that case the dataset is written to a directory, similar to Spark’s partitioned datasets.

partition_by

Column(s) to partition by. A partitioned dataset will be written if this is specified. This parameter is considered unstable and is subject to change.

partition_chunk_size_bytes

Approximate size to split DataFrames within a single partition when writing. Note this is calculated using the size of the DataFrame in memory - the size of the output file may differ depending on the file format / compression.

storage_options

Options that indicate how to connect to a cloud provider.

The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

aws
gcp
azure
Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.

If storage_options is not provided, Polars will try to infer the information from environment variables.

credential_provider

Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

retries

Number of retries if accessing a cloud instance fails.

metadata

A dictionary or callback to add key-values to the file-level Parquet metadata.

Warning

This functionality is considered experimental. It may be removed or changed at any point without it being considered a breaking change.

mkdir

Recursively create all the directories in the path.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Examples

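>>> import pathlib
>>> import polars as pl
>>>
>>> dirpath = pathlib.Path(".")  # assumption: any existing, writable directory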
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 4, 5],
...         "bar": [6, 7, 8, 9, 10],
...         "ham": ["a", "b", "c", "d", "e"],
...     }
... )
>>> path: pathlib.Path = dirpath / "new_file.parquet"
>>> df.write_parquet(path)

We can use pyarrow to write partitioned datasets by setting use_pyarrow=True and passing partition_cols in pyarrow_options. The following example will write the first row to ../watermark=1/*.parquet and the other rows to ../watermark=2/*.parquet.

>>> df = pl.DataFrame({"a": [1, 2, 3], "watermark": [1, 2, 2]})
>>> path: pathlib.Path = dirpath / "partitioned_object"
>>> df.write_parquet(
...     path,
...     use_pyarrow=True,
...     pyarrow_options={"partition_cols": ["watermark"]},
... )