polars.LazyFrame.sink_parquet#

LazyFrame.sink_parquet( path: str | Path, *, compression: str = 'zstd', compression_level: int | None = None, statistics: bool = False, row_group_size: int | None = None, data_pagesize_limit: int | None = None, maintain_order: bool = True, type_coercion: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, no_optimization: bool = False, slice_pushdown: bool = True, ) → DataFrame[source]#

Persists a LazyFrame at the provided path.

This allows streaming results that are larger than RAM to be written to disk.

Parameters:

path

File path to which the file should be written.

compression{‘lz4’, ‘uncompressed’, ‘snappy’, ‘gzip’, ‘lzo’, ‘brotli’, ‘zstd’}

Choose “zstd” for good compression performance. Choose “lz4” for fast compression/decompression. Choose “snappy” for more backwards compatibility guarantees when you deal with older parquet readers.

compression_level

The level of compression to use. Higher compression means smaller files on disk.

“gzip” : min-level: 0, max-level: 10.
“brotli” : min-level: 0, max-level: 11.
“zstd” : min-level: 1, max-level: 22.

statistics

Write statistics to the parquet headers. This requires extra compute.

row_group_size

Size of the row groups in number of rows. If None (default), the chunks of the DataFrame are used. Writing in smaller chunks may reduce memory pressure and improve writing speeds. If None and use_pyarrow=True, the row group size will be the minimum of the DataFrame size and 64 * 1024 * 1024.

data_pagesize_limit

Size limit of individual data pages. If not set defaults to 1024 * 1024 bytes

maintain_order

Maintain the order in which data is processed. Setting this to False will be slightly faster.

type_coercion

Do type coercion optimization.

predicate_pushdown

Do predicate pushdown optimization.

projection_pushdown

Do projection pushdown optimization.

simplify_expression

Run simplify expressions optimization.

no_optimization

Turn off (certain) optimizations.

slice_pushdown

Slice pushdown optimization.

Returns:

DataFrame

Examples

>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")  
>>> lf.sink_parquet("out.parquet")