polars.LazyFrame.sink_parquet
LazyFrame.sink_parquet(
    path: str | Path,
    *,
    compression: str = 'zstd',
    compression_level: int | None = None,
    statistics: bool = False,
    row_group_size: int | None = None,
    data_pagesize_limit: int | None = None,
    maintain_order: bool = True,
    type_coercion: bool = True,
    predicate_pushdown: bool = True,
    projection_pushdown: bool = True,
    simplify_expression: bool = True,
    slice_pushdown: bool = True,
    no_optimization: bool = False,
) -> None
Evaluate the query in streaming mode and write to a Parquet file.

This allows streaming results that are larger than RAM to be written to disk.

Parameters:
path
    File path to which the file should be written.
compression : {'lz4', 'uncompressed', 'snappy', 'gzip', 'lzo', 'brotli', 'zstd'}
    Choose "zstd" for good compression performance. Choose "lz4" for fast compression/decompression. Choose "snappy" for better backwards compatibility with older Parquet readers.
compression_level
    The level of compression to use; higher levels produce smaller files on disk. Supported ranges per codec (an example is shown under Examples below):

    - "gzip": min-level 0, max-level 10
    - "brotli": min-level 0, max-level 11
    - "zstd": min-level 1, max-level 22
statistics
    Write statistics to the Parquet headers. This requires extra compute.
row_group_size
    Size of the row groups in number of rows. If None (default), the chunks of the DataFrame are used. Writing in smaller chunks may reduce memory pressure and improve writing speeds (see Examples below).
data_pagesize_limit
    Size limit of individual data pages. If not set, defaults to 1024 * 1024 bytes.
maintain_order
    Maintain the order in which data is processed. Setting this to False will be slightly faster (see Examples below).
type_coercion
    Do type coercion optimization.
predicate_pushdown
    Do predicate pushdown optimization.
projection_pushdown
    Do projection pushdown optimization.
simplify_expression
    Run simplify expressions optimization.
slice_pushdown
    Slice pushdown optimization.
no_optimization
    Turn off (certain) optimizations (see Examples below).
 
Returns:
    None
 
Examples

>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")
>>> lf.sink_parquet("out.parquet")
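
A sketch of choosing a codec and level via the compression and compression_level parameters documented above; the input and output paths here are placeholders:

>>> import polars as pl
>>> lf = pl.scan_csv("data/events.csv")  # hypothetical input path
>>> # zstd (the default) gives a good size/speed trade-off; higher levels
>>> # (1-22) shrink the file further at extra CPU cost
>>> lf.sink_parquet("events.parquet", compression="zstd", compression_level=10)
>>> # lz4 favors fast compression/decompression over file size
>>> lf.sink_parquet("events_lz4.parquet", compression="lz4")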
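
A sketch of tuning the write itself with row_group_size and statistics; the values and paths are illustrative, not recommendations:

>>> import polars as pl
>>> lf = pl.scan_csv("data/events.csv")  # hypothetical input path
>>> lf.sink_parquet(
...     "events.parquet",
...     row_group_size=100_000,  # rows per row group, instead of DataFrame chunks
...     statistics=True,  # write statistics to the headers; costs extra compute
... )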
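
If the row order of the output does not matter, the ordering guarantee can be dropped for a slightly faster sink, as noted under maintain_order; paths are placeholders:

>>> import polars as pl
>>> lf = pl.scan_csv("data/events.csv")  # hypothetical input path
>>> lf.sink_parquet("events_unordered.parquet", maintain_order=False)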
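
The optimization flags are all enabled by default and are mainly useful when debugging a query plan; a sketch with placeholder paths:

>>> import polars as pl
>>> lf = pl.scan_csv("data/events.csv")  # hypothetical input path
>>> lf.sink_parquet("debug.parquet", no_optimization=True)  # turn off (certain) optimizations
>>> lf.sink_parquet("no_pred.parquet", predicate_pushdown=False)  # or toggle a single pass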