polars.LazyFrame.sink_parquet
- LazyFrame.sink_parquet(
- path: str | Path,
- *,
- compression: str = 'zstd',
- compression_level: int | None = None,
- statistics: bool | str | dict[str, bool] = True,
- row_group_size: int | None = None,
- data_page_size: int | None = None,
- maintain_order: bool = True,
- type_coercion: bool = True,
- _type_check: bool = True,
- predicate_pushdown: bool = True,
- projection_pushdown: bool = True,
- simplify_expression: bool = True,
- slice_pushdown: bool = True,
- collapse_joins: bool = True,
- no_optimization: bool = False,
- storage_options: dict[str, Any] | None = None,
- credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto',
- retries: int = 2,
- sync_on_close: SyncOnCloseMethod | None = None,
- mkdir: bool = False,
- lazy: bool = False,
- engine: EngineType = 'auto',
- )
Evaluate the query in streaming mode and write to a Parquet file.
Warning
Streaming mode is considered unstable. It may be changed at any point without it being considered a breaking change.
This allows streaming results that are larger than RAM to be written to disk.
- Parameters:
- path
File path to which the file should be written.
- compression: {‘lz4’, ‘uncompressed’, ‘snappy’, ‘gzip’, ‘lzo’, ‘brotli’, ‘zstd’}
Choose “zstd” for good compression performance. Choose “lz4” for fast compression/decompression. Choose “snappy” for better compatibility with older Parquet readers.
- compression_level
The level of compression to use. Higher compression means smaller files on disk. (See the tuning sketch in the Examples section.)
“gzip” : min-level: 0, max-level: 10.
“brotli” : min-level: 0, max-level: 11.
“zstd” : min-level: 1, max-level: 22.
- statistics
Write statistics to the parquet headers. This is the default behavior.
Possible values:
- True: enable the default set of statistics (default). Some statistics may be disabled.
- False: disable all statistics.
- “full”: calculate and write all available statistics. Cannot be combined with use_pyarrow.
- { "statistic-key": True / False, ... }: enable or disable individual statistics. Cannot be combined with use_pyarrow. Available keys:
  - “min”: column minimum value (default: True)
  - “max”: column maximum value (default: True)
  - “distinct_count”: number of unique column values (default: False)
  - “null_count”: number of null values in column (default: True)
A dictionary example appears in the Examples section below.
- row_group_size
Size of the row groups in number of rows. If None (default), the chunks of the DataFrame are used. Writing in smaller chunks may reduce memory pressure and improve writing speeds.
- data_page_size
Size limit of individual data pages. If not set, defaults to 1024 * 1024 bytes.
- maintain_order
Maintain the order in which data is processed. Setting this to False will be slightly faster.
- type_coercion
Do type coercion optimization.
- predicate_pushdown
Do predicate pushdown optimization.
- projection_pushdown
Do projection pushdown optimization.
- simplify_expression
Run simplify expressions optimization.
- slice_pushdown
Slice pushdown optimization.
- collapse_joins
Collapse a join and filters into a faster join.
- no_optimization
Turn off (certain) optimizations.
- storage_options
Options that indicate how to connect to a cloud provider.
The cloud providers currently supported are AWS, GCP, and Azure. See the supported keys for each provider in their respective documentation.
- Hugging Face (hf://): accepts an API key under the token parameter, e.g. {'token': '...'}, or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables. (A cloud sink sketch is included in the Examples section.)
- credential_provider
Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- retries
Number of retries if accessing a cloud instance fails.
- sync_on_close: { None, ‘data’, ‘all’ }
Sync to disk before closing a file.
- None: does not sync.
- “data”: syncs the file contents.
- “all”: syncs the file contents and metadata.
- mkdir: bool
Recursively create all the directories in the path.
- lazy: bool
Wait to start execution until collect is called (see the lazy example below).
- engine
Select the engine used to process the query, optional. At the moment, if set to "auto" (default), the query is run using the polars streaming engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars streaming engine.
Note
The GPU engine is currently not supported.
- Returns:
- DataFrame
Examples
>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")
>>> lf.sink_parquet("out.parquet")
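The sketches below are illustrative, not authoritative recipes. First, tuning compression, compression_level, and row_group_size; the output name and values are placeholders chosen for illustration:
>>> lf.sink_parquet(
...     "out_zstd.parquet",
...     compression="zstd",
...     compression_level=10,  # zstd accepts levels 1-22
...     row_group_size=100_000,  # smaller row groups may reduce memory pressure
... )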
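A sketch of toggling individual statistics via a dictionary, using the keys documented above; the output name is a placeholder:
>>> lf.sink_parquet(
...     "out_stats.parquet",
...     statistics={
...         "min": True,
...         "max": True,
...         "distinct_count": False,  # slower to compute, off by default
...         "null_count": True,
...     },
... )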
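A sketch of sinking to cloud object storage. The bucket URI and credential keys are hypothetical placeholders; the keys actually accepted depend on the provider, and credentials can also be inferred from environment variables or supplied via credential_provider:
>>> lf.sink_parquet(
...     "s3://my-bucket/out.parquet",  # hypothetical bucket
...     storage_options={
...         "aws_access_key_id": "...",      # placeholder
...         "aws_secret_access_key": "...",  # placeholder
...         "aws_region": "us-east-1",       # placeholder
...     },
...     retries=2,
... )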
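A sketch combining mkdir and lazy=True, assuming that with lazy=True the sink is only registered and execution is deferred until collect is called; the nested output path is a placeholder:
>>> sink = lf.sink_parquet(
...     "output/nested/dir/out.parquet",  # placeholder path
...     mkdir=True,  # recursively create missing directories
...     lazy=True,   # defer execution
... )
>>> sink.collect()  # the query runs and the file is written here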