polars.PartitionParted#

class polars.PartitionParted(
base_path: str | Path,
*,
file_path: Callable[[KeyedPartitionContext], Path | str | IO[bytes] | IO[str]] | None = None,
by: str | Expr | Sequence[str | Expr] | Mapping[str, Expr],
include_key: bool = True,
)[source]#

Partitioning scheme to split parted dataframes.

This is a specialized version of PartitionByKey. Whereas PartitionByKey accepts data in any order, this scheme expects the input data to be pre-grouped or pre-sorted. This scheme incurs far less overhead than PartitionByKey, but may not always be applicable.

Each new value of the key expressions starts a new partition; therefore, repeating the same value multiple times may overwrite previous partitions.
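The parted semantics above can be illustrated with plain Python using itertools.groupby (an illustrative sketch, not Polars internals): each run of consecutive equal key values becomes one partition, so a key value that reappears after a different value starts a fresh partition under the same name, overwriting the earlier one.

```python
from itertools import groupby

# Sketch of parted semantics: each consecutive run of equal key
# values forms one partition. Note "a" reappears after "b".
rows = [("a", 1), ("a", 2), ("b", 3), ("a", 4)]

partitions = {}
for key, group in groupby(rows, key=lambda r: r[0]):
    # A repeated key writes to the same partition name,
    # overwriting what the earlier run produced.
    partitions[key] = list(group)

# Partition "a" now holds only the last run [("a", 4)];
# the first run [("a", 1), ("a", 2)] was overwritten.
print(partitions)
```

Sorting or grouping the input by the key expressions beforehand ensures each key value occurs in exactly one run, avoiding such overwrites.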

Warning

This functionality is currently considered unstable. It may be changed at any point without it being considered a breaking change.

Parameters:
base_path

The base path for the output files.

Use the mkdir option on the sink_* methods to ensure directories in the path are created.

file_path

A callback to register or modify the output path for each partition, relative to the base_path. The callback receives a polars.io.partition.KeyedPartitionContext that contains information about the partition.

If no callback is given, it defaults to {ctx.keys.hive_dirs()}/{ctx.in_part_idx}.{EXT}.
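A custom file_path callback maps each partition context to an output path. A minimal sketch is shown below; ctx.in_part_idx follows the default template documented above, while the SimpleNamespace stub standing in for KeyedPartitionContext is purely illustrative.

```python
from pathlib import Path
from types import SimpleNamespace


def file_path(ctx) -> Path:
    # Flat layout: one file per partition, named by its partition
    # index (ctx.in_part_idx, as used by the default template).
    return Path(f"part-{ctx.in_part_idx:05}.csv")


# Illustrative stub standing in for KeyedPartitionContext:
ctx = SimpleNamespace(in_part_idx=7)
print(file_path(ctx))  # part-00007.csv
```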

by

The expressions to partition by.

include_key : bool

Whether to include the key columns in the output files.

Examples

Split a parquet file by a column year into CSV files:

>>> pl.scan_parquet("/path/to/file.parquet").sink_csv(
...     pl.PartitionParted("./out", by="year"),
...     mkdir=True,
... )  
__init__(
base_path: str | Path,
*,
file_path: Callable[[KeyedPartitionContext], Path | str | IO[bytes] | IO[str]] | None = None,
by: str | Expr | Sequence[str | Expr] | Mapping[str, Expr],
include_key: bool = True,
) → None[source]#

Methods

__init__(base_path, *, by[, file_path, include_key])