polars.PartitionParted#

class polars.PartitionParted(
base_path: str | Path,
*,
file_path: Callable[[KeyedPartitionContext], Path | str | IO[bytes] | IO[str]] | None = None,
by: str | Expr | Sequence[str | Expr] | Mapping[str, Expr],
include_key: bool = True,
)[source]#

Partitioning scheme to split parted dataframes.

This is a specialized version of PartitionByKey. Whereas PartitionByKey accepts data in any order, this scheme expects the input data to be pre-grouped or pre-sorted. This scheme incurs far less overhead than PartitionByKey, but may not always be applicable.

Each new value of the key expressions starts a new partition; therefore, repeating the same value multiple times may overwrite previous partitions.
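The parted semantics above can be illustrated with plain Python using itertools.groupby (an illustrative sketch, not Polars internals): each run of consecutive equal key values becomes one partition, so a key value that reappears after a different value starts a fresh partition under the same name, overwriting the earlier one.

```python
from itertools import groupby

# Sketch of parted semantics: each consecutive run of equal key
# values forms one partition. Note "a" reappears after "b".
rows = [("a", 1), ("a", 2), ("b", 3), ("a", 4)]

partitions = {}
for key, group in groupby(rows, key=lambda r: r[0]):
    # A repeated key writes to the same partition name,
    # overwriting what the earlier run produced.
    partitions[key] = list(group)

# Partition "a" now holds only the last run [("a", 4)];
# the first run [("a", 1), ("a", 2)] was overwritten.
print(partitions)
```

Sorting or grouping the input by the key expressions beforehand ensures each key value occurs in exactly one run, avoiding such overwrites.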

Warning

This functionality is currently considered unstable. It may be changed at any point without it being considered a breaking change.

Parameters:
base_path

The base path for the output files.

Use the mkdir option on the sink_* methods to ensure directories in the path are created.

file_path

A callback to register or modify the output path for each partition, relative to the base_path. The callback receives a polars.io.partition.KeyedPartitionContext that contains information about the partition.

If no callback is given, it defaults to {ctx.keys.hive_dirs()}/{ctx.in_part_idx}.{EXT}.
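A custom file_path callback maps each partition context to an output path. A minimal sketch is shown below; ctx.in_part_idx follows the default template documented above, while the SimpleNamespace stub standing in for KeyedPartitionContext is purely illustrative.

```python
from pathlib import Path
from types import SimpleNamespace


def file_path(ctx) -> Path:
    # Flat layout: one file per partition, named by its partition
    # index (ctx.in_part_idx, as used by the default template).
    return Path(f"part-{ctx.in_part_idx:05}.csv")


# Illustrative stub standing in for KeyedPartitionContext:
ctx = SimpleNamespace(in_part_idx=7)
print(file_path(ctx))  # part-00007.csv
```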

by

The expressions to partition by.

include_key : bool

Whether to include the key columns in the output files.

Examples

Split a parquet file by a column year into CSV files:

>>> pl.scan_parquet("/path/to/file.parquet").sink_csv(
...     pl.PartitionParted("./out", by="year"),
...     mkdir=True,
... )  
__init__(
base_path: str | Path,
*,
file_path: Callable[[KeyedPartitionContext], Path | str | IO[bytes] | IO[str]] | None = None,
by: str | Expr | Sequence[str | Expr] | Mapping[str, Expr],
include_key: bool = True,
) → None[source]#

Methods

__init__(base_path, *, by[, file_path, include_key])