polars.PartitionByKey

class polars.PartitionByKey(
base_path: str | Path,
*,
file_path: Callable[[KeyedPartitionContext], Path | str | IO[bytes] | IO[str]] | None = None,
by: str | Expr | Sequence[str | Expr] | Mapping[str, Expr],
include_key: bool = True,
)

Partitioning scheme to write files split by the values of keys.

This partitioning scheme writes an arbitrary number of files, splitting the data according to the values of the key expressions.
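To make the splitting concrete, here is a plain-Python sketch of the semantics only; it is not how Polars implements the sink. The helper `split_by_key` is hypothetical and exists purely for illustration. It groups rows by the values of the key columns and optionally drops those columns, mirroring what `by` and `include_key` control.

```python
from collections import defaultdict


def split_by_key(rows, by, include_key=True):
    """Group row dicts by the values of the `by` columns (illustrative helper)."""
    parts = defaultdict(list)
    for row in rows:
        # One partition per distinct combination of key values.
        key = tuple(row[col] for col in by)
        if include_key:
            out = dict(row)
        else:
            # Drop the key columns from the rows written to the partition.
            out = {k: v for k, v in row.items() if k not in by}
        parts[key].append(out)
    return dict(parts)


rows = [
    {"a": 1, "b": 5, "c": "A"},
    {"a": 2, "b": 7, "c": "B"},
    {"a": 1, "b": 5, "c": "C"},
]

with_keys = split_by_key(rows, by=["a", "b"])
without_keys = split_by_key(rows, by=["a", "b"], include_key=False)
```

Each dictionary entry corresponds to one partition, which the real sink would write to its own file or files.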

The number of files that can be written is not limited. However, when writing beyond a certain number of files, the data for the remaining partitions is buffered before being written to file.

Warning

This functionality is currently considered unstable. It may be changed at any point without it being considered a breaking change.

Parameters:
base_path

The base path for the output files.

Use the mkdir option on the sink_* methods to ensure directories in the path are created.

file_path

A callback to register or modify the output path for each partition, relative to base_path. The callback receives a polars.io.partition.KeyedPartitionContext that contains information about the partition.

If no callback is given, it defaults to {ctx.keys.hive_dirs()}/{ctx.in_part_idx}.{EXT}.

by

The expressions to partition by.

include_key : bool

Whether to include the key columns in the output files.
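As a rough sketch of what a file_path callback might look like, the function below builds a hive-style relative path from a partition context. It assumes that each entry of `ctx.keys` exposes `name` and `str_value` attributes and that the context has an `in_part_idx` attribute; `str_value` and `in_part_idx` appear in this page, while `name` is an assumption. The function name `hive_style_path` and the stand-in context built with `SimpleNamespace` are hypothetical and only serve to exercise the function without writing any files.

```python
from types import SimpleNamespace


def hive_style_path(ctx, ext="parquet"):
    """Return a relative path such as 'a=1/b=5/0.parquet' for one partition."""
    # One 'name=value' directory per key, in key order.
    # Note: this sketch does not escape special characters in the values.
    dirs = "/".join(f"{key.name}={key.str_value}" for key in ctx.keys)
    return f"{dirs}/{ctx.in_part_idx}.{ext}"


# Stand-in for a KeyedPartitionContext (assumed attribute names).
fake_ctx = SimpleNamespace(
    keys=[
        SimpleNamespace(name="a", str_value="1"),
        SimpleNamespace(name="b", str_value="5"),
    ],
    in_part_idx=0,
)

example_path = hive_style_path(fake_ctx)
```

Under these assumptions, such a function could be passed as `file_path=hive_style_path`, or wrapped in a lambda to choose a different extension, for example `file_path=lambda ctx: hive_style_path(ctx, ext="csv")`.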

Examples

Split into a hive-partitioning style partition:

>>> (
...     pl.DataFrame({"a": [1, 2, 3], "b": [5, 7, 9], "c": ["A", "B", "C"]})
...     .lazy()
...     .sink_parquet(
...         pl.PartitionByKey(
...             "./out",
...             by=[pl.col.a, pl.col.b],
...             include_key=False,
...         ),
...         mkdir=True,
...     )
... )  

Split a parquet file by a column year into CSV files:

>>> pl.scan_parquet("/path/to/file.parquet").sink_csv(
...     pl.PartitionByKey(
...         "./out/",
...         file_path=lambda ctx: f"year={ctx.keys[0].str_value}.csv",
...         by="year",
...     ),
... )

__init__(
base_path: str | Path,
*,
file_path: Callable[[KeyedPartitionContext], Path | str | IO[bytes] | IO[str]] | None = None,
by: str | Expr | Sequence[str | Expr] | Mapping[str, Expr],
include_key: bool = True,
) -> None

Methods

__init__(base_path, *[, file_path, include_key])