polars.PartitionByKey
- class polars.PartitionByKey(
- base_path: str | Path,
- *,
- file_path: Callable[[KeyedPartitionContext], Path | str | IO[bytes] | IO[str]] | None = None,
- by: str | Expr | Sequence[str | Expr] | Mapping[str, Expr],
- include_key: bool = True,
- )
Partitioning scheme to write files split by the values of keys.
This partitioning scheme writes an arbitrary number of files, splitting the data according to the values of the key expressions.
The number of files that can be written is not limited. However, beyond a certain number of files, the data for the remaining partitions is buffered before being written to file.
Warning
This functionality is currently considered unstable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- base_path
The base path for the output files.
Use the mkdir option on the sink_* methods to ensure directories in the path are created.
- file_path
A callback to register or modify the output path for each partition relative to the base_path. The callback is passed a polars.io.partition.KeyedPartitionContext that contains information about the partition.
If no callback is given, it defaults to {ctx.keys.hive_dirs()}/{ctx.in_part_idx}.{EXT}.
- by
The expressions to partition by.
- include_key: bool
Whether to include the key columns in the output files.
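As a rough illustration of the default layout {ctx.keys.hive_dirs()}/{ctx.in_part_idx}.{EXT}, the sketch below composes a hive-style relative path from key values in plain Python. The helper hive_relative_path is hypothetical and not part of Polars; it assumes one name=value directory per key, and the real implementation may format or escape values differently.

```python
from pathlib import PurePosixPath


def hive_relative_path(keys: dict[str, object], in_part_idx: int, ext: str) -> PurePosixPath:
    """Hypothetical helper (not a Polars API): mimic the documented default
    `{ctx.keys.hive_dirs()}/{ctx.in_part_idx}.{EXT}` layout."""
    # Assumption: one `name=value` directory per key, in key order.
    hive_dirs = PurePosixPath(*[f"{name}={value}" for name, value in keys.items()])
    # The file name is the index of the file within the partition plus the extension.
    return hive_dirs / f"{in_part_idx}.{ext}"


# Partition with keys a=1, b=5, first file in the partition, parquet output.
print(hive_relative_path({"a": 1, "b": 5}, 0, "parquet"))  # a=1/b=5/0.parquet
```

Under these assumptions, the resulting path is interpreted relative to base_path, so a base_path of "./out" would place this file under ./out/a=1/b=5/.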
Examples
Split into a hive-partitioning style partition:
>>> (
...     pl.DataFrame({"a": [1, 2, 3], "b": [5, 7, 9], "c": ["A", "B", "C"]})
...     .lazy()
...     .sink_parquet(
...         pl.PartitionByKey(
...             "./out",
...             by=[pl.col.a, pl.col.b],
...             include_key=False,
...         ),
...         mkdir=True,
...     )
... )
Split a parquet file by a column year into CSV files:
>>> pl.scan_parquet("/path/to/file.parquet").sink_csv(
...     pl.PartitionByKey(
...         "./out/",
...         file_path=lambda ctx: f"year={ctx.keys[0].str_value}.csv",
...         by="year",
...     ),
... )
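To see what a file_path callback like the one in the CSV example returns, it can be called by hand with a stand-in context. The sketch below uses types.SimpleNamespace as a hypothetical substitute for KeyedPartitionContext, modelling only the keys[0].str_value access the callback relies on; the real context object is constructed by Polars and carries more information.

```python
from types import SimpleNamespace


def year_csv_path(ctx) -> str:
    # Same logic as the lambda in the CSV example: name the file after the first key.
    return f"year={ctx.keys[0].str_value}.csv"


# Hypothetical stand-in for the context Polars would pass to the callback.
# Only the attributes used by the callback are modelled here.
fake_ctx = SimpleNamespace(keys=[SimpleNamespace(str_value="2024")])

relative_path = year_csv_path(fake_ctx)
print(relative_path)  # year=2024.csv
```

Because the returned path is relative to base_path, the example's "./out/" base would yield one CSV file per distinct year directly inside ./out/.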
- __init__(
- base_path: str | Path,
- *,
- file_path: Callable[[KeyedPartitionContext], Path | str | IO[bytes] | IO[str]] | None = None,
- by: str | Expr | Sequence[str | Expr] | Mapping[str, Expr],
- include_key: bool = True,
- )
Methods
__init__(base_path, *[, file_path, include_key])