polars.PartitionMaxSize#

class polars.PartitionMaxSize(
base_path: str | Path,
*,
file_path: Callable[[BasePartitionContext], Path | str | IO[bytes] | IO[str]] | None = None,
max_size: int,
per_partition_sort_by: str | Expr | Iterable[str | Expr] | None = None,
finish_callback: Callable[[DataFrame], None] | None = None,
)[source]#

Partitioning scheme to write files with a maximum size.

This partitioning scheme writes files up to a given maximum size. When a file reaches the maximum size, it is closed and a new file is opened.

Warning

This functionality is currently considered unstable. It may be changed at any point without it being considered a breaking change.

Parameters:
base_path

The base path for the output files.

file_path

A callback to register or modify the output path for each partition, relative to the base_path. The callback receives a polars.io.partition.BasePartitionContext that contains information about the partition.

If no callback is given, it defaults to {ctx.file_idx}.{EXT}.

max_size

The maximum size, in rows, of each generated file.

per_partition_sort_by

Columns or expressions to sort over within each partition.

Note that this might increase the memory consumption needed for each partition.

finish_callback

A callback that gets called when the query finishes successfully.

For parquet files, the callback is given a DataFrame with metrics about all written files.

Examples

Split a parquet file into smaller CSV files with 100,000 rows each:

>>> pl.scan_parquet("/path/to/file.parquet").sink_csv(
...     pl.PartitionMaxSize("./out", max_size=100_000),
... )  
__init__(
base_path: str | Path,
*,
file_path: Callable[[BasePartitionContext], Path | str | IO[bytes] | IO[str]] | None = None,
max_size: int,
per_partition_sort_by: str | Expr | Iterable[str | Expr] | None = None,
finish_callback: Callable[[DataFrame], None] | None = None,
) → None[source]#

Methods

__init__(base_path, *[, file_path, ...])