polars.LazyFrame.sink_batches

LazyFrame.sink_batches(
    function: Callable[[DataFrame], bool | None],
    *,
    chunk_size: int | None = None,
    maintain_order: bool = True,
    lazy: bool = False,
    engine: EngineType = 'auto',
    optimizations: QueryOptFlags = (),
) → LazyFrame | None

Evaluate the query and call a user-defined function for every ready batch.

This allows streaming results that are larger than RAM in certain cases.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Warning

This method is much slower than native sinks. Only use it if you cannot implement your logic otherwise.

Parameters:
function

Function called with each batch as soon as it is ready. If the function returns True, this signals that no more results are needed, allowing for early stopping (see the sketch under Examples below).

chunk_size

The number of rows that are buffered before the callback is called.

maintain_order

Maintain the order in which data is processed. Setting this to False will be slightly faster.

lazy: bool

Wait to start execution until collect is called.

engine

Select the engine used to process the query (optional). At the moment, "auto" (the default) runs the query with the Polars streaming engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable; if the query cannot be run with the selected engine, it falls back to the Polars streaming engine.

optimizations

The optimization passes done during query optimization.

This has no effect if lazy is set to True.

Examples

>>> import polars as pl
>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")
>>> lf.sink_batches(lambda df: print(df))
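
A sketch of the early-stopping and chunking behaviour described under the function and chunk_size parameters: the callback counts rows and returns True once enough have been seen, after which no further batches are requested. The handle_batch helper, the counter dict, the 1,000,000-row threshold, and the 500,000-row chunk size are illustrative choices, not part of the API.

>>> import polars as pl
>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")
>>> seen = {"rows": 0}
>>> def handle_batch(df: pl.DataFrame) -> bool:
...     # Process the batch here (write it out, aggregate it, ...).
...     seen["rows"] += df.height
...     # Returning True signals that no more batches are needed.
...     return seen["rows"] >= 1_000_000
>>> lf.sink_batches(handle_batch, chunk_size=500_000)

With lazy=True, the sink is only registered and execution is deferred until collect is called on the returned LazyFrame; a sketch, assuming the usual lazy-sink pattern:

>>> deferred = lf.sink_batches(handle_batch, lazy=True)
>>> deferred.collect()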