polars.LazyFrame.collect_batches

LazyFrame.collect_batches(
*,
chunk_size: int | None = None,
maintain_order: bool = True,
lazy: bool = False,
engine: EngineType = 'auto',
optimizations: QueryOptFlags = (),
) → Iterator[DataFrame]

Evaluate the query in streaming mode and get a generator that returns chunks.

This allows streaming results that are larger than RAM to be written to disk.

The query is always executed to completion unless the returned generator is closed early, so you should keep calling next() until all chunks have been seen.
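The early-stop behavior above follows ordinary Python generator semantics; a plain-Python sketch (no Polars involved, purely illustrative) of what closing the generator early means:

```python
# Plain-Python analogy: closing a generator stops production early,
# similar to stopping collect_batches before all chunks are seen.
produced = []

def batches():
    for i in range(5):
        produced.append(i)  # record which chunks were actually computed
        yield i

gen = batches()
first = next(gen)  # pull one chunk
gen.close()        # stop early; the remaining chunks are never produced
print(produced)
```

Here only the first chunk is ever computed; the remaining iterations of the loop never run.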

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Warning

This method is much slower than native sinks. Only use it if you cannot implement your logic otherwise.

Parameters:
chunk_size

The number of rows buffered before a chunk is yielded.

maintain_order

Maintain the order in which data is processed. Setting this to False will be slightly faster.

lazy

Defer starting the query until the first batch is requested.

engine

Select the engine used to process the query (default "auto"):

  • "auto": use the engine set by Config.set_engine_affinity or the POLARS_ENGINE_AFFINITY environment variable, falling back to "streaming" if unset.

  • "in-memory": use the in-memory engine.

  • "streaming": use the streaming engine, which processes queries in batches, reducing memory pressure and often outperforming the in-memory engine. This will soon become the default engine of Polars.

  • "gpu": use the CUDA GPU engine (requires an Nvidia GPU and cudf-polars). Pass a GPUEngine object for fine-grained control.

If the selected engine cannot run the query, Polars falls back to the streaming engine.

optimizations

The optimization passes done during query optimization.

Examples

>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")  
>>> for df in lf.collect_batches():
...     print(df)