polars.LazyFrame.collect

LazyFrame.collect(
    *,
    type_coercion: bool = True,
    predicate_pushdown: bool = True,
    projection_pushdown: bool = True,
    simplify_expression: bool = True,
    slice_pushdown: bool = True,
    comm_subplan_elim: bool = True,
    comm_subexpr_elim: bool = True,
    cluster_with_columns: bool = True,
    collapse_joins: bool = True,
    no_optimization: bool = False,
    engine: EngineType = 'auto',
    background: bool = False,
    optimizations: QueryOptFlags = (),
    **_kwargs: Any,
) → DataFrame | InProcessQuery

Materialize this LazyFrame into a DataFrame.

By default, all query optimizations are enabled. Individual optimizations may be disabled by setting the corresponding parameter to False.

Parameters:

type_coercion: Do type coercion optimization.

Deprecated since version 1.30.0: Use the optimizations parameter instead.
predicate_pushdown: Do predicate pushdown optimization.

Deprecated since version 1.30.0: Use the optimizations parameter instead.
projection_pushdown: Do projection pushdown optimization.

Deprecated since version 1.30.0: Use the optimizations parameter instead.
simplify_expression: Run the simplify expressions optimization.

Deprecated since version 1.30.0: Use the optimizations parameter instead.
slice_pushdown: Do slice pushdown optimization.

Deprecated since version 1.30.0: Use the optimizations parameter instead.
comm_subplan_elim: Try to cache branching subplans that occur on self-joins or unions.

Deprecated since version 1.30.0: Use the optimizations parameter instead.
comm_subexpr_elim: Cache and reuse common subexpressions.

Deprecated since version 1.30.0: Use the optimizations parameter instead.
cluster_with_columns: Combine sequential independent calls to with_columns.

Deprecated since version 1.30.0: Use the optimizations parameter instead.
collapse_joins: Collapse a join and filters into a faster join.

Deprecated since version 1.30.0: Use the optimizations parameter instead.
no_optimization: Turn off (certain) optimizations.

Deprecated since version 1.30.0: Use the optimizations parameter instead.
engine: Select the engine used to process the query (optional). With "auto" (the default), the query currently runs on the Polars in-memory engine unless the POLARS_ENGINE_AFFINITY environment variable names a different engine. In that case Polars attempts the named engine first and falls back to the in-memory engine if it cannot run the query. With "streaming", the streaming engine is used. With "gpu", the GPU engine is used; for fine-grained control, such as choosing the device on a system with multiple devices, pass a GPUEngine object with configuration options.

Note

GPU mode is considered unstable. Not all queries will run successfully on the GPU; however, unsupported queries should fall back transparently to the default engine.

Running with POLARS_VERBOSE=1 will provide information if a query falls back (and why).

Note

The GPU engine does not support streaming or running in the background. If either is enabled, GPU execution is switched off.
background: Run the query in the background and get a handle to the query. This handle can be used to fetch the result or cancel the query.

Warning

Background mode is considered unstable. It may be changed at any point without it being considered a breaking change.
optimizations: The optimization passes to run during query optimization, given as a QueryOptFlags object.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Returns:

DataFrame, or an InProcessQuery handle if background=True.

See also

explain: Print the query plan that is evaluated with collect.
profile: Collect the LazyFrame and time each node in the computation graph.
polars.collect_all: Collect multiple LazyFrames at the same time.
polars.Config.set_streaming_chunk_size: Set the size of streaming batches.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a").agg(pl.all().sum()).collect()  
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 4   ┆ 10  │
│ b   ┆ 11  ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘

Collect in streaming mode

>>> lf.group_by("a").agg(pl.all().sum()).collect(
...     engine="streaming"
... )  
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 4   ┆ 10  │
│ b   ┆ 11  ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘

Collect in GPU mode

>>> lf.group_by("a").agg(pl.all().sum()).collect(engine="gpu")  
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ b   ┆ 11  ┆ 10  │
│ a   ┆ 4   ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘

With control over the device used

>>> lf.group_by("a").agg(pl.all().sum()).collect(
...     engine=pl.GPUEngine(device=1)
... )  
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ b   ┆ 11  ┆ 10  │
│ a   ┆ 4   ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘