polars.lazyframe.groupby.LazyGroupBy.apply

LazyGroupBy.apply(
    function: Callable[[DataFrame], DataFrame],
    schema: SchemaDict | None,
) → LazyFrame

Apply a custom/user-defined function (UDF) over the groups, passing each group to the function as a new DataFrame.

Warning

This method is much slower than the native expressions API. Only use it if you cannot implement your logic otherwise.

Using this method is considered an anti-pattern. It is very slow because:

  • it forces the engine to materialize a whole DataFrame for each group

  • it is not parallelized

  • it blocks optimizations, as the passed Python function is opaque to the optimizer

The idiomatic way to apply custom functions over multiple columns is using:

pl.struct([my_columns]).apply(lambda struct_series: ..)

Parameters:
function

Function to apply over each group of the LazyFrame.

schema

Schema of the function’s output. This has to be known statically. If the given schema is incorrect, this is a bug in the caller’s query and may lead to errors. If set to None, polars assumes the schema is unchanged.

Examples

>>> df = pl.DataFrame(
...     {
...         "id": [0, 1, 2, 3, 4],
...         "color": ["red", "green", "green", "red", "red"],
...         "shape": ["square", "triangle", "square", "triangle", "square"],
...     }
... )
>>> df
shape: (5, 3)
┌─────┬───────┬──────────┐
│ id  ┆ color ┆ shape    │
│ --- ┆ ---   ┆ ---      │
│ i64 ┆ str   ┆ str      │
╞═════╪═══════╪══════════╡
│ 0   ┆ red   ┆ square   │
│ 1   ┆ green ┆ triangle │
│ 2   ┆ green ┆ square   │
│ 3   ┆ red   ┆ triangle │
│ 4   ┆ red   ┆ square   │
└─────┴───────┴──────────┘

For each color group, sample two rows. The sampling is random, so the rows returned vary between runs:

>>> (
...     df.lazy()
...     .groupby("color")
...     .apply(lambda group_df: group_df.sample(2), schema=None)
...     .collect()
... )
shape: (4, 3)
┌─────┬───────┬──────────┐
│ id  ┆ color ┆ shape    │
│ --- ┆ ---   ┆ ---      │
│ i64 ┆ str   ┆ str      │
╞═════╪═══════╪══════════╡
│ 1   ┆ green ┆ triangle │
│ 2   ┆ green ┆ square   │
│ 4   ┆ red   ┆ square   │
│ 3   ┆ red   ┆ triangle │
└─────┴───────┴──────────┘
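
As a rough mental model of the call above, the following pure-Python sketch splits the rows by key, calls the function once per group, and concatenates the results. It is not how polars implements the method; the helper name `groupby_apply` and the list-of-dicts representation are assumptions made for illustration only. It does show why the approach is slow: every group is materialized on its own and the Python function runs once per group, one after another.

```python
import random
from collections import defaultdict


def groupby_apply(rows, key, function):
    """Hypothetical helper: split rows by key, apply function per group, concatenate."""
    groups = defaultdict(list)
    for row in rows:
        # Each group is fully materialized before the function sees it.
        groups[row[key]].append(row)

    out = []
    for group_rows in groups.values():
        # One opaque Python call per group, executed serially.
        out.extend(function(group_rows))
    return out


rows = [
    {"id": 0, "color": "red", "shape": "square"},
    {"id": 1, "color": "green", "shape": "triangle"},
    {"id": 2, "color": "green", "shape": "square"},
    {"id": 3, "color": "red", "shape": "triangle"},
    {"id": 4, "color": "red", "shape": "square"},
]

rng = random.Random(0)  # seeded only to make the sketch reproducible
sampled = groupby_apply(rows, "color", lambda group: rng.sample(group, 2))
```

Here `sampled` holds four rows, two per color, and the output keeps the same columns as the input, which matches the assumption polars makes when `schema=None`.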

It is better to implement this with an expression:

>>> (
...     df.lazy()
...     .filter(pl.int_range(0, pl.count()).shuffle().over("color") < 2)
...     .collect()
... )
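
To see why the filter above keeps two rows per group, the logic of the expression can be sketched in plain Python: within each group, number the rows from 0 to the group size minus one, shuffle those numbers, and keep the rows whose number is below 2. This is an illustrative model, not polars code; the helper name `sample_per_group` and its `seed` parameter are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_group(rows, key, n, seed=None):
    """Hypothetical helper mirroring int_range(...).shuffle().over(key) < n."""
    rng = random.Random(seed)

    # Collect the row positions that belong to each group (the "over" window).
    positions = defaultdict(list)
    for i, row in enumerate(rows):
        positions[row[key]].append(i)

    keep = [False] * len(rows)
    for idx in positions.values():
        ranks = list(range(len(idx)))  # 0 .. group size - 1
        rng.shuffle(ranks)             # shuffled within the group only
        for i, rank in zip(idx, ranks):
            keep[i] = rank < n         # the filter predicate

    # Filtering preserves the original row order.
    return [row for row, k in zip(rows, keep) if k]


rows = [
    {"id": 0, "color": "red", "shape": "square"},
    {"id": 1, "color": "green", "shape": "triangle"},
    {"id": 2, "color": "green", "shape": "square"},
    {"id": 3, "color": "red", "shape": "triangle"},
    {"id": 4, "color": "red", "shape": "square"},
]

kept = sample_per_group(rows, "color", 2, seed=1)
```

Each group contributes exactly as many rows as it has shuffled numbers below `n`, so a group with fewer than `n` rows is kept whole. Unlike the group-wise apply, the rows come back in their original order because nothing is split and re-concatenated.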