polars.dataframe.group_by.GroupBy.map_groups#

GroupBy.map_groups(function: Callable[[DataFrame], DataFrame]) DataFrame[source]#

Apply a custom/user-defined function (UDF) over the groups as a sub-DataFrame.

Warning

This method is much slower than the native expressions API. Only use it if you cannot implement your logic otherwise.

Implementing logic using a Python function is almost always significantly slower and more memory intensive than implementing the same logic using the native expression API because:

  • The native expression engine runs in Rust; UDFs run in Python.

  • Use of Python UDFs forces the DataFrame to be materialized in memory.

  • Polars-native expressions can be parallelised (UDFs cannot).

  • Polars-native expressions can be logically optimised (UDFs cannot).

Wherever possible you should strongly prefer the native expression API to achieve the best performance.

Parameters:
function

Custom function that receives a DataFrame and returns a DataFrame.

Returns:
DataFrame

Examples

For each color group sample two rows:

>>> df = pl.DataFrame(
...     {
...         "id": [0, 1, 2, 3, 4],
...         "color": ["red", "green", "green", "red", "red"],
...         "shape": ["square", "triangle", "square", "triangle", "square"],
...     }
... )
>>> df.group_by("color").map_groups(
...     lambda group_df: group_df.sample(2)
... )  
shape: (4, 3)
┌─────┬───────┬──────────┐
│ id  ┆ color ┆ shape    │
│ --- ┆ ---   ┆ ---      │
│ i64 ┆ str   ┆ str      │
╞═════╪═══════╪══════════╡
│ 1   ┆ green ┆ triangle │
│ 2   ┆ green ┆ square   │
│ 4   ┆ red   ┆ square   │
│ 3   ┆ red   ┆ triangle │
└─────┴───────┴──────────┘

It is better to implement this with an expression:

>>> df.filter(
...     pl.int_range(0, pl.count()).shuffle().over("color") < 2
... )