polars.Expr.map_batches

Expr.map_batches(
function: Callable[[Series], Series | Any],
return_dtype: PolarsDataType | DataTypeExpr | None = None,
*,
agg_list: bool = False,
is_elementwise: bool = False,
returns_scalar: bool = False,
_is_ufunc: bool = False,
) → Expr

Apply a custom Python function to a whole Series or sequence of Series.

The output of this custom function is expected to be a Series, a NumPy array (which will be converted into a Series automatically), or a scalar that will be converted into a Series. If the result is a scalar and you want it to stay a scalar, pass returns_scalar=True. If you want to apply a custom function elementwise over single values, see map_elements(). A reasonable use case for map functions is transforming the values represented by an expression using a third-party library.
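
For instance (a minimal sketch; the column name and the transformation are illustrative), a function that depends on the whole column, such as dividing every value by the column sum, needs the full Series at once and is a natural fit for map_batches:

>>> df = pl.DataFrame({"x": [1.0, 3.0, 4.0]})
>>> df.select(
...     pl.col("x").map_batches(lambda s: s / s.sum(), return_dtype=pl.Float64)
... )
shape: (3, 1)
┌───────┐
│ x     │
│ ---   │
│ f64   │
╞═══════╡
│ 0.125 │
│ 0.375 │
│ 0.5   │
└───────┘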

Parameters:
function

Lambda/function to apply.

return_dtype

Dtype of the output Series. If not set, the dtype will be inferred based on the first non-null value that is returned by the function.

agg_list

First implode when in a group-by aggregation.

Deprecated since version 1.32.0: Use expr.implode().map_batches(..) instead; a sketch of this pattern follows the parameter list.

is_elementwise

Set to true if the operation is elementwise, for better performance and optimization.

An elementwise operation has unit or equal length for all inputs and can be run sequentially on slices without affecting the results (see the final example below).

returns_scalar

If the function returns a scalar, by default it will be wrapped in a list in the output, since the assumption is that the function always returns something Series-like. If you want to keep the result as a scalar, set this argument to True.
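
As a sketch of the implode().map_batches(..) pattern recommended above for replacing agg_list (shown in a plain select rather than a group-by for brevity; the column name, values, and list aggregation are illustrative):

>>> df = pl.DataFrame({"b": [1, 2, 3]})
>>> df.select(
...     pl.col("b").implode().map_batches(
...         lambda s: s.list.sum(), return_dtype=pl.Int64
...     )
... )
shape: (1, 1)
┌─────┐
│ b   │
│ --- │
│ i64 │
╞═════╡
│ 6   │
└─────┘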

Warning

If return_dtype is not provided, this may lead to unexpected results. This is allowed, but it is considered a bug in the user's query. In the future, this will raise an error in lazy queries.

Notes

A UDF passed to map_batches must be pure, meaning that it cannot modify or depend on state other than its arguments.

Examples

>>> df = pl.DataFrame(
...     {
...         "sine": [0.0, 1.0, 0.0, -1.0],
...         "cosine": [1.0, 0.0, -1.0, 0.0],
...     }
... )
>>> df.select(
...     pl.all().map_batches(
...         lambda x: x.to_numpy().argmax(),
...         return_dtype=pl.Int64,
...         returns_scalar=True,
...     )
... )
shape: (1, 2)
┌──────┬────────┐
│ sine ┆ cosine │
│ ---  ┆ ---    │
│ i64  ┆ i64    │
╞══════╪════════╡
│ 1    ┆ 0      │
└──────┴────────┘

Here’s an example of a function that returns a scalar, where we want it to stay as a scalar (the order of the output rows may vary, since group_by does not maintain order by default):

>>> df = pl.DataFrame(
...     {
...         "a": [0, 1, 0, 1],
...         "b": [1, 2, 3, 4],
...     }
... )
>>> df.group_by("a").agg(
...     pl.col("b").map_batches(
...         lambda x: x.max(), returns_scalar=True, return_dtype=pl.self_dtype()
...     )
... )  
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 0   ┆ 3   │
└─────┴─────┘

Call a function that takes multiple arguments by creating a struct and referencing its fields inside the function call.

>>> import numpy as np
>>> df = pl.DataFrame(
...     {
...         "a": [5, 1, 0, 3],
...         "b": [4, 2, 3, 4],
...     }
... )
>>> df.with_columns(
...     a_times_b=pl.struct("a", "b").map_batches(
...         lambda x: np.multiply(x.struct.field("a"), x.struct.field("b")),
...         return_dtype=pl.Int64,
...     )
... )
shape: (4, 3)
┌─────┬─────┬───────────┐
│ a   ┆ b   ┆ a_times_b │
│ --- ┆ --- ┆ ---       │
│ i64 ┆ i64 ┆ i64       │
╞═════╪═════╪═══════════╡
│ 5   ┆ 4   ┆ 20        │
│ 1   ┆ 2   ┆ 2         │
│ 0   ┆ 3   ┆ 0         │
│ 3   ┆ 4   ┆ 12        │
└─────┴─────┴───────────┘
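
The is_elementwise flag described above marks a UDF as operating value by value, which allows the engine to run it sequentially on slices. As a sketch (the column name is illustrative; np.log2 is a NumPy ufunc, which is elementwise by nature):

>>> import numpy as np
>>> df = pl.DataFrame({"values": [1.0, 2.0, 4.0]})
>>> df.lazy().select(
...     pl.col("values").map_batches(
...         lambda s: np.log2(s.to_numpy()),
...         return_dtype=pl.Float64,
...         is_elementwise=True,
...     )
... ).collect()
shape: (3, 1)
┌────────┐
│ values │
│ ---    │
│ f64    │
╞════════╡
│ 0.0    │
│ 1.0    │
│ 2.0    │
└────────┘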