polars.Expr.apply#

Expr.apply(
function: Callable[[Series], Series] | Callable[[Any], Any],
return_dtype: PolarsDataType | None = None,
*,
skip_nulls: bool = True,
pass_name: bool = False,
strategy: ApplyStrategy = 'thread_local',
) Self[source]#

Apply a custom/user-defined function (UDF) in a GroupBy or Projection context.

Warning

This method is much slower than the native expressions API. Only use it if you cannot implement your logic otherwise.

Depending on the context, this method behaves as follows:

  • Selection

    Expects function to be of type Callable[[Any], Any]. Applies a python function over each individual value in the column.

  • GroupBy

    Expects function to be of type Callable[[Series], Series]. Applies a python function over each group.

Parameters:
function

Lambda or function to apply.

return_dtype

Dtype of the output Series. If not set, the dtype will be polars.Unknown.

skip_nulls

Don’t apply the function over values that contain nulls. This is faster.

pass_name

Pass the Series name to the custom function. This is more expensive.

strategy : {‘thread_local’, ‘threading’}

This functionality is in alpha stage. It may be removed or changed without being considered a breaking change.

  • ‘thread_local’: run the python function on a single thread.

  • ‘threading’: run the python function on separate threads. Use with care as this can slow performance. This might only speed up your code if the amount of work per element is significant and the python function releases the GIL (e.g. by calling a C function).

Warning

If return_dtype is not provided, this may lead to unexpected results. We allow this, but it is considered a bug in the user’s query.

Notes

  • Using apply is strongly discouraged as you will be effectively running python “for” loops. This will be very slow. Wherever possible you should strongly prefer the native expression API to achieve the best performance.

  • If your function is expensive and you don’t want it to be called more than once for a given input, consider applying an @lru_cache decorator to it. With suitable data you may achieve order-of-magnitude speedups (or more).
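The caching note above can be sketched with the standard library alone; expensive_transform below is a hypothetical stand-in for a costly UDF:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def expensive_transform(x: int) -> int:
    # Stand-in for an expensive computation; with lru_cache, each
    # distinct input is computed only once and repeats are cache hits.
    return x * 2


# Simulate applying the UDF over a column with repeated values.
results = [expensive_transform(v) for v in [1, 2, 3, 1, 2, 1]]

# Only 3 distinct inputs were actually computed; the other 3 calls hit the cache.
info = expensive_transform.cache_info()
```

In a Polars query the decorated function would simply be passed to apply, e.g. pl.col("a").apply(expensive_transform). Note that lru_cache requires hashable inputs, so this works for element-wise (selection-context) UDFs rather than group-wise Series inputs.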

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 1],
...         "b": ["a", "b", "c", "c"],
...     }
... )

In a selection context, the function is applied by row.

>>> df.with_columns(  
...     pl.col("a").apply(lambda x: x * 2).alias("a_times_2"),
... )
shape: (4, 3)
┌─────┬─────┬───────────┐
│ a   ┆ b   ┆ a_times_2 │
│ --- ┆ --- ┆ ---       │
│ i64 ┆ str ┆ i64       │
╞═════╪═════╪═══════════╡
│ 1   ┆ a   ┆ 2         │
│ 2   ┆ b   ┆ 4         │
│ 3   ┆ c   ┆ 6         │
│ 1   ┆ c   ┆ 2         │
└─────┴─────┴───────────┘

It is better to implement this with an expression:

>>> df.with_columns(
...     (pl.col("a") * 2).alias("a_times_2"),
... )  

In a GroupBy context the function is applied by group:

>>> df.lazy().groupby("b", maintain_order=True).agg(
...     pl.col("a").apply(lambda x: x.sum())
... ).collect()
shape: (3, 2)
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 1   │
│ b   ┆ 2   │
│ c   ┆ 4   │
└─────┴─────┘

It is better to implement this with an expression:

>>> df.groupby("b", maintain_order=True).agg(
...     pl.col("a").sum(),
... )