polars.Expr.apply
Expr.apply(
    function: Callable[[Series], Series] | Callable[[Any], Any],
    return_dtype: PolarsDataType | None = None,
    *,
    skip_nulls: bool = True,
    pass_name: bool = False,
    strategy: ApplyStrategy = 'thread_local',
)
Apply a custom/user-defined function (UDF) in a GroupBy or Projection context.
Warning
This method is much slower than the native expressions API. Only use it if you cannot implement your logic otherwise.
Depending on the context, it has the following behavior:
- Selection
  Expects function to be of type Callable[[Any], Any]. Applies a Python function over each individual value in the column.
- GroupBy
  Expects function to be of type Callable[[Series], Series]. Applies a Python function over each group.
Parameters:
- function
  Lambda/function to apply.
- return_dtype
  Dtype of the output Series. If not set, the dtype will be polars.Unknown.
- skip_nulls
  Don't apply the function over values that contain nulls (this is faster).
- pass_name
  Pass the Series name to the custom function (this is more expensive).
- strategy : {'thread_local', 'threading'}
  This functionality is in alpha stage. It may be removed/changed without being considered a breaking change.
  - 'thread_local': run the Python function on a single thread.
  - 'threading': run the Python function on separate threads. Use with care as this can slow performance. This might only speed up your code if the amount of work per element is significant and the Python function releases the GIL (e.g. via calling a C function).
Warning
If return_dtype is not provided, this may lead to unexpected results. We allow this, but it is considered a bug in the user's query.

Notes
Using apply is strongly discouraged, as you will effectively be running Python "for" loops. This will be very slow. Wherever possible you should strongly prefer the native expression API to achieve the best performance.

If your function is expensive and you don't want it to be called more than once for a given input, consider applying an @lru_cache decorator to it. With suitable data you may achieve order-of-magnitude speedups (or more).
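The caching tip can be sketched with the standard library alone; the `expensive` function below is a hypothetical stand-in for a costly UDF you would pass to apply:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def expensive(x: int) -> int:
    # stand-in for a costly per-value computation
    return x * 2

# the repeated input 1 hits the cache instead of recomputing
results = [expensive(v) for v in [1, 2, 3, 1]]
```

With the inputs `[1, 2, 3, 1]` from the examples below, the second `1` is served from the cache, so the function body runs only three times for four values.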
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 1],
...         "b": ["a", "b", "c", "c"],
...     }
... )
In a selection context, the function is applied by row.
>>> df.with_columns(
...     pl.col("a").apply(lambda x: x * 2).alias("a_times_2"),
... )
shape: (4, 3)
┌─────┬─────┬───────────┐
│ a   ┆ b   ┆ a_times_2 │
│ --- ┆ --- ┆ ---       │
│ i64 ┆ str ┆ i64       │
╞═════╪═════╪═══════════╡
│ 1   ┆ a   ┆ 2         │
│ 2   ┆ b   ┆ 4         │
│ 3   ┆ c   ┆ 6         │
│ 1   ┆ c   ┆ 2         │
└─────┴─────┴───────────┘
It is better to implement this with an expression:
>>> df.with_columns(
...     (pl.col("a") * 2).alias("a_times_2"),
... )
In a GroupBy context, the function is applied by group:
>>> df.lazy().groupby("b", maintain_order=True).agg(
...     pl.col("a").apply(lambda x: x.sum())
... ).collect()
shape: (3, 2)
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 1   │
│ b   ┆ 2   │
│ c   ┆ 4   │
└─────┴─────┘
It is better to implement this with an expression:
>>> df.groupby("b", maintain_order=True).agg(
...     pl.col("a").sum(),
... )