polars.Expr.map_elements
Expr.map_elements(
    function: Callable[[Any], Any],
    return_dtype: PolarsDataType | None = None,
    *,
    skip_nulls: bool = True,
    pass_name: bool = False,
    strategy: MapElementsStrategy = 'thread_local',
    returns_scalar: bool = False,
) -> Expr
Map a custom/user-defined function (UDF) to each element of a column.
Warning
This method is much slower than the native expressions API. Only use it if you cannot implement your logic otherwise.
Suppose that the function is: x ↦ sqrt(x):
- For mapping elements of a series, consider: pl.col("col_name").sqrt().
- For mapping inner elements of lists, consider: pl.col("col_name").list.eval(pl.element().sqrt()).
- For mapping elements of struct fields, consider: pl.col("col_name").struct.field("field_name").sqrt().
If you want to replace the original column or field, consider .with_columns and .with_fields.
The UDF is applied to each element of a column. Note that, in a GroupBy context, the column will have been pre-aggregated and so each element will itself be a Series. Therefore, depending on the context, requirements for function differ:
- Selection
Expects function to be of type Callable[[Any], Any]. Applies a Python function to each individual value in the column.
- GroupBy
Expects function to be of type Callable[[Series], Any]. For each group, applies a Python function to the slice of the column corresponding to that group.
- Parameters:
- function
Lambda/function to map.
- return_dtype
Dtype of the output Series. If not set, the dtype will be inferred based on the first non-null value that is returned by the function.
- skip_nulls
Don’t map the function over values that contain nulls (this is faster).
- pass_name
Pass the Series name to the custom function (this is more expensive).
- returns_scalar
If the function passed does a reduction (e.g. sum, min, etc.), Polars must be informed of this, otherwise the schema might be incorrect.
- strategy {‘thread_local’, ‘threading’}
The threading strategy to use.
‘thread_local’: run the Python function on a single thread.
‘threading’: run the Python function on separate threads. Use with care as this can slow performance. This might only speed up your code if the amount of work per element is significant and the Python function releases the GIL (e.g. by calling a C function).
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Warning
If return_dtype is not provided, this may lead to unexpected results. We allow this, but it is considered a bug in the user’s query.
Notes
Using map_elements is strongly discouraged as you will be effectively running Python “for” loops, which will be very slow. Wherever possible you should prefer the native expression API to achieve the best performance.
If your function is expensive and you don’t want it to be called more than once for a given input, consider applying an @lru_cache decorator to it. If your data is suitable you may achieve significant speedups.
Window function application using over is considered a GroupBy context here, so map_elements can be used to map functions over window groups.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 1],
...         "b": ["a", "b", "c", "c"],
...     }
... )
The function is applied to each element of column 'a':
>>> df.with_columns(
...     pl.col("a")
...     .map_elements(lambda x: x * 2, return_dtype=pl.Int64)
...     .alias("a_times_2"),
... )
shape: (4, 3)
┌─────┬─────┬───────────┐
│ a   ┆ b   ┆ a_times_2 │
│ --- ┆ --- ┆ ---       │
│ i64 ┆ str ┆ i64       │
╞═════╪═════╪═══════════╡
│ 1   ┆ a   ┆ 2         │
│ 2   ┆ b   ┆ 4         │
│ 3   ┆ c   ┆ 6         │
│ 1   ┆ c   ┆ 2         │
└─────┴─────┴───────────┘
Tip: it is better to implement this with an expression:
>>> df.with_columns(
...     (pl.col("a") * 2).alias("a_times_2"),
... )
In a GroupBy context, each element of the column is itself a Series:
>>> (
...     df.lazy().group_by("b").agg(pl.col("a")).collect()
... )
shape: (3, 2)
┌─────┬───────────┐
│ b   ┆ a         │
│ --- ┆ ---       │
│ str ┆ list[i64] │
╞═════╪═══════════╡
│ a   ┆ [1]       │
│ b   ┆ [2]       │
│ c   ┆ [3, 1]    │
└─────┴───────────┘
Therefore, from the user’s point-of-view, the function is applied per-group:
>>> (
...     df.lazy()
...     .group_by("b")
...     .agg(pl.col("a").map_elements(lambda x: x.sum(), return_dtype=pl.Int64))
...     .collect()
... )
shape: (3, 2)
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 1   │
│ b   ┆ 2   │
│ c   ┆ 4   │
└─────┴─────┘
Tip: again, it is better to implement this with an expression:
>>> (
...     df.lazy()
...     .group_by("b", maintain_order=True)
...     .agg(pl.col("a").sum())
...     .collect()
... )
Window function application using over will behave as a GroupBy context, with your function receiving individual window groups:
>>> df = pl.DataFrame(
...     {
...         "key": ["x", "x", "y", "x", "y", "z"],
...         "val": [1, 1, 1, 1, 1, 1],
...     }
... )
>>> df.with_columns(
...     scaled=pl.col("val")
...     .map_elements(lambda s: s * len(s), return_dtype=pl.List(pl.Int64))
...     .over("key"),
... ).sort("key")
shape: (6, 3)
┌─────┬─────┬────────┐
│ key ┆ val ┆ scaled │
│ --- ┆ --- ┆ ---    │
│ str ┆ i64 ┆ i64    │
╞═════╪═════╪════════╡
│ x   ┆ 1   ┆ 3      │
│ x   ┆ 1   ┆ 3      │
│ x   ┆ 1   ┆ 3      │
│ y   ┆ 1   ┆ 2      │
│ y   ┆ 1   ┆ 2      │
│ z   ┆ 1   ┆ 1      │
└─────┴─────┴────────┘
Note that this function would also be better-implemented natively:
>>> df.with_columns(
...     scaled=(pl.col("val") * pl.col("val").count()).over("key"),
... ).sort("key")