polars.Expr.str.extract

Expr.str.extract(pattern: str, group_index: int = 1) → Expr

Extract the target capture group from provided patterns.

Parameters:
pattern

A valid regular expression pattern, compatible with the regex crate.

group_index

Index of the targeted capture group. Group 0 means the whole pattern; the first capture group begins at index 1. Defaults to the first capture group.

Returns:
Expr

Expression of data type Utf8. Contains null values if the original value is null or the regex captures nothing.

Notes

To modify regular expression behaviour (such as multi-line matching) with flags, use the inline (?iLmsuxU) syntax. For example:

>>> df = pl.DataFrame(
...     data={
...         "lines": [
...             "I Like\nThose\nOdds",
...             "This is\nThe Way",
...         ]
...     }
... )
>>> df.select(
...     pl.col("lines").str.extract(r"(?m)^(T\w+)", 1).alias("matches"),
... )
shape: (2, 1)
┌─────────┐
│ matches │
│ ---     │
│ str     │
╞═════════╡
│ Those   │
│ This    │
└─────────┘

See the regex crate’s section on grouping and flags for additional information about the use of inline expression modifiers.

Examples

>>> df = pl.DataFrame(
...     {
...         "url": [
...             "http://vote.com/ballon_dor?error=404&ref=unknown",
...             "http://vote.com/ballon_dor?ref=polars&candidate=messi",
...             "http://vote.com/ballon_dor?candidate=ronaldo&ref=polars",
...         ]
...     }
... )
>>> df.select(
...     pl.col("url").str.extract(r"candidate=(\w+)", 1).alias("candidate"),
...     pl.col("url").str.extract(r"ref=(\w+)", 1).alias("referer"),
...     pl.col("url").str.extract(r"error=(\w+)", 1).alias("error"),
... )
shape: (3, 3)
┌───────────┬─────────┬───────┐
│ candidate ┆ referer ┆ error │
│ ---       ┆ ---     ┆ ---   │
│ str       ┆ str     ┆ str   │
╞═══════════╪═════════╪═══════╡
│ null      ┆ unknown ┆ 404   │
│ messi     ┆ polars  ┆ null  │
│ ronaldo   ┆ polars  ┆ null  │
└───────────┴─────────┴───────┘