polars.Expr.str.extract
- Expr.str.extract(pattern: str, group_index: int = 1) → Expr
Extract the target capture group from provided patterns.
- Parameters:
- pattern
A valid regular expression pattern, compatible with the regex crate.
- group_index
Index of the targeted capture group. Group 0 is the whole match; capture groups are numbered from 1. Defaults to the first capture group.
- Returns:
- Expr
Expression of data type Utf8. Contains null values if the original value is null or the regex captures nothing.
Notes
To modify regular expression behaviour (such as multi-line matching) with flags, use the inline (?iLmsuxU) syntax. For example:

>>> import polars as pl
>>> df = pl.DataFrame(
...     data={
...         "lines": [
...             "I Like\nThose\nOdds",
...             "This is\nThe Way",
...         ]
...     }
... )
>>> df.with_columns(
...     pl.col("lines").str.extract(r"(?m)^(T\w+)", 1).alias("matches"),
... )
shape: (2, 2)
┌─────────┬─────────┐
│ lines   ┆ matches │
│ ---     ┆ ---     │
│ str     ┆ str     │
╞═════════╪═════════╡
│ I Like  ┆ Those   │
│ Those   ┆         │
│ Odds    ┆         │
│ This is ┆ This    │
│ The Way ┆         │
└─────────┴─────────┘

See the regex crate’s section on grouping and flags for additional information about the use of inline expression modifiers.
Examples
>>> import polars as pl
>>> df = pl.DataFrame(
...     {
...         "url": [
...             "http://vote.com/ballon_dor?error=404&ref=unknown",
...             "http://vote.com/ballon_dor?ref=polars&candidate=messi",
...             "http://vote.com/ballon_dor?candidate=ronaldo&ref=polars",
...         ]
...     }
... )
>>> df.select(
...     pl.col("url").str.extract(r"candidate=(\w+)", 1).alias("candidate"),
...     pl.col("url").str.extract(r"ref=(\w+)", 1).alias("referer"),
...     pl.col("url").str.extract(r"error=(\w+)", 1).alias("error"),
... )
shape: (3, 3)
┌───────────┬─────────┬───────┐
│ candidate ┆ referer ┆ error │
│ ---       ┆ ---     ┆ ---   │
│ str       ┆ str     ┆ str   │
╞═══════════╪═════════╪═══════╡
│ null      ┆ unknown ┆ 404   │
│ messi     ┆ polars  ┆ null  │
│ ronaldo   ┆ polars  ┆ null  │
└───────────┴─────────┴───────┘