polars.Expr.str.extract_all
- Expr.str.extract_all(pattern: str | Expr) → Expr
Extract all matches for the given regex pattern.
Extract each successive non-overlapping regex match in an individual string as a list. Extracted matches contain null if the original value is null or the regex did not capture anything.
- Parameters:
- pattern
A valid regular expression pattern, compatible with the regex crate.
- Returns:
- Expr
Expression of data type List(Utf8).
Notes
To modify regular expression behaviour (such as “verbose” mode and/or case-sensitive matching) with flags, use the inline (?iLmsuxU) syntax. For example:
>>> df = pl.DataFrame(
...     data={
...         "email": [
...             "real.email@spam.com",
...             "some_account@somewhere.net",
...             "abc.def.ghi.jkl@uvw.xyz.co.uk",
...         ]
...     }
... )
>>> # extract name/domain parts from the addresses, using verbose regex
>>> df.with_columns(
...     pl.col("email")
...     .str.extract_all(
...         r"""(?xi)   # activate 'verbose' and 'case-insensitive' flags
...         [           # (start character group)
...           A-Z       # letters
...           0-9       # digits
...           ._%+\-    # special chars
...         ]           # (end character group)
...         +           # 'one or more' quantifier
...         """
...     )
...     .list.to_struct(fields=["name", "domain"])
...     .alias("email_parts")
... ).unnest("email_parts")
shape: (3, 3)
┌───────────────────────────────┬─────────────────┬───────────────┐
│ email                         ┆ name            ┆ domain        │
│ ---                           ┆ ---             ┆ ---           │
│ str                           ┆ str             ┆ str           │
╞═══════════════════════════════╪═════════════════╪═══════════════╡
│ real.email@spam.com           ┆ real.email      ┆ spam.com      │
│ some_account@somewhere.net    ┆ some_account    ┆ somewhere.net │
│ abc.def.ghi.jkl@uvw.xyz.co.uk ┆ abc.def.ghi.jkl ┆ uvw.xyz.co.uk │
└───────────────────────────────┴─────────────────┴───────────────┘
See the regex crate’s section on grouping and flags for additional information about the use of inline expression modifiers.
Examples
>>> df = pl.DataFrame({"foo": ["123 bla 45 asd", "xyz 678 910t"]})
>>> df.select(
...     pl.col("foo").str.extract_all(r"\d+").alias("extracted_nrs"),
... )
shape: (2, 1)
┌────────────────┐
│ extracted_nrs  │
│ ---            │
│ list[str]      │
╞════════════════╡
│ ["123", "45"]  │
│ ["678", "910"] │
└────────────────┘