polars.Series.str.extract_all

Series.str.extract_all(pattern: str | Series) → Series

Extract all matches for the given regex pattern.

Extract each successive non-overlapping regex match in an individual string and return the matches as a list. The result is null if the original value is null; a string in which the regex matches nothing yields an empty list.

Parameters:
pattern

A valid regular expression pattern, compatible with the Rust regex crate. May be given as a single string or as a Series of patterns.

Returns:
Series

Series of data type List(Utf8).

Notes

To modify regular expression behaviour (such as “verbose” mode and/or case-insensitive matching), use the regex crate’s inline flag syntax, for example (?i), (?x), or a combination such as (?xi). For example:

>>> s = pl.Series(
...     name="email",
...     values=[
...         "real.email@spam.com",
...         "some_account@somewhere.net",
...         "abc.def.ghi.jkl@uvw.xyz.co.uk",
...     ],
... )
>>> # extract name/domain parts from email, using verbose regex
>>> s.str.extract_all(
...     r"""(?xi)   # activate 'verbose' and 'case-insensitive' flags
...       [         # (start character group)
...         A-Z     # letters
...         0-9     # digits
...         ._%+\-  # special chars
...       ]         # (end character group)
...       +         # 'one or more' quantifier
...     """
... ).alias("email_parts")
shape: (3,)
Series: 'email_parts' [list[str]]
[
    ["real.email", "spam.com"]
    ["some_account", "somewhere.net"]
    ["abc.def.ghi.jkl", "uvw.xyz.co.uk"]
]

See the regex crate’s section on grouping and flags for additional information about the use of inline expression modifiers.

Examples

>>> s = pl.Series("foo", ["123 bla 45 asd", "xyz 678 910t"])
>>> s.str.extract_all(r"\d+")
shape: (2,)
Series: 'foo' [list[str]]
[
    ["123", "45"]
    ["678", "910"]
]