polars.Expr.str.normalize

Expr.str.normalize(form: UnicodeForm = 'NFC') → Expr

Returns the Unicode normal form of the string values.

This uses the forms described in Unicode Standard Annex 15: <https://www.unicode.org/reports/tr15/>.

Parameters:
form : {'NFC', 'NFKC', 'NFD', 'NFKD'}

Unicode normalization form to use. Defaults to 'NFC'.
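
To make the difference between the four forms concrete, the sketch below uses Python's standard-library `unicodedata.normalize`, which implements the same forms from Unicode Standard Annex 15. It illustrates the forms themselves rather than Polars internals; that `str.normalize` gives the same results for these inputs is an assumption based on both following the same standard. The sample strings are chosen for illustration only.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"     # LATIN SMALL LIGATURE FI

# The two spellings of "é" render the same but compare unequal as-is.
assert composed != decomposed

# NFC composes canonically equivalent sequences; NFD decomposes them.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# Canonical forms (NFC, NFD) keep compatibility characters such as the
# ligature; compatibility forms (NFKC, NFKD) replace them.
lig_nfc = unicodedata.normalize("NFC", ligature)
lig_nfkc = unicodedata.normalize("NFKC", ligature)

print(nfc == composed, nfd == decomposed, repr(lig_nfkc))
```

In short, 'NFC' and 'NFD' only change how canonically equivalent characters are encoded, while 'NFKC' and 'NFKD' also fold compatibility characters (ligatures, superscripts, fullwidth letters) into their plain counterparts, which can lose formatting distinctions.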

Examples

>>> import polars as pl
>>> df = pl.DataFrame({"text": ["01²", "ＫＡＤＯＫＡＷＡ"]})
>>> new = df.with_columns(
...     nfc=pl.col("text").str.normalize("NFC"),
...     nfkc=pl.col("text").str.normalize("NFKC"),
... )
>>> new
shape: (2, 3)
┌──────────────────┬──────────────────┬──────────┐
│ text             ┆ nfc              ┆ nfkc     │
│ ---              ┆ ---              ┆ ---      │
│ str              ┆ str              ┆ str      │
╞══════════════════╪══════════════════╪══════════╡
│ 01²              ┆ 01²              ┆ 012      │
│ ＫＡＤＯＫＡＷＡ ┆ ＫＡＤＯＫＡＷＡ ┆ KADOKAWA │
└──────────────────┴──────────────────┴──────────┘
└──────────────────┴──────────────────┴──────────┘
>>> new.select(pl.all().str.len_bytes())
shape: (2, 3)
┌──────┬─────┬──────┐
│ text ┆ nfc ┆ nfkc │
│ ---  ┆ --- ┆ ---  │
│ u32  ┆ u32 ┆ u32  │
╞══════╪═════╪══════╡
│ 4    ┆ 4   ┆ 3    │
│ 24   ┆ 24  ┆ 8    │
└──────┴─────┴──────┘