polars.DataFrame.unique#

DataFrame.unique( subset: IntoExpr | Collection[IntoExpr] | None = None, *, keep: UniqueKeepStrategy = 'any', maintain_order: bool = False, ) → DataFrame[source]#

Drop duplicate rows from this DataFrame.

Parameters:

subset

Column name(s), selector(s), or expressions to consider when identifying duplicate rows. If set to None (default), all columns are considered.

keep{‘first’, ‘last’, ‘any’, ‘none’}

Which of the duplicate rows to keep.

‘any’: Does not give any guarantee of which row is kept.
This allows more optimizations.
‘none’: Don’t keep duplicate rows.
‘first’: Keep the first unique row.
‘last’: Keep the last unique row.

maintain_order

Keep the same order as the original DataFrame. This is more expensive to compute. Settings this to True blocks the possibility to run on the streaming engine.

Returns:

DataFrame: DataFrame with unique rows.

Warning

This method will fail if there is a column of type List in the DataFrame (or in the “subset” parameter).

Notes

If you’re coming from Pandas, this is similar to pandas.DataFrame.drop_duplicates.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 1, 1],
...         "bar": ["a", "a", "a", "x", "x"],
...         "ham": ["b", "b", "b", "y", "y"],
...     }
... )

By default, all columns are considered when determining which rows are unique:

>>> df.unique(maintain_order=True)
shape: (4, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ a   ┆ b   │
│ 2   ┆ a   ┆ b   │
│ 3   ┆ a   ┆ b   │
│ 1   ┆ x   ┆ y   │
└─────┴─────┴─────┘

We can also consider only a subset of columns when determining uniqueness, controlling which row we keep when duplicates are found:

>>> df.unique(subset="foo", keep="first", maintain_order=True)
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ a   ┆ b   │
│ 2   ┆ a   ┆ b   │
│ 3   ┆ a   ┆ b   │
└─────┴─────┴─────┘
>>> df.unique(subset="foo", keep="last", maintain_order=True)
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 2   ┆ a   ┆ b   │
│ 3   ┆ a   ┆ b   │
│ 1   ┆ x   ┆ y   │
└─────┴─────┴─────┘
>>> df.unique(subset="foo", keep="none", maintain_order=True)
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 2   ┆ a   ┆ b   │
│ 3   ┆ a   ┆ b   │
└─────┴─────┴─────┘

Selectors can be used to define the “subset” parameter:

>>> import polars.selectors as cs
>>> df.unique(subset=cs.string(), maintain_order=True)
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ a   ┆ b   │
│ 1   ┆ x   ┆ y   │
└─────┴─────┴─────┘

We can also use an arbitrary expression in the “subset” parameter; in this example we use the part of the label in front of “:” to determine uniqueness:

>>> df = pl.DataFrame(
...     {
...         "label": ["xx:1", "xx:2", "yy:3", "yy:4"],
...         "value": [100, 200, 300, 400],
...     }
... )
>>> df.unique(
...     subset=pl.col("label").str.extract(r"^(\w+):"),
...     maintain_order=True,
...     keep="first",
... )
shape: (2, 2)
┌───────┬───────┐
│ label ┆ value │
│ ---   ┆ ---   │
│ str   ┆ i64   │
╞═══════╪═══════╡
│ xx:1  ┆ 100   │
│ yy:3  ┆ 300   │
└───────┴───────┘