polars.DataFrame.unique
- DataFrame.unique(
- subset: IntoExpr | Collection[IntoExpr] | None = None,
- *,
- keep: UniqueKeepStrategy = 'any',
- maintain_order: bool = False,
- ) → DataFrame
Drop duplicate rows from this DataFrame.
- Parameters:
- subset
Column name(s), selector(s), or expressions to consider when identifying duplicate rows. If set to None (default), all columns are considered.
- keep {‘first’, ‘last’, ‘any’, ‘none’}
Which of the duplicate rows to keep.
- ‘any’: Does not give any guarantee of which row is kept. This allows more optimizations.
- ‘none’: Don’t keep any row that has duplicates.
- ‘first’: Keep the first occurrence of each unique row.
- ‘last’: Keep the last occurrence of each unique row.
- maintain_order
Keep the same order as the original DataFrame. This is more expensive to compute. Setting this to True prevents the operation from running on the streaming engine.
- Returns:
- DataFrame
DataFrame with unique rows.
Warning
This method will fail if there is a column of type List in the DataFrame (or in the “subset” parameter).
Notes
If you’re coming from Pandas, this is similar to pandas.DataFrame.drop_duplicates.
Examples
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 1, 1],
...         "bar": ["a", "a", "a", "x", "x"],
...         "ham": ["b", "b", "b", "y", "y"],
...     }
... )
By default, all columns are considered when determining which rows are unique:
>>> df.unique(maintain_order=True)
shape: (4, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ a   ┆ b   │
│ 2   ┆ a   ┆ b   │
│ 3   ┆ a   ┆ b   │
│ 1   ┆ x   ┆ y   │
└─────┴─────┴─────┘
We can also consider only a subset of columns when determining uniqueness, controlling which row we keep when duplicates are found:
>>> df.unique(subset="foo", keep="first", maintain_order=True)
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ a   ┆ b   │
│ 2   ┆ a   ┆ b   │
│ 3   ┆ a   ┆ b   │
└─────┴─────┴─────┘
>>> df.unique(subset="foo", keep="last", maintain_order=True)
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 2   ┆ a   ┆ b   │
│ 3   ┆ a   ┆ b   │
│ 1   ┆ x   ┆ y   │
└─────┴─────┴─────┘
>>> df.unique(subset="foo", keep="none", maintain_order=True)
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 2   ┆ a   ┆ b   │
│ 3   ┆ a   ┆ b   │
└─────┴─────┴─────┘
Selectors can be used to define the “subset” parameter:
>>> import polars.selectors as cs
>>> df.unique(subset=cs.string(), maintain_order=True)
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ a   ┆ b   │
│ 1   ┆ x   ┆ y   │
└─────┴─────┴─────┘
We can also use an arbitrary expression in the “subset” parameter; in this example we use the part of the label in front of “:” to determine uniqueness:
>>> df = pl.DataFrame(
...     {
...         "label": ["xx:1", "xx:2", "yy:3", "yy:4"],
...         "value": [100, 200, 300, 400],
...     }
... )
>>> df.unique(
...     subset=pl.col("label").str.extract(r"^(\w+):"),
...     maintain_order=True,
...     keep="first",
... )
shape: (2, 2)
┌───────┬───────┐
│ label ┆ value │
│ ---   ┆ ---   │
│ str   ┆ i64   │
╞═══════╪═══════╡
│ xx:1  ┆ 100   │
│ yy:3  ┆ 300   │
└───────┴───────┘