Basic operations
This section shows how to perform basic operations on dataframe columns, such as basic arithmetic, comparisons, and other general-purpose operations. We will use the following dataframe for the examples that follow:
import polars as pl
import numpy as np
np.random.seed(42) # For reproducibility.
df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", "spam"],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "A", "B"],
    }
)
print(df)
use polars::prelude::*;
let df = df!(
    "nrs" => &[Some(1), Some(2), Some(3), None, Some(5)],
    "names" => &["foo", "ham", "spam", "egg", "spam"],
    "random" => &[0.37454, 0.950714, 0.731994, 0.598658, 0.156019],
    "groups" => &["A", "A", "B", "A", "B"],
)?;
println!("{}", &df);
shape: (5, 4)
┌──────┬───────┬──────────┬────────┐
│ nrs  ┆ names ┆ random   ┆ groups │
│ ---  ┆ ---   ┆ ---      ┆ ---    │
│ i64  ┆ str   ┆ f64      ┆ str    │
╞══════╪═══════╪══════════╪════════╡
│ 1    ┆ foo   ┆ 0.37454  ┆ A      │
│ 2    ┆ ham   ┆ 0.950714 ┆ A      │
│ 3    ┆ spam  ┆ 0.731994 ┆ B      │
│ null ┆ egg   ┆ 0.598658 ┆ A      │
│ 5    ┆ spam  ┆ 0.156019 ┆ B      │
└──────┴───────┴──────────┴────────┘
Basic arithmetic
Polars supports basic arithmetic between series of the same length, or between series and literals. When literals are mixed with series, the literals are broadcast to match the length of the series they are being used with.
result = df.select(
    (pl.col("nrs") + 5).alias("nrs + 5"),
    (pl.col("nrs") - 5).alias("nrs - 5"),
    (pl.col("nrs") * pl.col("random")).alias("nrs * random"),
    (pl.col("nrs") / pl.col("random")).alias("nrs / random"),
    (pl.col("nrs") ** 2).alias("nrs ** 2"),
    (pl.col("nrs") % 3).alias("nrs % 3"),
)
print(result)
let result = df
    .clone()
    .lazy()
    .select([
        (col("nrs") + lit(5)).alias("nrs + 5"),
        (col("nrs") - lit(5)).alias("nrs - 5"),
        (col("nrs") * col("random")).alias("nrs * random"),
        (col("nrs") / col("random")).alias("nrs / random"),
        (col("nrs").pow(lit(2))).alias("nrs ** 2"),
        (col("nrs") % lit(3)).alias("nrs % 3"),
    ])
    .collect()?;
println!("{}", result);
shape: (5, 6)
┌─────────┬─────────┬──────────────┬──────────────┬──────────┬─────────┐
│ nrs + 5 ┆ nrs - 5 ┆ nrs * random ┆ nrs / random ┆ nrs ** 2 ┆ nrs % 3 │
│ ---     ┆ ---     ┆ ---          ┆ ---          ┆ ---      ┆ ---     │
│ i64     ┆ i64     ┆ f64          ┆ f64          ┆ i64      ┆ i64     │
╞═════════╪═════════╪══════════════╪══════════════╪══════════╪═════════╡
│ 6       ┆ -4      ┆ 0.37454      ┆ 2.669941     ┆ 1        ┆ 1       │
│ 7       ┆ -3      ┆ 1.901429     ┆ 2.103681     ┆ 4        ┆ 2       │
│ 8       ┆ -2      ┆ 2.195982     ┆ 4.098395     ┆ 9        ┆ 0       │
│ null    ┆ null    ┆ null         ┆ null         ┆ null     ┆ null    │
│ 10      ┆ 0       ┆ 0.780093     ┆ 32.047453    ┆ 25       ┆ 2       │
└─────────┴─────────┴──────────────┴──────────────┴──────────┴─────────┘
The example above shows that when an arithmetic operation takes null as one of its operands, the result is null.
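If you want a concrete result even where one of the operands is missing, one option is to fill the nulls before the arithmetic. The sketch below is only an illustration; the fill value 0 is an arbitrary choice, not something prescribed by the example above:
# Sketch: fill_null replaces missing values before the arithmetic, so the
# result is no longer null. The fill value 0 is arbitrary, for illustration only.
result_no_nulls = df.select(
    (pl.col("nrs").fill_null(0) + 5).alias("nrs + 5, nulls filled with 0"),
)
print(result_no_nulls)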
Polars uses operator overloading to allow you to use your language's native arithmetic operators within your expressions. If you prefer, in Python you can use the corresponding named functions, as the snippet below demonstrates:
# Python only:
result_named_operators = df.select(
    (pl.col("nrs").add(5)).alias("nrs + 5"),
    (pl.col("nrs").sub(5)).alias("nrs - 5"),
    (pl.col("nrs").mul(pl.col("random"))).alias("nrs * random"),
    (pl.col("nrs").truediv(pl.col("random"))).alias("nrs / random"),
    (pl.col("nrs").pow(2)).alias("nrs ** 2"),
    (pl.col("nrs").mod(3)).alias("nrs % 3"),
)
print(result.equals(result_named_operators))
True
Comparisons
Like with arithmetic operations, Polars supports comparisons via the overloaded operators or named functions:
result = df.select(
    (pl.col("nrs") > 1).alias("nrs > 1"),  # .gt
    (pl.col("nrs") >= 3).alias("nrs >= 3"),  # .ge
    (pl.col("random") < 0.2).alias("random < .2"),  # .lt
    (pl.col("random") <= 0.5).alias("random <= .5"),  # .le
    (pl.col("nrs") != 1).alias("nrs != 1"),  # .ne
    (pl.col("nrs") == 1).alias("nrs == 1"),  # .eq
)
print(result)
let result = df
    .clone()
    .lazy()
    .select([
        col("nrs").gt(1).alias("nrs > 1"),
        col("nrs").gt_eq(3).alias("nrs >= 3"),
        col("random").lt(0.2).alias("random < .2"),
        col("random").lt_eq(0.5).alias("random <= .5"),
        col("nrs").neq(1).alias("nrs != 1"),
        col("nrs").eq(1).alias("nrs == 1"),
    ])
    .collect()?;
println!("{}", result);
shape: (5, 6)
┌─────────┬──────────┬─────────────┬──────────────┬──────────┬──────────┐
│ nrs > 1 ┆ nrs >= 3 ┆ random < .2 ┆ random <= .5 ┆ nrs != 1 ┆ nrs == 1 │
│ ---     ┆ ---      ┆ ---         ┆ ---          ┆ ---      ┆ ---      │
│ bool    ┆ bool     ┆ bool        ┆ bool         ┆ bool     ┆ bool     │
╞═════════╪══════════╪═════════════╪══════════════╪══════════╪══════════╡
│ false   ┆ false    ┆ false       ┆ true         ┆ false    ┆ true     │
│ true    ┆ false    ┆ false       ┆ false        ┆ true     ┆ false    │
│ true    ┆ true     ┆ false       ┆ false        ┆ true     ┆ false    │
│ null    ┆ null     ┆ false       ┆ false        ┆ null     ┆ null     │
│ true    ┆ true     ┆ true        ┆ true         ┆ true     ┆ false    │
└─────────┴──────────┴─────────────┴──────────────┴──────────┴──────────┘
Boolean and bitwise operations
Depending on the language, you may use the operators &, |, and ~ for the Boolean operations "and", "or", and "not", respectively, or the functions of the same name:
# Boolean operators & | ~
result = df.select(
    ((~pl.col("nrs").is_null()) & (pl.col("groups") == "A")).alias(
        "number not null and group A"
    ),
    ((pl.col("random") < 0.5) | (pl.col("groups") == "B")).alias(
        "random < 0.5 or group B"
    ),
)
print(result)
# Corresponding named functions `and_`, `or_`, and `not_`.
result2 = df.select(
    (pl.col("nrs").is_null().not_().and_(pl.col("groups") == "A")).alias(
        "number not null and group A"
    ),
    ((pl.col("random") < 0.5).or_(pl.col("groups") == "B")).alias(
        "random < 0.5 or group B"
    ),
)
print(result.equals(result2))
let result = df
    .clone()
    .lazy()
    .select([
        ((col("nrs").is_null()).not().and(col("groups").eq(lit("A"))))
            .alias("number not null and group A"),
        (col("random").lt(lit(0.5)).or(col("groups").eq(lit("B"))))
            .alias("random < 0.5 or group B"),
    ])
    .collect()?;
println!("{}", result);
shape: (5, 2)
┌─────────────────────────────┬─────────────────────────┐
│ number not null and group A ┆ random < 0.5 or group B │
│ ---                         ┆ ---                     │
│ bool                        ┆ bool                    │
╞═════════════════════════════╪═════════════════════════╡
│ true                        ┆ true                    │
│ true                        ┆ false                   │
│ false                       ┆ true                    │
│ false                       ┆ false                   │
│ false                       ┆ true                    │
└─────────────────────────────┴─────────────────────────┘
True
Python trivia
The Python functions are called and_, or_, and not_ because the words and, or, and not are reserved keywords in Python. Similarly, we cannot use the keywords and, or, and not as the Boolean operators because these Python keywords interpret their operands in the context of Truthy and Falsy through the dunder method __bool__. Thus, we overload the bitwise operators &, |, and ~ as the Boolean operators because they are the second best choice.
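A practical side effect of this choice in Python is operator precedence: & and | bind more tightly than comparisons like < and ==, so each comparison must be wrapped in its own parentheses, as the earlier snippets do. The minimal sketch below only restates that rule; the alias string is an illustrative name, not part of the dataframe above:
# Python's `&` binds more tightly than `<` and `==`, so each comparison
# needs its own parentheses when combined with `&` or `|`.
result_parenthesized = df.select(
    ((pl.col("random") < 0.5) & (pl.col("groups") == "A")).alias(
        "random < 0.5 and group A"
    ),
)
# Without the inner parentheses, the expression
# pl.col("random") < 0.5 & pl.col("groups") == "A"
# would group `0.5 & pl.col("groups")` first, which is not the intended comparison.
print(result_parenthesized)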
These operators/functions can also be used for the respective bitwise operations, alongside the bitwise operator ^ / function xor:
result = df.select(
    pl.col("nrs"),
    (pl.col("nrs") & 6).alias("nrs & 6"),
    (pl.col("nrs") | 6).alias("nrs | 6"),
    (~pl.col("nrs")).alias("not nrs"),
    (pl.col("nrs") ^ 6).alias("nrs ^ 6"),
)
print(result)
let result = df
    .clone()
    .lazy()
    .select([
        col("nrs"),
        col("nrs").and(lit(6)).alias("nrs & 6"),
        col("nrs").or(lit(6)).alias("nrs | 6"),
        col("nrs").not().alias("not nrs"),
        col("nrs").xor(lit(6)).alias("nrs ^ 6"),
    ])
    .collect()?;
println!("{}", result);
shape: (5, 5)
┌──────┬─────────┬─────────┬─────────┬─────────┐
│ nrs  ┆ nrs & 6 ┆ nrs | 6 ┆ not nrs ┆ nrs ^ 6 │
│ ---  ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ i64  ┆ i64     ┆ i64     ┆ i64     ┆ i64     │
╞══════╪═════════╪═════════╪═════════╪═════════╡
│ 1    ┆ 0       ┆ 7       ┆ -2      ┆ 7       │
│ 2    ┆ 2       ┆ 6       ┆ -3      ┆ 4       │
│ 3    ┆ 2       ┆ 7       ┆ -4      ┆ 5       │
│ null ┆ null    ┆ null    ┆ null    ┆ null    │
│ 5    ┆ 4       ┆ 7       ┆ -6      ┆ 3       │
└──────┴─────────┴─────────┴─────────┴─────────┘
Counting (unique) values
Polars has two functions to count the number of unique values in a series. The function n_unique can be used to count the exact number of unique values in a series. However, for very large data sets, this operation can be quite slow. In those cases, if an approximation is good enough, you can use the function approx_n_unique, which uses the HyperLogLog++ algorithm to estimate the result.
The example below shows a series where the approx_n_unique estimate is off by 0.9%:
long_df = pl.DataFrame({"numbers": np.random.randint(0, 100_000, 100_000)})
result = long_df.select(
    pl.col("numbers").n_unique().alias("n_unique"),
    pl.col("numbers").approx_n_unique().alias("approx_n_unique"),
)
print(result)
use rand::distributions::{Distribution, Uniform};
use rand::thread_rng;

let mut rng = thread_rng();
let between = Uniform::new_inclusive(0, 100_000);
let arr: Vec<u32> = between.sample_iter(&mut rng).take(100_000).collect();
let long_df = df!(
    "numbers" => &arr
)?;

let result = long_df
    .clone()
    .lazy()
    .select([
        col("numbers").n_unique().alias("n_unique"),
        col("numbers").approx_n_unique().alias("approx_n_unique"),
    ])
    .collect()?;
println!("{}", result);
shape: (1, 2)
┌──────────┬─────────────────┐
│ n_unique ┆ approx_n_unique │
│ ---      ┆ ---             │
│ u32      ┆ u32             │
╞══════════╪═════════════════╡
│ 63218    ┆ 63784           │
└──────────┴─────────────────┘
You can get more information about the unique values and their counts with the function value_counts, which Polars also provides:
result = df.select(
    pl.col("names").value_counts().alias("value_counts"),
)
print(result)
let result = df
    .clone()
    .lazy()
    .select([col("names")
        .value_counts(false, false, "count", false)
        .alias("value_counts")])
    .collect()?;
println!("{}", result);
shape: (4, 1)
┌──────────────┐
│ value_counts │
│ ---          │
│ struct[2]    │
╞══════════════╡
│ {"ham",1}    │
│ {"spam",2}   │
│ {"egg",1}    │
│ {"foo",1}    │
└──────────────┘
The function value_counts returns the results in structs, a data type that we will explore in a later section.
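If you would rather have the names and their counts as two regular columns right away, one possible approach (a sketch, not the only option) is to unnest the resulting struct column:
# Sketch: unnest turns the struct column produced by value_counts into
# separate columns, one per struct field.
result_unnested = df.select(
    pl.col("names").value_counts().alias("value_counts"),
).unnest("value_counts")
print(result_unnested)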
Alternatively, if you only need a series with the unique values or a series with the unique counts, they are one function away:
result = df.select(
    pl.col("names").unique(maintain_order=True).alias("unique"),
    pl.col("names").unique_counts().alias("unique_counts"),
)
print(result)
let result = df
    .clone()
    .lazy()
    .select([
        col("names").unique_stable().alias("unique"),
        col("names").unique_counts().alias("unique_counts"),
    ])
    .collect()?;
println!("{}", result);
shape: (4, 2)
┌────────┬───────────────┐
│ unique ┆ unique_counts │
│ ---    ┆ ---           │
│ str    ┆ u32           │
╞════════╪═══════════════╡
│ foo    ┆ 1             │
│ ham    ┆ 1             │
│ spam   ┆ 2             │
│ egg    ┆ 1             │
└────────┴───────────────┘
Note that we need to specify maintain_order=True in the function unique so that the order of the results is consistent with the order of the results in unique_counts. See the API reference for more information.
Conditionals
Polars supports something akin to a ternary operator through the function when, which is followed by one function then and an optional function otherwise. The function when accepts a predicate expression. The values for which the predicate evaluates to True are replaced by the corresponding values of the expression inside the function then. The values for which it evaluates to False are replaced by the corresponding values of the expression inside the function otherwise or, if otherwise is not provided, by null.
The example below applies one step of the Collatz conjecture to the numbers in the column "nrs":
result = df.select(
    pl.col("nrs"),
    pl.when(pl.col("nrs") % 2 == 1)  # Is the number odd?
    .then(3 * pl.col("nrs") + 1)  # If so, multiply by 3 and add 1.
    .otherwise(pl.col("nrs") // 2)  # If not, divide by 2.
    .alias("Collatz"),
)
print(result)
let result = df
    .clone()
    .lazy()
    .select([
        col("nrs"),
        when((col("nrs") % lit(2)).eq(lit(1)))
            .then(lit(3) * col("nrs") + lit(1))
            .otherwise(col("nrs") / lit(2))
            .alias("Collatz"),
    ])
    .collect()?;
println!("{}", result);
shape: (5, 2)
┌──────┬─────────┐
│ nrs  ┆ Collatz │
│ ---  ┆ ---     │
│ i64  ┆ i64     │
╞══════╪═════════╡
│ 1    ┆ 4       │
│ 2    ┆ 1       │
│ 3    ┆ 10      │
│ null ┆ null    │
│ 5    ┆ 16      │
└──────┴─────────┘
You can also emulate a chain of an arbitrary number of conditionals, akin to Python's elif statement, by chaining consecutive blocks of .when(...).then(...). In those cases, and for each given value, Polars only considers a replacement expression that is deeper within the chain if all of the previous predicates failed for that value.
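As a minimal sketch of such a chain (the thresholds 0.75 and 0.25 and the labels are made up purely for illustration), the snippet below buckets the values of the column "random" into three levels:
# Sketch: the thresholds and labels are arbitrary, chosen only to show how
# consecutive .when(...).then(...) blocks behave like an if/elif/else chain.
result_chained = df.select(
    pl.col("random"),
    pl.when(pl.col("random") > 0.75)
    .then(pl.lit("high"))
    .when(pl.col("random") > 0.25)  # Only evaluated where the first predicate failed.
    .then(pl.lit("medium"))
    .otherwise(pl.lit("low"))
    .alias("level"),
)
print(result_chained)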