Basic operations
This section shows how to perform basic operations on dataframe columns, such as basic arithmetic, comparisons, and other general-purpose operations. We will use the following dataframe for the examples that follow:
import polars as pl
import numpy as np
np.random.seed(42) # For reproducibility.
df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", "spam"],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "A", "B"],
    }
)
print(df)
use polars::prelude::*;
let df = df!(
    "nrs" => &[Some(1), Some(2), Some(3), None, Some(5)],
    "names" => &["foo", "ham", "spam", "egg", "spam"],
    "random" => &[0.37454, 0.950714, 0.731994, 0.598658, 0.156019],
    "groups" => &["A", "A", "B", "A", "B"],
)?;
println!("{}", &df);
shape: (5, 4)
┌──────┬───────┬──────────┬────────┐
│ nrs  ┆ names ┆ random   ┆ groups │
│ ---  ┆ ---   ┆ ---      ┆ ---    │
│ i64  ┆ str   ┆ f64      ┆ str    │
╞══════╪═══════╪══════════╪════════╡
│ 1    ┆ foo   ┆ 0.37454  ┆ A      │
│ 2    ┆ ham   ┆ 0.950714 ┆ A      │
│ 3    ┆ spam  ┆ 0.731994 ┆ B      │
│ null ┆ egg   ┆ 0.598658 ┆ A      │
│ 5    ┆ spam  ┆ 0.156019 ┆ B      │
└──────┴───────┴──────────┴────────┘
Basic arithmetic
Polars supports basic arithmetic between series of the same length, or between series and literals. When literals are mixed with series, the literals are broadcast to match the length of the series they are being used with.
result = df.select(
    (pl.col("nrs") + 5).alias("nrs + 5"),
    (pl.col("nrs") - 5).alias("nrs - 5"),
    (pl.col("nrs") * pl.col("random")).alias("nrs * random"),
    (pl.col("nrs") / pl.col("random")).alias("nrs / random"),
    (pl.col("nrs") ** 2).alias("nrs ** 2"),
    (pl.col("nrs") % 3).alias("nrs % 3"),
)
print(result)
let result = df
    .clone()
    .lazy()
    .select([
        (col("nrs") + lit(5)).alias("nrs + 5"),
        (col("nrs") - lit(5)).alias("nrs - 5"),
        (col("nrs") * col("random")).alias("nrs * random"),
        (col("nrs") / col("random")).alias("nrs / random"),
        (col("nrs").pow(lit(2))).alias("nrs ** 2"),
        (col("nrs") % lit(3)).alias("nrs % 3"),
    ])
    .collect()?;
println!("{}", result);
shape: (5, 6)
┌─────────┬─────────┬──────────────┬──────────────┬──────────┬─────────┐
│ nrs + 5 ┆ nrs - 5 ┆ nrs * random ┆ nrs / random ┆ nrs ** 2 ┆ nrs % 3 │
│ ---     ┆ ---     ┆ ---          ┆ ---          ┆ ---      ┆ ---     │
│ i64     ┆ i64     ┆ f64          ┆ f64          ┆ i64      ┆ i64     │
╞═════════╪═════════╪══════════════╪══════════════╪══════════╪═════════╡
│ 6       ┆ -4      ┆ 0.37454      ┆ 2.669941     ┆ 1        ┆ 1       │
│ 7       ┆ -3      ┆ 1.901429     ┆ 2.103681     ┆ 4        ┆ 2       │
│ 8       ┆ -2      ┆ 2.195982     ┆ 4.098395     ┆ 9        ┆ 0       │
│ null    ┆ null    ┆ null         ┆ null         ┆ null     ┆ null    │
│ 10      ┆ 0       ┆ 0.780093     ┆ 32.047453    ┆ 25       ┆ 2       │
└─────────┴─────────┴──────────────┴──────────────┴──────────┴─────────┘
The example above shows that when an arithmetic operation takes null as one of its operands, the result is null.
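If you want a concrete result even where one of the operands is missing, one option is to fill the nulls before the arithmetic. The sketch below is only an illustration; the fill value 0 is an arbitrary choice, not something prescribed by the example above:
# Sketch: fill_null replaces missing values before the arithmetic, so the
# result is no longer null. The fill value 0 is arbitrary, for illustration only.
result_no_nulls = df.select(
    (pl.col("nrs").fill_null(0) + 5).alias("nrs + 5, nulls filled with 0"),
)
print(result_no_nulls)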
Polars uses operator overloading to allow you to use your language's native arithmetic operators within your expressions. If you prefer, in Python you can use the corresponding named functions, as the snippet below demonstrates:
# Python only:
result_named_operators = df.select(
    (pl.col("nrs").add(5)).alias("nrs + 5"),
    (pl.col("nrs").sub(5)).alias("nrs - 5"),
    (pl.col("nrs").mul(pl.col("random"))).alias("nrs * random"),
    (pl.col("nrs").truediv(pl.col("random"))).alias("nrs / random"),
    (pl.col("nrs").pow(2)).alias("nrs ** 2"),
    (pl.col("nrs").mod(3)).alias("nrs % 3"),
)
print(result.equals(result_named_operators))
True
Comparisons
Like with arithmetic operations, Polars supports comparisons via the overloaded operators or named functions:
result = df.select(
    (pl.col("nrs") > 1).alias("nrs > 1"),  # .gt
    (pl.col("nrs") >= 3).alias("nrs >= 3"),  # .ge
    (pl.col("random") < 0.2).alias("random < .2"),  # .lt
    (pl.col("random") <= 0.5).alias("random <= .5"),  # .le
    (pl.col("nrs") != 1).alias("nrs != 1"),  # .ne
    (pl.col("nrs") == 1).alias("nrs == 1"),  # .eq
)
print(result)
let result = df
    .clone()
    .lazy()
    .select([
        col("nrs").gt(1).alias("nrs > 1"),
        col("nrs").gt_eq(3).alias("nrs >= 3"),
        col("random").lt(0.2).alias("random < .2"),
        col("random").lt_eq(0.5).alias("random <= .5"),
        col("nrs").neq(1).alias("nrs != 1"),
        col("nrs").eq(1).alias("nrs == 1"),
    ])
    .collect()?;
println!("{}", result);
shape: (5, 6)
┌─────────┬──────────┬─────────────┬──────────────┬──────────┬──────────┐
│ nrs > 1 ┆ nrs >= 3 ┆ random < .2 ┆ random <= .5 ┆ nrs != 1 ┆ nrs == 1 │
│ ---     ┆ ---      ┆ ---         ┆ ---          ┆ ---      ┆ ---      │
│ bool    ┆ bool     ┆ bool        ┆ bool         ┆ bool     ┆ bool     │
╞═════════╪══════════╪═════════════╪══════════════╪══════════╪══════════╡
│ false   ┆ false    ┆ false       ┆ true         ┆ false    ┆ true     │
│ true    ┆ false    ┆ false       ┆ false        ┆ true     ┆ false    │
│ true    ┆ true     ┆ false       ┆ false        ┆ true     ┆ false    │
│ null    ┆ null     ┆ false       ┆ false        ┆ null     ┆ null     │
│ true    ┆ true     ┆ true        ┆ true         ┆ true     ┆ false    │
└─────────┴──────────┴─────────────┴──────────────┴──────────┴──────────┘
Boolean and bitwise operations
Depending on the language, you may use the operators &, |, and ~ for the Boolean operations "and", "or", and "not", respectively, or the functions of the same name:
# Boolean operators & | ~
result = df.select(
    ((~pl.col("nrs").is_null()) & (pl.col("groups") == "A")).alias(
        "number not null and group A"
    ),
    ((pl.col("random") < 0.5) | (pl.col("groups") == "B")).alias(
        "random < 0.5 or group B"
    ),
)
print(result)
# Corresponding named functions `and_`, `or_`, and `not_`.
result2 = df.select(
    (pl.col("nrs").is_null().not_().and_(pl.col("groups") == "A")).alias(
        "number not null and group A"
    ),
    ((pl.col("random") < 0.5).or_(pl.col("groups") == "B")).alias(
        "random < 0.5 or group B"
    ),
)
print(result.equals(result2))
let result = df
    .clone()
    .lazy()
    .select([
        ((col("nrs").is_null()).not().and(col("groups").eq(lit("A"))))
            .alias("number not null and group A"),
        (col("random").lt(lit(0.5)).or(col("groups").eq(lit("B"))))
            .alias("random < 0.5 or group B"),
    ])
    .collect()?;
println!("{}", result);
shape: (5, 2)
┌─────────────────────────────┬─────────────────────────┐
│ number not null and group A ┆ random < 0.5 or group B │
│ ---                         ┆ ---                     │
│ bool                        ┆ bool                    │
╞═════════════════════════════╪═════════════════════════╡
│ true                        ┆ true                    │
│ true                        ┆ false                   │
│ false                       ┆ true                    │
│ false                       ┆ false                   │
│ false                       ┆ true                    │
└─────────────────────────────┴─────────────────────────┘
True
Python trivia
The Python functions are called and_, or_, and not_ because the words and, or, and not are reserved keywords in Python. Similarly, we cannot use the keywords and, or, and not as the Boolean operators because these Python keywords interpret their operands in the context of Truthy and Falsy through the dunder method __bool__. Thus, we overload the bitwise operators &, |, and ~ as the Boolean operators because they are the second best choice.
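A practical side effect of this choice in Python is operator precedence: & and | bind more tightly than comparisons like < and ==, so each comparison must be wrapped in its own parentheses, as the earlier snippets do. The minimal sketch below only restates that rule; the alias string is an illustrative name, not part of the dataframe above:
# Python's `&` binds more tightly than `<` and `==`, so each comparison
# needs its own parentheses when combined with `&` or `|`.
result_parenthesized = df.select(
    ((pl.col("random") < 0.5) & (pl.col("groups") == "A")).alias(
        "random < 0.5 and group A"
    ),
)
# Without the inner parentheses, the expression
# pl.col("random") < 0.5 & pl.col("groups") == "A"
# would group `0.5 & pl.col("groups")` first, which is not the intended comparison.
print(result_parenthesized)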
These operators/functions can also be used for the respective bitwise operations, alongside the bitwise operator ^ / function xor:
result = df.select(
    pl.col("nrs"),
    (pl.col("nrs") & 6).alias("nrs & 6"),
    (pl.col("nrs") | 6).alias("nrs | 6"),
    (~pl.col("nrs")).alias("not nrs"),
    (pl.col("nrs") ^ 6).alias("nrs ^ 6"),
)
print(result)
let result = df
    .clone()
    .lazy()
    .select([
        col("nrs"),
        col("nrs").and(lit(6)).alias("nrs & 6"),
        col("nrs").or(lit(6)).alias("nrs | 6"),
        col("nrs").not().alias("not nrs"),
        col("nrs").xor(lit(6)).alias("nrs ^ 6"),
    ])
    .collect()?;
println!("{}", result);
shape: (5, 5)
┌──────┬─────────┬─────────┬─────────┬─────────┐
│ nrs  ┆ nrs & 6 ┆ nrs | 6 ┆ not nrs ┆ nrs ^ 6 │
│ ---  ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ i64  ┆ i64     ┆ i64     ┆ i64     ┆ i64     │
╞══════╪═════════╪═════════╪═════════╪═════════╡
│ 1    ┆ 0       ┆ 7       ┆ -2      ┆ 7       │
│ 2    ┆ 2       ┆ 6       ┆ -3      ┆ 4       │
│ 3    ┆ 2       ┆ 7       ┆ -4      ┆ 5       │
│ null ┆ null    ┆ null    ┆ null    ┆ null    │
│ 5    ┆ 4       ┆ 7       ┆ -6      ┆ 3       │
└──────┴─────────┴─────────┴─────────┴─────────┘
Counting (unique) values
Polars has two functions to count the number of unique values in a series. The function n_unique can be used to count the exact number of unique values in a series. However, for very large data sets, this operation can be quite slow. In those cases, if an approximation is good enough, you can use the function approx_n_unique, which uses the HyperLogLog++ algorithm to estimate the result.
The example below shows a series where the approx_n_unique estimate is off by 0.9%:
long_df = pl.DataFrame({"numbers": np.random.randint(0, 100_000, 100_000)})
result = long_df.select(
    pl.col("numbers").n_unique().alias("n_unique"),
    pl.col("numbers").approx_n_unique().alias("approx_n_unique"),
)
print(result)
use rand::distributions::{Distribution, Uniform};
use rand::thread_rng;

let mut rng = thread_rng();
let between = Uniform::new_inclusive(0, 100_000);
let arr: Vec<u32> = between.sample_iter(&mut rng).take(100_000).collect();
let long_df = df!(
    "numbers" => &arr
)?;

let result = long_df
    .clone()
    .lazy()
    .select([
        col("numbers").n_unique().alias("n_unique"),
        col("numbers").approx_n_unique().alias("approx_n_unique"),
    ])
    .collect()?;
println!("{}", result);
shape: (1, 2)
┌──────────┬─────────────────┐
│ n_unique ┆ approx_n_unique │
│ ---      ┆ ---             │
│ u32      ┆ u32             │
╞══════════╪═════════════════╡
│ 63218    ┆ 63784           │
└──────────┴─────────────────┘
You can get more information about the unique values and their counts with the function value_counts, which Polars also provides:
result = df.select(
    pl.col("names").value_counts().alias("value_counts"),
)
print(result)
let result = df
    .clone()
    .lazy()
    .select([col("names")
        .value_counts(false, false, "count", false)
        .alias("value_counts")])
    .collect()?;
println!("{}", result);
shape: (4, 1)
┌──────────────┐
│ value_counts │
│ ---          │
│ struct[2]    │
╞══════════════╡
│ {"ham",1}    │
│ {"spam",2}   │
│ {"egg",1}    │
│ {"foo",1}    │
└──────────────┘
The function value_counts returns the results in structs, a data type that we will explore in a later section.
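If you would rather have the names and their counts as two regular columns right away, one possible approach (a sketch, not the only option) is to unnest the resulting struct column:
# Sketch: unnest turns the struct column produced by value_counts into
# separate columns, one per struct field.
result_unnested = df.select(
    pl.col("names").value_counts().alias("value_counts"),
).unnest("value_counts")
print(result_unnested)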
Alternatively, if you only need a series with the unique values or a series with the unique counts, they are one function away:
result = df.select(
    pl.col("names").unique(maintain_order=True).alias("unique"),
    pl.col("names").unique_counts().alias("unique_counts"),
)
print(result)
let result = df
    .clone()
    .lazy()
    .select([
        col("names").unique_stable().alias("unique"),
        col("names").unique_counts().alias("unique_counts"),
    ])
    .collect()?;
println!("{}", result);
shape: (4, 2)
┌────────┬───────────────┐
│ unique ┆ unique_counts │
│ ---    ┆ ---           │
│ str    ┆ u32           │
╞════════╪═══════════════╡
│ foo    ┆ 1             │
│ ham    ┆ 1             │
│ spam   ┆ 2             │
│ egg    ┆ 1             │
└────────┴───────────────┘
Note that we need to specify maintain_order=True in the function unique so that the order of the results is consistent with the order of the results in unique_counts. See the API reference for more information.
Conditionals
Polars supports something akin to a ternary operator through the function when, which is followed by one function then and an optional function otherwise. The function when accepts a predicate expression. The values for which the predicate evaluates to True are replaced by the corresponding values of the expression inside the function then. The values for which it evaluates to False are replaced by the corresponding values of the expression inside the function otherwise or, if otherwise is not provided, by null.
The example below applies one step of the Collatz conjecture to the numbers in the column "nrs":
result = df.select(
    pl.col("nrs"),
    pl.when(pl.col("nrs") % 2 == 1)  # Is the number odd?
    .then(3 * pl.col("nrs") + 1)  # If so, multiply by 3 and add 1.
    .otherwise(pl.col("nrs") // 2)  # If not, divide by 2.
    .alias("Collatz"),
)
print(result)
let result = df
    .clone()
    .lazy()
    .select([
        col("nrs"),
        when((col("nrs") % lit(2)).eq(lit(1)))
            .then(lit(3) * col("nrs") + lit(1))
            .otherwise(col("nrs") / lit(2))
            .alias("Collatz"),
    ])
    .collect()?;
println!("{}", result);
shape: (5, 2)
┌──────┬─────────┐
│ nrs  ┆ Collatz │
│ ---  ┆ ---     │
│ i64  ┆ i64     │
╞══════╪═════════╡
│ 1    ┆ 4       │
│ 2    ┆ 1       │
│ 3    ┆ 10      │
│ null ┆ null    │
│ 5    ┆ 16      │
└──────┴─────────┘
You can also emulate a chain of an arbitrary number of conditionals, akin to Python's elif statement, by chaining consecutive blocks of .when(...).then(...). In those cases, and for each given value, Polars only considers a replacement expression that is deeper within the chain if all of the previous predicates failed for that value.
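As a minimal sketch of such a chain (the thresholds 0.75 and 0.25 and the labels are made up purely for illustration), the snippet below buckets the values of the column "random" into three levels:
# Sketch: the thresholds and labels are arbitrary, chosen only to show how
# consecutive .when(...).then(...) blocks behave like an if/elif/else chain.
result_chained = df.select(
    pl.col("random"),
    pl.when(pl.col("random") > 0.75)
    .then(pl.lit("high"))
    .when(pl.col("random") > 0.25)  # Only evaluated where the first predicate failed.
    .then(pl.lit("medium"))
    .otherwise(pl.lit("low"))
    .alias("level"),
)
print(result_chained)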