Expressions and contexts
Polars has developed its own Domain Specific Language (DSL) for transforming data. The language is very easy to use and allows for complex queries that remain human readable. Expressions and contexts, which will be introduced here, are very important in achieving this readability while also allowing the Polars query engine to optimize your queries to make them run as fast as possible.
Expressions
In Polars, an expression is a lazy representation of a data transformation. Expressions are modular and flexible, which means you can use them as building blocks to build more complex expressions. Here is an example of a Polars expression:
import polars as pl
pl.col("weight") / (pl.col("height") ** 2)
As you might be able to guess, this expression takes a column named “weight” and divides its values by the square of the values in a column “height”, computing a person's BMI.
The code above expresses an abstract computation that we can save in a variable, manipulate further, or just print:
bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
print(bmi_expr)
[(col("weight")) / (col("height").pow([dyn int: 2]))]
Because expressions are lazy, no computations have taken place yet. That's what we need contexts for.
Contexts
Polars expressions need a context in which they are executed to produce a result. Depending on the context it is used in, the same Polars expression can produce different results. In this section, we will learn about the four most common contexts that Polars provides1:
select
with_columns
filter
group_by
We use the dataframe below to show how each of the contexts works.
from datetime import date
df = pl.DataFrame(
{
"name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
"birthdate": [
date(1997, 1, 10),
date(1985, 2, 15),
date(1983, 3, 22),
date(1981, 4, 30),
],
"weight": [57.9, 72.5, 53.6, 83.1], # (kg)
"height": [1.56, 1.77, 1.65, 1.75], # (m)
}
)
print(df)
use chrono::prelude::*;
use polars::prelude::*;
let df: DataFrame = df!(
"name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
"birthdate" => [
NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
NaiveDate::from_ymd_opt(1983, 3, 22).unwrap(),
NaiveDate::from_ymd_opt(1981, 4, 30).unwrap(),
],
"weight" => [57.9, 72.5, 53.6, 83.1], // (kg)
"height" => [1.56, 1.77, 1.65, 1.75], // (m)
)
.unwrap();
println!("{}", df);
shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name ┆ birthdate ┆ weight ┆ height │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ f64 ┆ f64 │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer ┆ 1997-01-10 ┆ 57.9 ┆ 1.56 │
│ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 │
│ Chloe Cooper ┆ 1983-03-22 ┆ 53.6 ┆ 1.65 │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1 ┆ 1.75 │
└────────────────┴────────────┴────────┴────────┘
select
The selection context select
applies expressions over columns. The context select
may produce
new columns that are aggregations, combinations of other columns, or literals:
result = df.select(
bmi=bmi_expr,
avg_bmi=bmi_expr.mean(),
ideal_max_bmi=25,
)
print(result)
let bmi = col("weight") / col("height").pow(2);
let result = df
.clone()
.lazy()
.select([
bmi.clone().alias("bmi"),
bmi.clone().mean().alias("avg_bmi"),
lit(25).alias("ideal_max_bmi"),
])
.collect()?;
println!("{}", result);
shape: (4, 3)
┌───────────┬───────────┬───────────────┐
│ bmi ┆ avg_bmi ┆ ideal_max_bmi │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ i32 │
╞═══════════╪═══════════╪═══════════════╡
│ 23.791913 ┆ 23.438973 ┆ 25 │
│ 23.141498 ┆ 23.438973 ┆ 25 │
│ 19.687787 ┆ 23.438973 ┆ 25 │
│ 27.134694 ┆ 23.438973 ┆ 25 │
└───────────┴───────────┴───────────────┘
The expressions in a context select
must produce series that are all the same length or they must
produce a scalar. Scalars will be broadcast to match the length of the remaining series. Literals,
like the number used above, are also broadcast.
Note that broadcasting can also occur within expressions. For instance, consider the expression below:
shape: (4, 1)
┌───────────┐
│ deviation │
│ --- │
│ f64 │
╞═══════════╡
│ 0.115645 │
│ -0.097471 │
│ -1.22912 │
│ 1.210946 │
└───────────┘
Both the subtraction and the division use broadcasting within the expression because the subexpressions that compute the mean and the standard deviation evaluate to single values.
The context select
is very flexible and powerful and allows you to evaluate arbitrary expressions
independent of, and in parallel to, each other. This is also true of the other contexts that we will
see next.
with_columns
The context with_columns
is very similar to the context select
. The main difference between the
two is that the context with_columns
creates a new dataframe that contains the columns from the
original dataframe and the new columns according to its input expressions, whereas the context
select
only includes the columns selected by its input expressions:
result = df.with_columns(
bmi=bmi_expr,
avg_bmi=bmi_expr.mean(),
ideal_max_bmi=25,
)
print(result)
let result = df
.clone()
.lazy()
.with_columns([
bmi.clone().alias("bmi"),
bmi.clone().mean().alias("avg_bmi"),
lit(25).alias("ideal_max_bmi"),
])
.collect()?;
println!("{}", result);
shape: (4, 7)
┌────────────────┬────────────┬────────┬────────┬───────────┬───────────┬───────────────┐
│ name ┆ birthdate ┆ weight ┆ height ┆ bmi ┆ avg_bmi ┆ ideal_max_bmi │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ i32 │
╞════════════════╪════════════╪════════╪════════╪═══════════╪═══════════╪═══════════════╡
│ Alice Archer ┆ 1997-01-10 ┆ 57.9 ┆ 1.56 ┆ 23.791913 ┆ 23.438973 ┆ 25 │
│ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 ┆ 23.141498 ┆ 23.438973 ┆ 25 │
│ Chloe Cooper ┆ 1983-03-22 ┆ 53.6 ┆ 1.65 ┆ 19.687787 ┆ 23.438973 ┆ 25 │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1 ┆ 1.75 ┆ 27.134694 ┆ 23.438973 ┆ 25 │
└────────────────┴────────────┴────────┴────────┴───────────┴───────────┴───────────────┘
Because of this difference between select
and with_columns
, the expressions used in a context
with_columns
must produce series that have the same length as the original columns in the
dataframe, whereas it is enough for the expressions in the context select
to produce series that
have the same length among them.
filter
The context filter
filters the rows of a dataframe based on one or more expressions that evaluate
to the Boolean data type.
result = df.filter(
pl.col("birthdate").is_between(date(1982, 12, 31), date(1996, 1, 1)),
pl.col("height") > 1.7,
)
print(result)
let result = df
.clone()
.lazy()
.filter(
col("birthdate")
.is_between(
lit(NaiveDate::from_ymd_opt(1982, 12, 31).unwrap()),
lit(NaiveDate::from_ymd_opt(1996, 1, 1).unwrap()),
ClosedInterval::Both,
)
.and(col("height").gt(lit(1.7))),
)
.collect()?;
println!("{}", result);
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name ┆ birthdate ┆ weight ┆ height │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ f64 ┆ f64 │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 │
└───────────┴────────────┴────────┴────────┘
group_by
and aggregations
In the context group_by
, rows are grouped according to the unique values of the grouping
expressions. You can then apply expressions to the resulting groups, which may be of variable
lengths.
When using the context group_by
, you can use an expression to compute the groupings dynamically:
result = df.group_by(
(pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
).agg(pl.col("name"))
print(result)
let result = df
.clone()
.lazy()
.group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
.agg([col("name")])
.collect()?;
println!("{}", result);
shape: (2, 2)
┌────────┬─────────────────────────────────┐
│ decade ┆ name │
│ --- ┆ --- │
│ i32 ┆ list[str] │
╞════════╪═════════════════════════════════╡
│ 1990 ┆ ["Alice Archer"] │
│ 1980 ┆ ["Ben Brown", "Chloe Cooper", … │
└────────┴─────────────────────────────────┘
After using group_by
we use agg
to apply aggregating expressions to the groups. Since in the
example above we only specified the name of a column, we get the groups of that column as lists.
We can specify as many grouping expressions as we'd like and the context group_by
will group the
rows according to the distinct values across the expressions specified. Here, we group by a
combination of decade of birth and whether the person is shorter than 1.7 metres:
result = df.group_by(
(pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
(pl.col("height") < 1.7).alias("short?"),
).agg(pl.col("name"))
print(result)
let result = df
.clone()
.lazy()
.group_by([
(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade"),
(col("height").lt(lit(1.7)).alias("short?")),
])
.agg([col("name")])
.collect()?;
println!("{}", result);
shape: (3, 3)
┌────────┬────────┬─────────────────────────────────┐
│ decade ┆ short? ┆ name │
│ --- ┆ --- ┆ --- │
│ i32 ┆ bool ┆ list[str] │
╞════════╪════════╪═════════════════════════════════╡
│ 1980 ┆ true ┆ ["Chloe Cooper"] │
│ 1980 ┆ false ┆ ["Ben Brown", "Daniel Donovan"… │
│ 1990 ┆ true ┆ ["Alice Archer"] │
└────────┴────────┴─────────────────────────────────┘
The resulting dataframe, after applying aggregating expressions, contains one column per each grouping expression on the left and then as many columns as needed to represent the results of the aggregating expressions. In turn, we can specify as many aggregating expressions as we want:
result = df.group_by(
(pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
(pl.col("height") < 1.7).alias("short?"),
).agg(
pl.len(),
pl.col("height").max().alias("tallest"),
pl.col("weight", "height").mean().name.prefix("avg_"),
)
print(result)
let result = df
.clone()
.lazy()
.group_by([
(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade"),
(col("height").lt(lit(1.7)).alias("short?")),
])
.agg([
len(),
col("height").max().alias("tallest"),
cols(["weight", "height"]).mean().name().prefix("avg_"),
])
.collect()?;
println!("{}", result);
shape: (3, 6)
┌────────┬────────┬─────┬─────────┬────────────┬────────────┐
│ decade ┆ short? ┆ len ┆ tallest ┆ avg_weight ┆ avg_height │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ bool ┆ u32 ┆ f64 ┆ f64 ┆ f64 │
╞════════╪════════╪═════╪═════════╪════════════╪════════════╡
│ 1980 ┆ false ┆ 2 ┆ 1.77 ┆ 77.8 ┆ 1.76 │
│ 1990 ┆ true ┆ 1 ┆ 1.56 ┆ 57.9 ┆ 1.56 │
│ 1980 ┆ true ┆ 1 ┆ 1.65 ┆ 53.6 ┆ 1.65 │
└────────┴────────┴─────┴─────────┴────────────┴────────────┘
See also group_by_dynamic
and group_by_rolling
for other grouping contexts.
Expression expansion
The last example contained two grouping expressions and three aggregating expressions, and yet the resulting dataframe contained six columns instead of five. If we look closely, the last aggregating expression mentioned two different columns: “weight” and “height”.
Polars expressions support a feature called expression expansion. Expression expansion is like a shorthand notation for when you want to apply the same transformation to multiple columns. As we have seen, the expression
pl.col("weight", "height").mean().name.prefix("avg_")
will compute the mean value of the columns “weight” and “height” and will rename them as “avg_weight” and “avg_height”, respectively. In fact, the expression above is equivalent to using the two following expressions:
[
pl.col("weight").mean().alias("avg_weight"),
pl.col("height").mean().alias("avg_height"),
]
In this case, this expression expands into two independent expressions that Polars can execute in parallel. In other cases, we may not be able to know in advance how many independent expressions an expression will unfold into.
Consider this simple but elucidative example:
(pl.col(pl.Float64) * 1.1).name.suffix("*1.1")
This expression will multiply all columns with data type Float64
by 1.1
. The number of columns
this applies to depends on the schema of each dataframe. In the case of the dataframe we have been
using, it applies to two columns:
shape: (4, 2)
┌────────────┬────────────┐
│ weight*1.1 ┆ height*1.1 │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞════════════╪════════════╡
│ 63.69 ┆ 1.716 │
│ 79.75 ┆ 1.947 │
│ 58.96 ┆ 1.815 │
│ 91.41 ┆ 1.925 │
└────────────┴────────────┘
In the case of the dataframe df2
below, the same expression expands to 0 columns because no column
has the data type Float64
:
df2 = pl.DataFrame(
{
"ints": [1, 2, 3, 4],
"letters": ["A", "B", "C", "D"],
}
)
result = df2.select(expr)
print(result)
let df2: DataFrame = df!(
"ints" => [1, 2, 3, 4],
"letters" => ["A", "B", "C", "D"],
)
.unwrap();
let result = df2.clone().lazy().select([expr.clone()]).collect()?;
println!("{}", result);
shape: (0, 0)
┌┐
╞╡
└┘
It is equally easy to imagine a scenario where the same expression would expand to dozens of columns.
Next, you will learn about
the lazy API and the function explain
, which you can use
to preview what an expression will expand to given a schema.
Conclusion
Because expressions are lazy, when you use an expression inside a context Polars can try to simplify your expression before running the data transformation it expresses. Separate expressions within a context are embarrassingly parallel and Polars will take advantage of that, while also parallelizing expression execution when using expression expansion. Further performance gains can be obtained when using the lazy API of Polars, which is introduced next.
We have only scratched the surface of the capabilities of expressions. There are a ton more expressions and they can be combined in a variety of ways. See the section on expressions for a deeper dive on the different types of expressions available.
-
There are additional List and SQL contexts which are covered later in this guide. But for simplicity, we leave them out of scope for now. ↩