Skip to content

Expressions and contexts

Polars has developed its own Domain Specific Language (DSL) for transforming data. The language is very easy to use and allows for complex queries that remain human readable. Expressions and contexts, which will be introduced here, are very important in achieving this readability while also allowing the Polars query engine to optimize your queries to make them run as fast as possible.

Expressions

In Polars, an expression is a lazy representation of a data transformation. Expressions are modular and flexible, which means you can use them as building blocks to build more complex expressions. Here is an example of a Polars expression:

import polars as pl

pl.col("weight") / (pl.col("height") ** 2)

As you might be able to guess, this expression takes a column named “weight” and divides its values by the square of the values in a column “height”, computing a person's BMI.

The code above expresses an abstract computation that we can save in a variable, manipulate further, or just print:

bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
print(bmi_expr)
[(col("weight")) / (col("height").pow([dyn int: 2]))]

Because expressions are lazy, no computations have taken place yet. That's what we need contexts for.

Contexts

Polars expressions need a context in which they are executed to produce a result. Depending on the context it is used in, the same Polars expression can produce different results. In this section, we will learn about the four most common contexts that Polars provides1:

  1. select
  2. with_columns
  3. filter
  4. group_by

We use the dataframe below to show how each of the contexts works.

from datetime import date

df = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate": [
            date(1997, 1, 10),
            date(1985, 2, 15),
            date(1983, 3, 22),
            date(1981, 4, 30),
        ],
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)
    }
)

print(df)
use chrono::prelude::*;
use polars::prelude::*;

let df: DataFrame = df!(
    "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
    "birthdate" => [
        NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
        NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
        NaiveDate::from_ymd_opt(1983, 3, 22).unwrap(),
        NaiveDate::from_ymd_opt(1981, 4, 30).unwrap(),
    ],
    "weight" => [57.9, 72.5, 53.6, 83.1],  // (kg)
    "height" => [1.56, 1.77, 1.65, 1.75],  // (m)
)
.unwrap();
println!("{}", df);
shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘

select

The selection context select applies expressions over columns. The context select may produce new columns that are aggregations, combinations of other columns, or literals:

select

result = df.select(
    bmi=bmi_expr,
    avg_bmi=bmi_expr.mean(),
    ideal_max_bmi=25,
)
print(result)

select

let bmi = col("weight") / col("height").pow(2);
let result = df
    .clone()
    .lazy()
    .select([
        bmi.clone().alias("bmi"),
        bmi.clone().mean().alias("avg_bmi"),
        lit(25).alias("ideal_max_bmi"),
    ])
    .collect()?;
println!("{}", result);

shape: (4, 3)
┌───────────┬───────────┬───────────────┐
│ bmi       ┆ avg_bmi   ┆ ideal_max_bmi │
│ ---       ┆ ---       ┆ ---           │
│ f64       ┆ f64       ┆ i32           │
╞═══════════╪═══════════╪═══════════════╡
│ 23.791913 ┆ 23.438973 ┆ 25            │
│ 23.141498 ┆ 23.438973 ┆ 25            │
│ 19.687787 ┆ 23.438973 ┆ 25            │
│ 27.134694 ┆ 23.438973 ┆ 25            │
└───────────┴───────────┴───────────────┘

The expressions in a context select must produce series that are all the same length or they must produce a scalar. Scalars will be broadcast to match the length of the remaining series. Literals, like the number used above, are also broadcast.

Note that broadcasting can also occur within expressions. For instance, consider the expression below:

select

result = df.select(deviation=(bmi_expr - bmi_expr.mean()) / bmi_expr.std())
print(result)

select

let result = df
    .clone()
    .lazy()
    .select([((bmi.clone() - bmi.clone().mean()) / bmi.clone().std(1)).alias("deviation")])
    .collect()?;
println!("{}", result);

shape: (4, 1)
┌───────────┐
│ deviation │
│ ---       │
│ f64       │
╞═══════════╡
│ 0.115645  │
│ -0.097471 │
│ -1.22912  │
│ 1.210946  │
└───────────┘

Both the subtraction and the division use broadcasting within the expression because the subexpressions that compute the mean and the standard deviation evaluate to single values.

The context select is very flexible and powerful and allows you to evaluate arbitrary expressions independent of, and in parallel to, each other. This is also true of the other contexts that we will see next.

with_columns

The context with_columns is very similar to the context select. The main difference between the two is that the context with_columns creates a new dataframe that contains the columns from the original dataframe and the new columns according to its input expressions, whereas the context select only includes the columns selected by its input expressions:

with_columns

result = df.with_columns(
    bmi=bmi_expr,
    avg_bmi=bmi_expr.mean(),
    ideal_max_bmi=25,
)
print(result)

with_columns

let result = df
    .clone()
    .lazy()
    .with_columns([
        bmi.clone().alias("bmi"),
        bmi.clone().mean().alias("avg_bmi"),
        lit(25).alias("ideal_max_bmi"),
    ])
    .collect()?;
println!("{}", result);

shape: (4, 7)
┌────────────────┬────────────┬────────┬────────┬───────────┬───────────┬───────────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ bmi       ┆ avg_bmi   ┆ ideal_max_bmi │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---       ┆ ---       ┆ ---           │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ f64       ┆ f64       ┆ i32           │
╞════════════════╪════════════╪════════╪════════╪═══════════╪═══════════╪═══════════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 23.791913 ┆ 23.438973 ┆ 25            │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ 23.141498 ┆ 23.438973 ┆ 25            │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ 19.687787 ┆ 23.438973 ┆ 25            │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ 27.134694 ┆ 23.438973 ┆ 25            │
└────────────────┴────────────┴────────┴────────┴───────────┴───────────┴───────────────┘

Because of this difference between select and with_columns, the expressions used in a context with_columns must produce series that have the same length as the original columns in the dataframe, whereas it is enough for the expressions in the context select to produce series that have the same length among them.

filter

The context filter filters the rows of a dataframe based on one or more expressions that evaluate to the Boolean data type.

filter

result = df.filter(
    pl.col("birthdate").is_between(date(1982, 12, 31), date(1996, 1, 1)),
    pl.col("height") > 1.7,
)
print(result)

filter

let result = df
    .clone()
    .lazy()
    .filter(
        col("birthdate")
            .is_between(
                lit(NaiveDate::from_ymd_opt(1982, 12, 31).unwrap()),
                lit(NaiveDate::from_ymd_opt(1996, 1, 1).unwrap()),
                ClosedInterval::Both,
            )
            .and(col("height").gt(lit(1.7))),
    )
    .collect()?;
println!("{}", result);

shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘

group_by and aggregations

In the context group_by, rows are grouped according to the unique values of the grouping expressions. You can then apply expressions to the resulting groups, which may be of variable lengths.

When using the context group_by, you can use an expression to compute the groupings dynamically:

group_by

result = df.group_by(
    (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
).agg(pl.col("name"))
print(result)

group_by

let result = df
    .clone()
    .lazy()
    .group_by([(col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade")])
    .agg([col("name")])
    .collect()?;
println!("{}", result);

shape: (2, 2)
┌────────┬─────────────────────────────────┐
│ decade ┆ name                            │
│ ---    ┆ ---                             │
│ i32    ┆ list[str]                       │
╞════════╪═════════════════════════════════╡
│ 1990   ┆ ["Alice Archer"]                │
│ 1980   ┆ ["Ben Brown", "Chloe Cooper", … │
└────────┴─────────────────────────────────┘

After using group_by we use agg to apply aggregating expressions to the groups. Since in the example above we only specified the name of a column, we get the groups of that column as lists.

We can specify as many grouping expressions as we'd like and the context group_by will group the rows according to the distinct values across the expressions specified. Here, we group by a combination of decade of birth and whether the person is shorter than 1.7 metres:

group_by

result = df.group_by(
    (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
    (pl.col("height") < 1.7).alias("short?"),
).agg(pl.col("name"))
print(result)

group_by

let result = df
    .clone()
    .lazy()
    .group_by([
        (col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade"),
        (col("height").lt(lit(1.7)).alias("short?")),
    ])
    .agg([col("name")])
    .collect()?;
println!("{}", result);

shape: (3, 3)
┌────────┬────────┬─────────────────────────────────┐
│ decade ┆ short? ┆ name                            │
│ ---    ┆ ---    ┆ ---                             │
│ i32    ┆ bool   ┆ list[str]                       │
╞════════╪════════╪═════════════════════════════════╡
│ 1980   ┆ true   ┆ ["Chloe Cooper"]                │
│ 1990   ┆ true   ┆ ["Alice Archer"]                │
│ 1980   ┆ false  ┆ ["Ben Brown", "Daniel Donovan"… │
└────────┴────────┴─────────────────────────────────┘

The resulting dataframe, after applying aggregating expressions, contains one column per each grouping expression on the left and then as many columns as needed to represent the results of the aggregating expressions. In turn, we can specify as many aggregating expressions as we want:

group_by

result = df.group_by(
    (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
    (pl.col("height") < 1.7).alias("short?"),
).agg(
    pl.len(),
    pl.col("height").max().alias("tallest"),
    pl.col("weight", "height").mean().name.prefix("avg_"),
)
print(result)

group_by

let result = df
    .clone()
    .lazy()
    .group_by([
        (col("birthdate").dt().year() / lit(10) * lit(10)).alias("decade"),
        (col("height").lt(lit(1.7)).alias("short?")),
    ])
    .agg([
        len(),
        col("height").max().alias("tallest"),
        cols(["weight", "height"]).mean().name().prefix("avg_"),
    ])
    .collect()?;
println!("{}", result);

shape: (3, 6)
┌────────┬────────┬─────┬─────────┬────────────┬────────────┐
│ decade ┆ short? ┆ len ┆ tallest ┆ avg_weight ┆ avg_height │
│ ---    ┆ ---    ┆ --- ┆ ---     ┆ ---        ┆ ---        │
│ i32    ┆ bool   ┆ u32 ┆ f64     ┆ f64        ┆ f64        │
╞════════╪════════╪═════╪═════════╪════════════╪════════════╡
│ 1980   ┆ false  ┆ 2   ┆ 1.77    ┆ 77.8       ┆ 1.76       │
│ 1990   ┆ true   ┆ 1   ┆ 1.56    ┆ 57.9       ┆ 1.56       │
│ 1980   ┆ true   ┆ 1   ┆ 1.65    ┆ 53.6       ┆ 1.65       │
└────────┴────────┴─────┴─────────┴────────────┴────────────┘

See also group_by_dynamic and rolling for other grouping contexts.

Expression expansion

The last example contained two grouping expressions and three aggregating expressions, and yet the resulting dataframe contained six columns instead of five. If we look closely, the last aggregating expression mentioned two different columns: “weight” and “height”.

Polars expressions support a feature called expression expansion. Expression expansion is like a shorthand notation for when you want to apply the same transformation to multiple columns. As we have seen, the expression

pl.col("weight", "height").mean().name.prefix("avg_")

will compute the mean value of the columns “weight” and “height” and will rename them as “avg_weight” and “avg_height”, respectively. In fact, the expression above is equivalent to using the two following expressions:

[
    pl.col("weight").mean().alias("avg_weight"),
    pl.col("height").mean().alias("avg_height"),
]

In this case, this expression expands into two independent expressions that Polars can execute in parallel. In other cases, we may not be able to know in advance how many independent expressions an expression will unfold into.

Consider this simple but elucidative example:

(pl.col(pl.Float64) * 1.1).name.suffix("*1.1")

This expression will multiply all columns with data type Float64 by 1.1. The number of columns this applies to depends on the schema of each dataframe. In the case of the dataframe we have been using, it applies to two columns:

group_by

expr = (pl.col(pl.Float64) * 1.1).name.suffix("*1.1")
result = df.select(expr)
print(result)

group_by

let expr = (dtype_col(&DataType::Float64) * lit(1.1))
    .name()
    .suffix("*1.1");
let result = df.clone().lazy().select([expr.clone()]).collect()?;
println!("{}", result);

shape: (4, 2)
┌────────────┬────────────┐
│ weight*1.1 ┆ height*1.1 │
│ ---        ┆ ---        │
│ f64        ┆ f64        │
╞════════════╪════════════╡
│ 63.69      ┆ 1.716      │
│ 79.75      ┆ 1.947      │
│ 58.96      ┆ 1.815      │
│ 91.41      ┆ 1.925      │
└────────────┴────────────┘

In the case of the dataframe df2 below, the same expression expands to 0 columns because no column has the data type Float64:

group_by

df2 = pl.DataFrame(
    {
        "ints": [1, 2, 3, 4],
        "letters": ["A", "B", "C", "D"],
    }
)
result = df2.select(expr)
print(result)

group_by

let df2: DataFrame = df!(
    "ints" => [1, 2, 3, 4],
    "letters" => ["A", "B", "C", "D"],
)
.unwrap();
let result = df2.clone().lazy().select([expr.clone()]).collect()?;
println!("{}", result);

shape: (0, 0)
┌┐
╞╡
└┘

It is equally easy to imagine a scenario where the same expression would expand to dozens of columns.

Next, you will learn about the lazy API and the function explain, which you can use to preview what an expression will expand to given a schema.

Conclusion

Because expressions are lazy, when you use an expression inside a context Polars can try to simplify your expression before running the data transformation it expresses. Separate expressions within a context are embarrassingly parallel and Polars will take advantage of that, while also parallelizing expression execution when using expression expansion. Further performance gains can be obtained when using the lazy API of Polars, which is introduced next.

We have only scratched the surface of the capabilities of expressions. There are a ton more expressions and they can be combined in a variety of ways. See the section on expressions for a deeper dive on the different types of expressions available.


  1. There are additional List and SQL contexts which are covered later in this guide. But for simplicity, we leave them out of scope for now.