Skip to content

Contexts

Polars has developed its own Domain Specific Language (DSL) for transforming data. The language is very easy to use and allows for complex queries that remain human readable. The two core components of the language are Contexts and Expressions, the latter we will cover in the next section.

A context, as implied by the name, refers to the context in which an expression needs to be evaluated. There are three main contexts 1:

  1. Selection: df.select(...), df.with_columns(...)
  2. Filtering: df.filter()
  3. Group by / Aggregation: df.group_by(...).agg(...)

The examples below are performed on the following DataFrame:

DataFrame

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", None],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
print(df)

DataFrame

use rand::{thread_rng, Rng};

let mut arr = [0f64; 5];
thread_rng().fill(&mut arr);

let df = df! (
    "nrs" => &[Some(1), Some(2), Some(3), None, Some(5)],
    "names" => &[Some("foo"), Some("ham"), Some("spam"), Some("eggs"), None],
    "random" => &arr,
    "groups" => &["A", "A", "B", "C", "B"],
)?;

println!("{}", &df);

shape: (5, 4)
┌──────┬───────┬──────────┬────────┐
│ nrs  ┆ names ┆ random   ┆ groups │
│ ---  ┆ ---   ┆ ---      ┆ ---    │
│ i64  ┆ str   ┆ f64      ┆ str    │
╞══════╪═══════╪══════════╪════════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      │
│ 2    ┆ ham   ┆ 0.74005  ┆ A      │
│ 3    ┆ spam  ┆ 0.263315 ┆ B      │
│ null ┆ egg   ┆ 0.533739 ┆ C      │
│ 5    ┆ null  ┆ 0.014575 ┆ B      │
└──────┴───────┴──────────┴────────┘

Selection

The selection context applies expressions over columns. A select may produce new columns that are aggregations, combinations of expressions, or literals.

The expressions in a selection context must produce Series that are all the same length or have a length of 1. Literals are treated as length-1 Series.

When some expressions produce length-1 Series and some do not, the length-1 Series will be broadcast to match the length of the remaining Series. Note that broadcasting can also occur within expressions: for instance, in pl.col.value() / pl.col.value.sum(), each element of the value column is divided by the column's sum.

select

out = df.select(
    pl.sum("nrs"),
    pl.col("names").sort(),
    pl.col("names").first().alias("first name"),
    (pl.mean("nrs") * 10).alias("10xnrs"),
)
print(out)

select

let out = df
    .clone()
    .lazy()
    .select([
        sum("nrs"),
        col("names").sort(false),
        col("names").first().alias("first name"),
        (mean("nrs") * lit(10)).alias("10xnrs"),
    ])
    .collect()?;
println!("{}", out);

shape: (5, 4)
┌─────┬───────┬────────────┬────────┐
│ nrs ┆ names ┆ first name ┆ 10xnrs │
│ --- ┆ ---   ┆ ---        ┆ ---    │
│ i64 ┆ str   ┆ str        ┆ f64    │
╞═════╪═══════╪════════════╪════════╡
│ 11  ┆ null  ┆ foo        ┆ 27.5   │
│ 11  ┆ egg   ┆ foo        ┆ 27.5   │
│ 11  ┆ foo   ┆ foo        ┆ 27.5   │
│ 11  ┆ ham   ┆ foo        ┆ 27.5   │
│ 11  ┆ spam  ┆ foo        ┆ 27.5   │
└─────┴───────┴────────────┴────────┘

As you can see from the query, the selection context is very powerful and allows you to evaluate arbitrary expressions independent of (and in parallel to) each other.

Similar to the select statement, the with_columns statement also enters into the selection context. The main difference between with_columns and select is that with_columns retains the original columns and adds new ones, whereas select drops the original columns.

with_columns

df = df.with_columns(
    pl.sum("nrs").alias("nrs_sum"),
    pl.col("random").count().alias("count"),
)
print(df)

with_columns

let out = df
    .clone()
    .lazy()
    .with_columns([
        sum("nrs").alias("nrs_sum"),
        col("random").count().alias("count"),
    ])
    .collect()?;
println!("{}", out);

shape: (5, 6)
┌──────┬───────┬──────────┬────────┬─────────┬───────┐
│ nrs  ┆ names ┆ random   ┆ groups ┆ nrs_sum ┆ count │
│ ---  ┆ ---   ┆ ---      ┆ ---    ┆ ---     ┆ ---   │
│ i64  ┆ str   ┆ f64      ┆ str    ┆ i64     ┆ u32   │
╞══════╪═══════╪══════════╪════════╪═════════╪═══════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      ┆ 11      ┆ 5     │
│ 2    ┆ ham   ┆ 0.74005  ┆ A      ┆ 11      ┆ 5     │
│ 3    ┆ spam  ┆ 0.263315 ┆ B      ┆ 11      ┆ 5     │
│ null ┆ egg   ┆ 0.533739 ┆ C      ┆ 11      ┆ 5     │
│ 5    ┆ null  ┆ 0.014575 ┆ B      ┆ 11      ┆ 5     │
└──────┴───────┴──────────┴────────┴─────────┴───────┘

Filtering

The filtering context filters a DataFrame based on one or more expressions that evaluate to the Boolean data type.

filter

out = df.filter(pl.col("nrs") > 2)
print(out)

filter

let out = df.clone().lazy().filter(col("nrs").gt(lit(2))).collect()?;
println!("{}", out);

shape: (2, 6)
┌─────┬───────┬──────────┬────────┬─────────┬───────┐
│ nrs ┆ names ┆ random   ┆ groups ┆ nrs_sum ┆ count │
│ --- ┆ ---   ┆ ---      ┆ ---    ┆ ---     ┆ ---   │
│ i64 ┆ str   ┆ f64      ┆ str    ┆ i64     ┆ u32   │
╞═════╪═══════╪══════════╪════════╪═════════╪═══════╡
│ 3   ┆ spam  ┆ 0.263315 ┆ B      ┆ 11      ┆ 5     │
│ 5   ┆ null  ┆ 0.014575 ┆ B      ┆ 11      ┆ 5     │
└─────┴───────┴──────────┴────────┴─────────┴───────┘

Group by / aggregation

In the group_by context, expressions work on groups and thus may yield results of any length (a group may have many members).

group_by

out = df.group_by("groups").agg(
    pl.sum("nrs"),  # sum nrs by groups
    pl.col("random").count().alias("count"),  # count group members
    # sum random where name != null
    pl.col("random").filter(pl.col("names").is_not_null()).sum().name.suffix("_sum"),
    pl.col("names").reverse().alias("reversed names"),
)
print(out)

group_by

let out = df
    .lazy()
    .group_by([col("groups")])
    .agg([
        sum("nrs"),                           // sum nrs by groups
        col("random").count().alias("count"), // count group members
        // sum random where name != null
        col("random")
            .filter(col("names").is_not_null())
            .sum()
            .name()
            .suffix("_sum"),
        col("names").reverse().alias("reversed names"),
    ])
    .collect()?;
println!("{}", out);

shape: (3, 5)
┌────────┬─────┬───────┬────────────┬────────────────┐
│ groups ┆ nrs ┆ count ┆ random_sum ┆ reversed names │
│ ---    ┆ --- ┆ ---   ┆ ---        ┆ ---            │
│ str    ┆ i64 ┆ u32   ┆ f64        ┆ list[str]      │
╞════════╪═════╪═══════╪════════════╪════════════════╡
│ C      ┆ 0   ┆ 1     ┆ 0.533739   ┆ ["egg"]        │
│ A      ┆ 3   ┆ 2     ┆ 0.894213   ┆ ["ham", "foo"] │
│ B      ┆ 8   ┆ 2     ┆ 0.263315   ┆ [null, "spam"] │
└────────┴─────┴───────┴────────────┴────────────────┘

As you can see from the result all expressions are applied to the group defined by the group_by context. Besides the standard group_by, group_by_dynamic, and group_by_rolling are also entrances to the group by context.


  1. There are additional List and SQL contexts which are covered later in this guide. But for simplicity, we leave them out of scope for now.