Skip to content

Aggregation

The Polars context group_by lets you apply expressions on subsets of columns, as defined by the unique values of the column over which the data is grouped. This is a very powerful capability that we explore in this section of the user guide.

We start by reading in a US congress dataset:

DataFrame · Categorical

import polars as pl

url = "https://theunitedstates.io/congress-legislators/legislators-historical.csv"

schema_overrides = {
    "first_name": pl.Categorical,
    "gender": pl.Categorical,
    "type": pl.Categorical,
    "state": pl.Categorical,
    "party": pl.Categorical,
}

dataset = pl.read_csv(url, schema_overrides=schema_overrides).with_columns(
    pl.col("birthday").str.to_date(strict=False)
)

DataFrame · Categorical · Available on feature dtype-categorical

use std::io::Cursor;

use polars::prelude::*;
use reqwest::blocking::Client;

let url = "https://theunitedstates.io/congress-legislators/legislators-historical.csv";

let mut schema = Schema::default();
schema.with_column(
    "first_name".into(),
    DataType::Categorical(None, Default::default()),
);
schema.with_column(
    "gender".into(),
    DataType::Categorical(None, Default::default()),
);
schema.with_column(
    "type".into(),
    DataType::Categorical(None, Default::default()),
);
schema.with_column(
    "state".into(),
    DataType::Categorical(None, Default::default()),
);
schema.with_column(
    "party".into(),
    DataType::Categorical(None, Default::default()),
);
schema.with_column("birthday".into(), DataType::Date);

let data: Vec<u8> = Client::new().get(url).send()?.text()?.bytes().collect();

let dataset = CsvReadOptions::default()
    .with_has_header(true)
    .with_schema_overwrite(Some(Arc::new(schema)))
    .map_parse_options(|parse_options| parse_options.with_try_parse_dates(true))
    .into_reader_with_file_handle(Cursor::new(data))
    .finish()?;

println!("{}", &dataset);

Basic aggregations

You can easily apply multiple expressions to your aggregated values. Simply list all of the expressions you want inside the function agg. There is no upper bound on the number of aggregations you can do and you can make any combination you want. In the snippet below we will group the data based on the column “first_name” and then we will apply the following aggregations:

  • count the number of rows in the group (which means we count how many people in the data set have each unique first name);
  • combine the values of the column “gender” into a list by referring the column but omitting an aggregate function; and
  • get the first value of the column “last_name” within the group.

After computing the aggregations, we immediately sort the result and limit it to the top five rows so that we have a nice summary overview:

group_by

q = (
    dataset.lazy()
    .group_by("first_name")
    .agg(
        pl.len(),
        pl.col("gender"),
        pl.first("last_name"),  # Short for `pl.col("last_name").first()`
    )
    .sort("len", descending=True)
    .limit(5)
)

df = q.collect()
print(df)

group_by

let df = dataset
    .clone()
    .lazy()
    .group_by(["first_name"])
    .agg([len(), col("gender"), col("last_name").first()])
    .sort(
        ["len"],
        SortMultipleOptions::default()
            .with_order_descending(true)
            .with_nulls_last(true),
    )
    .limit(5)
    .collect()?;

println!("{}", df);

shape: (5, 4)
┌────────────┬──────┬───────────────────┬───────────┐
│ first_name ┆ len  ┆ gender            ┆ last_name │
│ ---        ┆ ---  ┆ ---               ┆ ---       │
│ cat        ┆ u32  ┆ list[cat]         ┆ str       │
╞════════════╪══════╪═══════════════════╪═══════════╡
│ John       ┆ 1256 ┆ ["M", "M", … "M"] ┆ Walker    │
│ William    ┆ 1022 ┆ ["M", "M", … "M"] ┆ Few       │
│ James      ┆ 714  ┆ ["M", "M", … "M"] ┆ Armstrong │
│ Thomas     ┆ 453  ┆ ["M", "M", … "M"] ┆ Tucker    │
│ Charles    ┆ 439  ┆ ["M", "M", … "M"] ┆ Carroll   │
└────────────┴──────┴───────────────────┴───────────┘

It's that easy! Let's turn it up a notch.

Conditionals

Let's say we want to know how many delegates of a state are “Pro” or “Anti” administration. We can query that directly in the aggregation without the need for a lambda or grooming the dataframe:

group_by

q = (
    dataset.lazy()
    .group_by("state")
    .agg(
        (pl.col("party") == "Anti-Administration").sum().alias("anti"),
        (pl.col("party") == "Pro-Administration").sum().alias("pro"),
    )
    .sort("pro", descending=True)
    .limit(5)
)

df = q.collect()
print(df)

group_by

let df = dataset
    .clone()
    .lazy()
    .group_by(["state"])
    .agg([
        (col("party").eq(lit("Anti-Administration")))
            .sum()
            .alias("anti"),
        (col("party").eq(lit("Pro-Administration")))
            .sum()
            .alias("pro"),
    ])
    .sort(
        ["pro"],
        SortMultipleOptions::default().with_order_descending(true),
    )
    .limit(5)
    .collect()?;

println!("{}", df);

shape: (5, 3)
┌───────┬──────┬─────┐
│ state ┆ anti ┆ pro │
│ ---   ┆ ---  ┆ --- │
│ cat   ┆ u32  ┆ u32 │
╞═══════╪══════╪═════╡
│ NJ    ┆ 0    ┆ 3   │
│ CT    ┆ 0    ┆ 3   │
│ NC    ┆ 1    ┆ 2   │
│ SC    ┆ 0    ┆ 1   │
│ MA    ┆ 0    ┆ 1   │
└───────┴──────┴─────┘

Filtering

We can also filter the groups. Let's say we want to compute a mean per group, but we don't want to include all values from that group, and we also don't want to actually filter the rows from the dataframe because we need those rows for another aggregation.

In the example below we show how this can be done.

Note

Note that we can define Python functions for clarity. These functions don't cost us anything because they return Polars expressions, we don't apply a custom function over a series during runtime of the query. Of course, you can write functions that return expressions in Rust, too.

group_by

from datetime import date


def compute_age():
    return date.today().year - pl.col("birthday").dt.year()


def avg_birthday(gender: str) -> pl.Expr:
    return (
        compute_age()
        .filter(pl.col("gender") == gender)
        .mean()
        .alias(f"avg {gender} birthday")
    )


q = (
    dataset.lazy()
    .group_by("state")
    .agg(
        avg_birthday("M"),
        avg_birthday("F"),
        (pl.col("gender") == "M").sum().alias("# male"),
        (pl.col("gender") == "F").sum().alias("# female"),
    )
    .limit(5)
)

df = q.collect()
print(df)

group_by

fn compute_age() -> Expr {
    lit(2024) - col("birthday").dt().year()
}

fn avg_birthday(gender: &str) -> Expr {
    compute_age()
        .filter(col("gender").eq(lit(gender)))
        .mean()
        .alias(format!("avg {} birthday", gender))
}

let df = dataset
    .clone()
    .lazy()
    .group_by(["state"])
    .agg([
        avg_birthday("M"),
        avg_birthday("F"),
        (col("gender").eq(lit("M"))).sum().alias("# male"),
        (col("gender").eq(lit("F"))).sum().alias("# female"),
    ])
    .limit(5)
    .collect()?;

println!("{}", df);

shape: (5, 5)
┌───────┬────────────────┬────────────────┬────────┬──────────┐
│ state ┆ avg M birthday ┆ avg F birthday ┆ # male ┆ # female │
│ ---   ┆ ---            ┆ ---            ┆ ---    ┆ ---      │
│ cat   ┆ f64            ┆ f64            ┆ u32    ┆ u32      │
╞═══════╪════════════════╪════════════════╪════════╪══════════╡
│ OH    ┆ 174.672414     ┆ 82.444444      ┆ 673    ┆ 9        │
│ SC    ┆ 186.018349     ┆ 124.8          ┆ 247    ┆ 5        │
│ AL    ┆ 166.772727     ┆ 100.5          ┆ 207    ┆ 4        │
│ VT    ┆ 217.309735     ┆ null           ┆ 116    ┆ 0        │
│ UT    ┆ 127.730769     ┆ 82.0           ┆ 52     ┆ 4        │
└───────┴────────────────┴────────────────┴────────┴──────────┘

Do the average age values look nonsensical? That's because we are working with historical data that dates back to the 1800s and we are doing our computations assuming everyone represented in the dataset is still alive and kicking.

Nested grouping

The two previous queries could have been done with a nested group_by, but that wouldn't have let us show off some of these features. 😉 To do a nested group_by, simply list the columns that will be used for grouping.

First, we use a nested group_by to figure out how many delegates of a state are “Pro” or “Anti” administration:

group_by

q = (
    dataset.lazy()
    .group_by("state", "party")
    .agg(pl.len().alias("count"))
    .filter(
        (pl.col("party") == "Anti-Administration")
        | (pl.col("party") == "Pro-Administration")
    )
    .sort("count", descending=True)
    .limit(5)
)

df = q.collect()
print(df)

group_by

let df = dataset
    .clone()
    .lazy()
    .group_by(["state", "party"])
    .agg([len().alias("count")])
    .filter(
        col("party")
            .eq(lit("Anti-Administration"))
            .or(col("party").eq(lit("Pro-Administration"))),
    )
    .sort(
        ["count"],
        SortMultipleOptions::default()
            .with_order_descending(true)
            .with_nulls_last(true),
    )
    .limit(5)
    .collect()?;

println!("{}", df);

shape: (5, 3)
┌───────┬─────────────────────┬───────┐
│ state ┆ party               ┆ count │
│ ---   ┆ ---                 ┆ ---   │
│ cat   ┆ cat                 ┆ u32   │
╞═══════╪═════════════════════╪═══════╡
│ CT    ┆ Pro-Administration  ┆ 3     │
│ NJ    ┆ Pro-Administration  ┆ 3     │
│ VA    ┆ Anti-Administration ┆ 3     │
│ NC    ┆ Pro-Administration  ┆ 2     │
│ DE    ┆ Anti-Administration ┆ 1     │
└───────┴─────────────────────┴───────┘

Next, we use a nested group_by to compute the average age of delegates per state and per gender:

group_by

q = (
    dataset.lazy()
    .group_by("state", "gender")
    .agg(
        # The function `avg_birthday` is not needed:
        compute_age().mean().alias("avg birthday"),
        pl.len().alias("#"),
    )
    .sort("#", descending=True)
    .limit(5)
)

df = q.collect()
print(df)

group_by

let df = dataset
    .clone()
    .lazy()
    .group_by(["state", "gender"])
    .agg([compute_age().mean().alias("avg birthday"), len().alias("#")])
    .sort(
        ["#"],
        SortMultipleOptions::default()
            .with_order_descending(true)
            .with_nulls_last(true),
    )
    .limit(5)
    .collect()?;

println!("{}", df);

shape: (5, 4)
┌───────┬────────┬──────────────┬──────┐
│ state ┆ gender ┆ avg birthday ┆ #    │
│ ---   ┆ ---    ┆ ---          ┆ ---  │
│ cat   ┆ cat    ┆ f64          ┆ u32  │
╞═══════╪════════╪══════════════╪══════╡
│ NY    ┆ M      ┆ 188.330663   ┆ 1457 │
│ PA    ┆ M      ┆ 182.724846   ┆ 1050 │
│ OH    ┆ M      ┆ 172.260997   ┆ 717  │
│ VA    ┆ M      ┆ 180.890244   ┆ 630  │
│ MA    ┆ M      ┆ 190.47557    ┆ 630  │
└───────┴────────┴──────────────┴──────┘

Note that we get the same results but the format of the data is different. Depending on the situation, one format may be more suitable than the other.

Sorting

It is common to see a dataframe being sorted for the sole purpose of managing the ordering during a grouping operation. Let's say that we want to get the names of the oldest and youngest politicians per state. We could start by sorting and then grouping:

group_by

def get_name() -> pl.Expr:
    return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")


q = (
    dataset.lazy()
    .sort("birthday", descending=True)
    .group_by("state")
    .agg(
        get_name().first().alias("youngest"),
        get_name().last().alias("oldest"),
    )
    .limit(5)
)

df = q.collect()
print(df)

group_by

fn get_name() -> Expr {
    col("first_name") + lit(" ") + col("last_name")
}

let df = dataset
    .clone()
    .lazy()
    .sort(
        ["birthday"],
        SortMultipleOptions::default()
            .with_order_descending(true)
            .with_nulls_last(true),
    )
    .group_by(["state"])
    .agg([
        get_name().first().alias("youngest"),
        get_name().last().alias("oldest"),
    ])
    .limit(5)
    .collect()?;

println!("{}", df);

shape: (5, 3)
┌───────┬─────────────────┬───────────────────┐
│ state ┆ youngest        ┆ oldest            │
│ ---   ┆ ---             ┆ ---               │
│ cat   ┆ str             ┆ str               │
╞═══════╪═════════════════╪═══════════════════╡
│ RI    ┆ James Mason     ┆ William Bradford  │
│ VA    ┆ William Grayson ┆ Robert Rutherford │
│ DK    ┆ George Mathews  ┆ John Todd         │
│ OR    ┆ George Shiel    ┆ Joseph Lane       │
│ WI    ┆ Mike Gallagher  ┆ Henry Dodge       │
└───────┴─────────────────┴───────────────────┘

However, if we also want to sort the names alphabetically, we need to perform an extra sort operation. Luckily, we can sort in a group_by context without changing the sorting of the underlying dataframe:

group_by

q = (
    dataset.lazy()
    .sort("birthday", descending=True)
    .group_by("state")
    .agg(
        get_name().first().alias("youngest"),
        get_name().last().alias("oldest"),
        get_name().sort().first().alias("alphabetical_first"),
    )
    .limit(5)
)

df = q.collect()
print(df)

group_by

let df = dataset
    .clone()
    .lazy()
    .sort(
        ["birthday"],
        SortMultipleOptions::default()
            .with_order_descending(true)
            .with_nulls_last(true),
    )
    .group_by(["state"])
    .agg([
        get_name().first().alias("youngest"),
        get_name().last().alias("oldest"),
        get_name()
            .sort(Default::default())
            .first()
            .alias("alphabetical_first"),
    ])
    .limit(5)
    .collect()?;

println!("{}", df);

shape: (5, 4)
┌───────┬─────────────────────┬─────────────────┬────────────────────┐
│ state ┆ youngest            ┆ oldest          ┆ alphabetical_first │
│ ---   ┆ ---                 ┆ ---             ┆ ---                │
│ cat   ┆ str                 ┆ str             ┆ str                │
╞═══════╪═════════════════════╪═════════════════╪════════════════════╡
│ VT    ┆ Samuel Shaw         ┆ Moses Robinson  ┆ Ahiman Miner       │
│ MO    ┆ Spencer Pettis      ┆ Rufus Easton    ┆ Abram Comingo      │
│ IN    ┆ Waller Taylor       ┆ John Test       ┆ Abraham Brick      │
│ UT    ┆ Mia Love            ┆ John Bernhisel  ┆ Allan Howe         │
│ PR    ┆ Aníbal Acevedo-Vilá ┆ Tulio Larrinaga ┆ Antonio Colorado   │
└───────┴─────────────────────┴─────────────────┴────────────────────┘

We can even sort a column with the order induced by another column, and this also works inside the context group_by. This modification to the previous query lets us check if the delegate with the first name is male or female:

group_by

q = (
    dataset.lazy()
    .sort("birthday", descending=True)
    .group_by("state")
    .agg(
        get_name().first().alias("youngest"),
        get_name().last().alias("oldest"),
        get_name().sort().first().alias("alphabetical_first"),
        pl.col("gender").sort_by(get_name()).first(),
    )
    .sort("state")
    .limit(5)
)

df = q.collect()
print(df)

group_by

let df = dataset
    .clone()
    .lazy()
    .sort(
        ["birthday"],
        SortMultipleOptions::default()
            .with_order_descending(true)
            .with_nulls_last(true),
    )
    .group_by(["state"])
    .agg([
        get_name().first().alias("youngest"),
        get_name().last().alias("oldest"),
        get_name()
            .sort(Default::default())
            .first()
            .alias("alphabetical_first"),
        col("gender")
            .sort_by(["first_name"], SortMultipleOptions::default())
            .first(),
    ])
    .sort(["state"], SortMultipleOptions::default())
    .limit(5)
    .collect()?;

println!("{}", df);

shape: (5, 5)
┌───────┬───────────────────┬───────────────────┬────────────────────┬────────┐
│ state ┆ youngest          ┆ oldest            ┆ alphabetical_first ┆ gender │
│ ---   ┆ ---               ┆ ---               ┆ ---                ┆ ---    │
│ cat   ┆ str               ┆ str               ┆ str                ┆ cat    │
╞═══════╪═══════════════════╪═══════════════════╪════════════════════╪════════╡
│ DE    ┆ Samuel White      ┆ George Read       ┆ Albert Polk        ┆ M      │
│ VA    ┆ William Grayson   ┆ Robert Rutherford ┆ A. McEachin        ┆ M      │
│ SC    ┆ Ralph Izard       ┆ Thomas Sumter     ┆ Abraham Nott       ┆ M      │
│ MD    ┆ Benjamin Contee   ┆ William Smith     ┆ Albert Blakeney    ┆ M      │
│ PA    ┆ Thomas Fitzsimons ┆ Israel Jacobs     ┆ Aaron Kreider      ┆ M      │
└───────┴───────────────────┴───────────────────┴────────────────────┴────────┘

Do not kill parallelization

Python users only

The following section is specific to Python, and doesn't apply to Rust. Within Rust, blocks and closures (lambdas) can, and will, be executed concurrently.

Python is generally slower than Rust. Besides the overhead of running “slow” bytecode, Python has to remain within the constraints of the Global Interpreter Lock (GIL). This means that if you were to use a lambda or a custom Python function to apply during a parallelized phase, Polars' speed is capped running Python code, preventing any multiple threads from executing the function.

Polars will try to parallelize the computation of the aggregating functions over the groups, so it is recommended that you avoid using lambdas and custom Python functions as much as possible. Instead, try to stay within the realm of the Polars expression API. This is not always possible, though, so if you want to learn more about using lambdas you can go the user guide section on using user-defined functions.