Aggregation
Polars implements a powerful syntax defined not only in its lazy API, but also in its eager API. Let's take a look at what that means.
We can start with the simple US congress dataset
.
url = "https://theunitedstates.io/congress-legislators/legislators-historical.csv"
schema_overrides = {
"first_name": pl.Categorical,
"gender": pl.Categorical,
"type": pl.Categorical,
"state": pl.Categorical,
"party": pl.Categorical,
}
dataset = pl.read_csv(url, schema_overrides=schema_overrides).with_columns(
pl.col("birthday").str.to_date(strict=False)
)
DataFrame
· Categorical
· Available on feature dtype-categorical
use std::io::Cursor;
use reqwest::blocking::Client;
let url = "https://theunitedstates.io/congress-legislators/legislators-historical.csv";
let mut schema = Schema::default();
schema.with_column(
"first_name".into(),
DataType::Categorical(None, Default::default()),
);
schema.with_column(
"gender".into(),
DataType::Categorical(None, Default::default()),
);
schema.with_column(
"type".into(),
DataType::Categorical(None, Default::default()),
);
schema.with_column(
"state".into(),
DataType::Categorical(None, Default::default()),
);
schema.with_column(
"party".into(),
DataType::Categorical(None, Default::default()),
);
schema.with_column("birthday".into(), DataType::Date);
let data: Vec<u8> = Client::new().get(url).send()?.text()?.bytes().collect();
let dataset = CsvReadOptions::default()
.with_has_header(true)
.with_schema(Some(Arc::new(schema)))
.map_parse_options(|parse_options| parse_options.with_try_parse_dates(true))
.into_reader_with_file_handle(Cursor::new(data))
.finish()?;
println!("{}", &dataset);
Basic aggregations
You can easily combine different aggregations by adding multiple expressions in a
list
. There is no upper bound on the number of aggregations you can do, and you can
make any combination you want. In the snippet below we do the following aggregations:
Per GROUP "first_name"
we
- count the number of rows in the group:
- full form:
pl.len()
- full form:
- combine the values of gender into a list by omitting an aggregate function:
- full form:
pl.col("gender")
- full form:
- get the first value of column
"last_name"
in the group:- short form:
pl.first("last_name")
(not available in Rust) - full form:
pl.col("last_name").first()
- short form:
Besides the aggregation, we immediately sort the result and limit to the top 5
so that
we have a nice summary overview.
q = (
dataset.lazy()
.group_by("first_name")
.agg(
pl.len(),
pl.col("gender"),
pl.first("last_name"),
)
.sort("len", descending=True)
.limit(5)
)
df = q.collect()
print(df)
let df = dataset
.clone()
.lazy()
.group_by(["first_name"])
.agg([len(), col("gender"), col("last_name").first()])
.sort(
["len"],
SortMultipleOptions::default()
.with_order_descending(true)
.with_nulls_last(true),
)
.limit(5)
.collect()?;
println!("{}", df);
shape: (5, 4)
┌────────────┬──────┬───────────────────┬───────────┐
│ first_name ┆ len ┆ gender ┆ last_name │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ u32 ┆ list[cat] ┆ str │
╞════════════╪══════╪═══════════════════╪═══════════╡
│ John ┆ 1256 ┆ ["M", "M", … "M"] ┆ Walker │
│ William ┆ 1022 ┆ ["M", "M", … "M"] ┆ Few │
│ James ┆ 714 ┆ ["M", "M", … "M"] ┆ Armstrong │
│ Thomas ┆ 453 ┆ ["M", "M", … "M"] ┆ Tucker │
│ Charles ┆ 439 ┆ ["M", "M", … "M"] ┆ Carroll │
└────────────┴──────┴───────────────────┴───────────┘
Conditionals
It's that easy! Let's turn it up a notch. Let's say we want to know how
many delegates of a "state" are "Pro" or "Anti" administration. We could directly query
that in the aggregation without the need of a lambda
or grooming the DataFrame
.
q = (
dataset.lazy()
.group_by("state")
.agg(
(pl.col("party") == "Anti-Administration").sum().alias("anti"),
(pl.col("party") == "Pro-Administration").sum().alias("pro"),
)
.sort("pro", descending=True)
.limit(5)
)
df = q.collect()
print(df)
let df = dataset
.clone()
.lazy()
.group_by(["state"])
.agg([
(col("party").eq(lit("Anti-Administration")))
.sum()
.alias("anti"),
(col("party").eq(lit("Pro-Administration")))
.sum()
.alias("pro"),
])
.sort(
["pro"],
SortMultipleOptions::default().with_order_descending(true),
)
.limit(5)
.collect()?;
println!("{}", df);
shape: (5, 3)
┌───────┬──────┬─────┐
│ state ┆ anti ┆ pro │
│ --- ┆ --- ┆ --- │
│ cat ┆ u32 ┆ u32 │
╞═══════╪══════╪═════╡
│ NJ ┆ 0 ┆ 3 │
│ CT ┆ 0 ┆ 3 │
│ NC ┆ 1 ┆ 2 │
│ MA ┆ 0 ┆ 1 │
│ VA ┆ 3 ┆ 1 │
└───────┴──────┴─────┘
Similarly, this could also be done with a nested GROUP BY, but that doesn't help show off some of these nice features. 😉
q = (
dataset.lazy()
.group_by("state", "party")
.agg(pl.count("party").alias("count"))
.filter(
(pl.col("party") == "Anti-Administration")
| (pl.col("party") == "Pro-Administration")
)
.sort("count", descending=True)
.limit(5)
)
df = q.collect()
print(df)
let df = dataset
.clone()
.lazy()
.group_by(["state", "party"])
.agg([col("party").count().alias("count")])
.filter(
col("party")
.eq(lit("Anti-Administration"))
.or(col("party").eq(lit("Pro-Administration"))),
)
.sort(
["count"],
SortMultipleOptions::default()
.with_order_descending(true)
.with_nulls_last(true),
)
.limit(5)
.collect()?;
println!("{}", df);
shape: (5, 3)
┌───────┬─────────────────────┬───────┐
│ state ┆ party ┆ count │
│ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 │
╞═══════╪═════════════════════╪═══════╡
│ VA ┆ Anti-Administration ┆ 3 │
│ NJ ┆ Pro-Administration ┆ 3 │
│ CT ┆ Pro-Administration ┆ 3 │
│ NC ┆ Pro-Administration ┆ 2 │
│ GA ┆ Anti-Administration ┆ 1 │
└───────┴─────────────────────┴───────┘
Filtering
We can also filter the groups. Let's say we want to compute a mean per group, but we
don't want to include all values from that group, and we also don't want to filter the
rows from the DataFrame
(because we need those rows for another aggregation).
In the example below we show how this can be done.
Note
Note that we can make Python functions for clarity. These functions don't cost us anything. That is because we only create Polars expressions, we don't apply a custom function over a Series
during runtime of the query. Of course, you can make functions that return expressions in Rust, too.
from datetime import date
def compute_age():
return date.today().year - pl.col("birthday").dt.year()
def avg_birthday(gender: str) -> pl.Expr:
return (
compute_age()
.filter(pl.col("gender") == gender)
.mean()
.alias(f"avg {gender} birthday")
)
q = (
dataset.lazy()
.group_by("state")
.agg(
avg_birthday("M"),
avg_birthday("F"),
(pl.col("gender") == "M").sum().alias("# male"),
(pl.col("gender") == "F").sum().alias("# female"),
)
.limit(5)
)
df = q.collect()
print(df)
fn compute_age() -> Expr {
lit(2022) - col("birthday").dt().year()
}
fn avg_birthday(gender: &str) -> Expr {
compute_age()
.filter(col("gender").eq(lit(gender)))
.mean()
.alias(format!("avg {} birthday", gender))
}
let df = dataset
.clone()
.lazy()
.group_by(["state"])
.agg([
avg_birthday("M"),
avg_birthday("F"),
(col("gender").eq(lit("M"))).sum().alias("# male"),
(col("gender").eq(lit("F"))).sum().alias("# female"),
])
.limit(5)
.collect()?;
println!("{}", df);
shape: (5, 5)
┌───────┬────────────────┬────────────────┬────────┬──────────┐
│ state ┆ avg M birthday ┆ avg F birthday ┆ # male ┆ # female │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ f64 ┆ f64 ┆ u32 ┆ u32 │
╞═══════╪════════════════╪════════════════╪════════╪══════════╡
│ AK ┆ 123.411765 ┆ null ┆ 17 ┆ 0 │
│ MT ┆ 138.921569 ┆ 144.0 ┆ 51 ┆ 1 │
│ OK ┆ 122.934066 ┆ 96.0 ┆ 91 ┆ 3 │
│ KY ┆ 186.360704 ┆ 106.0 ┆ 372 ┆ 2 │
│ TN ┆ 178.949091 ┆ 112.6 ┆ 297 ┆ 5 │
└───────┴────────────────┴────────────────┴────────┴──────────┘
Sorting
It's common to see a DataFrame
being sorted for the sole purpose of managing the ordering during a GROUP BY operation. Let's say that we want to get the names of the oldest and youngest politicians per state. We could SORT and GROUP BY.
def get_person() -> pl.Expr:
return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")
q = (
dataset.lazy()
.sort("birthday", descending=True)
.group_by("state")
.agg(
get_person().first().alias("youngest"),
get_person().last().alias("oldest"),
)
.limit(5)
)
df = q.collect()
print(df)
fn get_person() -> Expr {
col("first_name") + lit(" ") + col("last_name")
}
let df = dataset
.clone()
.lazy()
.sort(
["birthday"],
SortMultipleOptions::default()
.with_order_descending(true)
.with_nulls_last(true),
)
.group_by(["state"])
.agg([
get_person().first().alias("youngest"),
get_person().last().alias("oldest"),
])
.limit(5)
.collect()?;
println!("{}", df);
shape: (5, 3)
┌───────┬───────────────┬──────────────────┐
│ state ┆ youngest ┆ oldest │
│ --- ┆ --- ┆ --- │
│ cat ┆ str ┆ str │
╞═══════╪═══════════════╪══════════════════╡
│ NC ┆ John Ashe ┆ Samuel Johnston │
│ OK ┆ Kendra Horn ┆ David Harvey │
│ TX ┆ John Cranford ┆ Timothy Pilsbury │
│ OH ┆ John Smith ┆ Paul Fearing │
│ TN ┆ William Cocke ┆ John Sevier │
└───────┴───────────────┴──────────────────┘
However, if we also want to sort the names alphabetically, this breaks. Luckily we can sort in a group_by
context separate from the DataFrame
.
def get_person() -> pl.Expr:
return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")
q = (
dataset.lazy()
.sort("birthday", descending=True)
.group_by("state")
.agg(
get_person().first().alias("youngest"),
get_person().last().alias("oldest"),
get_person().sort().first().alias("alphabetical_first"),
)
.limit(5)
)
df = q.collect()
print(df)
let df = dataset
.clone()
.lazy()
.sort(
["birthday"],
SortMultipleOptions::default()
.with_order_descending(true)
.with_nulls_last(true),
)
.group_by(["state"])
.agg([
get_person().first().alias("youngest"),
get_person().last().alias("oldest"),
get_person()
.sort(Default::default())
.first()
.alias("alphabetical_first"),
])
.limit(5)
.collect()?;
println!("{}", df);
shape: (5, 4)
┌───────┬───────────────────┬─────────────────┬────────────────────┐
│ state ┆ youngest ┆ oldest ┆ alphabetical_first │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ str ┆ str ┆ str │
╞═══════╪═══════════════════╪═════════════════╪════════════════════╡
│ MT ┆ Greg Gianforte ┆ James Cavanaugh ┆ Albert Campbell │
│ AL ┆ John McKee ┆ Israel Pickens ┆ Albert Goodwyn │
│ VI ┆ Donna Christensen ┆ Melvin Evans ┆ Donna Christensen │
│ AK ┆ Mark Begich ┆ Thomas Cale ┆ Anthony Dimond │
│ VT ┆ Samuel Shaw ┆ Moses Robinson ┆ Ahiman Miner │
└───────┴───────────────────┴─────────────────┴────────────────────┘
We can even sort by another column in the group_by
context. If we want to know if the alphabetically sorted name is male or female we could add: pl.col("gender").sort_by(get_person()).first()
def get_person() -> pl.Expr:
return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")
q = (
dataset.lazy()
.sort("birthday", descending=True)
.group_by("state")
.agg(
get_person().first().alias("youngest"),
get_person().last().alias("oldest"),
get_person().sort().first().alias("alphabetical_first"),
pl.col("gender").sort_by(get_person()).first(),
)
.sort("state")
.limit(5)
)
df = q.collect()
print(df)
let df = dataset
.clone()
.lazy()
.sort(
["birthday"],
SortMultipleOptions::default()
.with_order_descending(true)
.with_nulls_last(true),
)
.group_by(["state"])
.agg([
get_person().first().alias("youngest"),
get_person().last().alias("oldest"),
get_person()
.sort(Default::default())
.first()
.alias("alphabetical_first"),
col("gender")
.sort_by(["first_name"], SortMultipleOptions::default())
.first()
.alias("gender"),
])
.sort(["state"], SortMultipleOptions::default())
.limit(5)
.collect()?;
println!("{}", df);
shape: (5, 5)
┌───────┬───────────────────┬─────────────────┬────────────────────┬────────┐
│ state ┆ youngest ┆ oldest ┆ alphabetical_first ┆ gender │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ str ┆ str ┆ str ┆ cat │
╞═══════╪═══════════════════╪═════════════════╪════════════════════╪════════╡
│ NC ┆ John Ashe ┆ Samuel Johnston ┆ Abraham Rencher ┆ M │
│ PA ┆ Thomas Fitzsimons ┆ Israel Jacobs ┆ Aaron Kreider ┆ M │
│ IN ┆ Waller Taylor ┆ John Test ┆ Abraham Brick ┆ M │
│ CT ┆ Henry Edwards ┆ Roger Sherman ┆ Abner Sibal ┆ M │
│ CA ┆ Edward Gilbert ┆ William Gwin ┆ Aaron Sargent ┆ M │
└───────┴───────────────────┴─────────────────┴────────────────────┴────────┘
Do not kill parallelization
Python Users Only
The following section is specific to Python, and doesn't apply to Rust. Within Rust, blocks and closures (lambdas) can, and will, be executed concurrently.
We have all heard that Python is slow, and does "not scale." Besides the overhead of
running "slow" bytecode, Python has to remain within the constraints of the Global
Interpreter Lock (GIL). This means that if you were to use a lambda
or a custom Python
function to apply during a parallelized phase, Polars speed is capped running Python
code preventing any multiple threads from executing the function.
This all feels terribly limiting, especially because we often need those lambda
functions in a
.group_by()
step, for example. This approach is still supported by Polars, but
keeping in mind bytecode and the GIL costs have to be paid. It is recommended to try to solve your queries using the expression syntax before moving to lambdas
. If you want to learn more about using lambdas
, go to the user defined functions section.
Conclusion
In the examples above we've seen that we can do a lot by combining expressions. By doing so we delay the use of custom Python functions that slow down the queries (by the slow nature of Python AND the GIL).
If we are missing a type expression let us know by opening a feature request!