Contexts
Polars has developed its own Domain Specific Language (DSL) for transforming data. The language is very easy to use and allows for complex queries that remain human readable. The two core components of the language are Contexts and Expressions, the latter we will cover in the next section.
A context, as implied by the name, refers to the context in which an expression needs to be evaluated. There are three main contexts 1:
- Selection:
df.select(...)
,df.with_columns(...)
- Filtering:
df.filter()
- Group by / Aggregation:
df.group_by(...).agg(...)
The examples below are performed on the following DataFrame
:
df = pl.DataFrame(
{
"nrs": [1, 2, 3, None, 5],
"names": ["foo", "ham", "spam", "egg", None],
"random": np.random.rand(5),
"groups": ["A", "A", "B", "C", "B"],
}
)
print(df)
use rand::{thread_rng, Rng};
let mut arr = [0f64; 5];
thread_rng().fill(&mut arr);
let df = df! (
"nrs" => &[Some(1), Some(2), Some(3), None, Some(5)],
"names" => &[Some("foo"), Some("ham"), Some("spam"), Some("eggs"), None],
"random" => &arr,
"groups" => &["A", "A", "B", "C", "B"],
)?;
println!("{}", &df);
shape: (5, 4)
┌──────┬───────┬──────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str │
╞══════╪═══════╪══════════╪════════╡
│ 1 ┆ foo ┆ 0.154163 ┆ A │
│ 2 ┆ ham ┆ 0.74005 ┆ A │
│ 3 ┆ spam ┆ 0.263315 ┆ B │
│ null ┆ egg ┆ 0.533739 ┆ C │
│ 5 ┆ null ┆ 0.014575 ┆ B │
└──────┴───────┴──────────┴────────┘
Selection
The selection context applies expressions over columns. A select
may produce new columns that are aggregations, combinations of expressions, or literals.
The expressions in a selection context must produce Series
that are all the same length or have a length of 1. Literals are treated as length-1 Series
.
When some expressions produce length-1 Series
and some do not, the length-1 Series
will be broadcast to match the length of the remaining Series
.
Note that broadcasting can also occur within expressions: for instance, in pl.col.value() / pl.col.value.sum()
, each element of the value
column is divided by the column's sum.
out = df.select(
pl.sum("nrs"),
pl.col("names").sort(),
pl.col("names").first().alias("first name"),
(pl.mean("nrs") * 10).alias("10xnrs"),
)
print(out)
let out = df
.clone()
.lazy()
.select([
sum("nrs"),
col("names").sort(Default::default()),
col("names").first().alias("first name"),
(mean("nrs") * lit(10)).alias("10xnrs"),
])
.collect()?;
println!("{}", out);
shape: (5, 4)
┌─────┬───────┬────────────┬────────┐
│ nrs ┆ names ┆ first name ┆ 10xnrs │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ f64 │
╞═════╪═══════╪════════════╪════════╡
│ 11 ┆ null ┆ foo ┆ 27.5 │
│ 11 ┆ egg ┆ foo ┆ 27.5 │
│ 11 ┆ foo ┆ foo ┆ 27.5 │
│ 11 ┆ ham ┆ foo ┆ 27.5 │
│ 11 ┆ spam ┆ foo ┆ 27.5 │
└─────┴───────┴────────────┴────────┘
As you can see from the query, the selection context is very powerful and allows you to evaluate arbitrary expressions independent of (and in parallel to) each other.
Similar to the select
statement, the with_columns
statement also enters into the selection context. The main difference between with_columns
and select
is that with_columns
retains the original columns and adds new ones, whereas select
drops the original columns.
df = df.with_columns(
pl.sum("nrs").alias("nrs_sum"),
pl.col("random").count().alias("count"),
)
print(df)
let out = df
.clone()
.lazy()
.with_columns([
sum("nrs").alias("nrs_sum"),
col("random").count().alias("count"),
])
.collect()?;
println!("{}", out);
shape: (5, 6)
┌──────┬───────┬──────────┬────────┬─────────┬───────┐
│ nrs ┆ names ┆ random ┆ groups ┆ nrs_sum ┆ count │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str ┆ i64 ┆ u32 │
╞══════╪═══════╪══════════╪════════╪═════════╪═══════╡
│ 1 ┆ foo ┆ 0.154163 ┆ A ┆ 11 ┆ 5 │
│ 2 ┆ ham ┆ 0.74005 ┆ A ┆ 11 ┆ 5 │
│ 3 ┆ spam ┆ 0.263315 ┆ B ┆ 11 ┆ 5 │
│ null ┆ egg ┆ 0.533739 ┆ C ┆ 11 ┆ 5 │
│ 5 ┆ null ┆ 0.014575 ┆ B ┆ 11 ┆ 5 │
└──────┴───────┴──────────┴────────┴─────────┴───────┘
Filtering
The filtering context filters a DataFrame
based on one or more expressions that evaluate to the Boolean
data type.
shape: (2, 6)
┌─────┬───────┬──────────┬────────┬─────────┬───────┐
│ nrs ┆ names ┆ random ┆ groups ┆ nrs_sum ┆ count │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str ┆ i64 ┆ u32 │
╞═════╪═══════╪══════════╪════════╪═════════╪═══════╡
│ 3 ┆ spam ┆ 0.263315 ┆ B ┆ 11 ┆ 5 │
│ 5 ┆ null ┆ 0.014575 ┆ B ┆ 11 ┆ 5 │
└─────┴───────┴──────────┴────────┴─────────┴───────┘
Group by / aggregation
In the group_by
context, expressions work on groups and thus may yield results of any length (a group may have many members).
out = df.group_by("groups").agg(
pl.sum("nrs"), # sum nrs by groups
pl.col("random").count().alias("count"), # count group members
# sum random where name != null
pl.col("random").filter(pl.col("names").is_not_null()).sum().name.suffix("_sum"),
pl.col("names").reverse().alias("reversed names"),
)
print(out)
let out = df
.lazy()
.group_by([col("groups")])
.agg([
sum("nrs"), // sum nrs by groups
col("random").count().alias("count"), // count group members
// sum random where name != null
col("random")
.filter(col("names").is_not_null())
.sum()
.name()
.suffix("_sum"),
col("names").reverse().alias("reversed names"),
])
.collect()?;
println!("{}", out);
shape: (3, 5)
┌────────┬─────┬───────┬────────────┬────────────────┐
│ groups ┆ nrs ┆ count ┆ random_sum ┆ reversed names │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u32 ┆ f64 ┆ list[str] │
╞════════╪═════╪═══════╪════════════╪════════════════╡
│ A ┆ 3 ┆ 2 ┆ 0.894213 ┆ ["ham", "foo"] │
│ B ┆ 8 ┆ 2 ┆ 0.263315 ┆ [null, "spam"] │
│ C ┆ 0 ┆ 1 ┆ 0.533739 ┆ ["egg"] │
└────────┴─────┴───────┴────────────┴────────────────┘
As you can see from the result all expressions are applied to the group defined by the group_by
context. Besides the standard group_by
, group_by_dynamic
, and group_by_rolling
are also entrances to the group by context.
-
There are additional List and SQL contexts which are covered later in this guide. But for simplicity, we leave them out of scope for now. ↩