Structs
The data type Struct
is a composite data type that can store multiple fields in a single column.
Python analogy
For Python users, the data type Struct
is kind of like a Python
dictionary. Even better, if you are familiar with Python typing, you can think of the data type
Struct
as typing.TypedDict
.
In this page of the user guide we will see situations in which the data type Struct
arises, we
will understand why it does arise, and we will see how to work with Struct
values.
Let's start with a dataframe that captures the average rating of a few movies across some states in the US:
import polars as pl
ratings = pl.DataFrame(
{
"Movie": ["Cars", "IT", "ET", "Cars", "Up", "IT", "Cars", "ET", "Up", "Cars"],
"Theatre": ["NE", "ME", "IL", "ND", "NE", "SD", "NE", "IL", "IL", "NE"],
"Avg_Rating": [4.5, 4.4, 4.6, 4.3, 4.8, 4.7, 4.5, 4.9, 4.7, 4.6],
"Count": [30, 27, 26, 29, 31, 28, 28, 26, 33, 28],
}
)
print(ratings)
use polars::prelude::*;
let ratings = df!(
"Movie"=> ["Cars", "IT", "ET", "Cars", "Up", "IT", "Cars", "ET", "Up", "Cars"],
"Theatre"=> ["NE", "ME", "IL", "ND", "NE", "SD", "NE", "IL", "IL", "NE"],
"Avg_Rating"=> [4.5, 4.4, 4.6, 4.3, 4.8, 4.7, 4.5, 4.9, 4.7, 4.6],
"Count"=> [30, 27, 26, 29, 31, 28, 28, 26, 33, 28],
)?;
println!("{}", &ratings);
shape: (10, 4)
┌───────┬─────────┬────────────┬───────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ i64 │
╞═══════╪═════════╪════════════╪═══════╡
│ Cars ┆ NE ┆ 4.5 ┆ 30 │
│ IT ┆ ME ┆ 4.4 ┆ 27 │
│ ET ┆ IL ┆ 4.6 ┆ 26 │
│ Cars ┆ ND ┆ 4.3 ┆ 29 │
│ Up ┆ NE ┆ 4.8 ┆ 31 │
│ IT ┆ SD ┆ 4.7 ┆ 28 │
│ Cars ┆ NE ┆ 4.5 ┆ 28 │
│ ET ┆ IL ┆ 4.9 ┆ 26 │
│ Up ┆ IL ┆ 4.7 ┆ 33 │
│ Cars ┆ NE ┆ 4.6 ┆ 28 │
└───────┴─────────┴────────────┴───────┘
Encountering the data type Struct
A common operation that will lead to a Struct
column is the ever so popular value_counts
function that is commonly used in exploratory data analysis. Checking the number of times a state
appears in the data is done as so:
result = ratings.select(pl.col("Theatre").value_counts(sort=True))
print(result)
value_counts
· Available on feature dtype-struct
let result = ratings
.clone()
.lazy()
.select([col("Theatre").value_counts(true, true, "count", false)])
.collect()?;
println!("{}", result);
shape: (5, 1)
┌───────────┐
│ Theatre │
│ --- │
│ struct[2] │
╞═══════════╡
│ {"NE",4} │
│ {"IL",3} │
│ {"ME",1} │
│ {"ND",1} │
│ {"SD",1} │
└───────────┘
Quite unexpected an output, especially if coming from tools that do not have such a data type. We're
not in peril, though. To get back to a more familiar output, all we need to do is use the function
unnest
on the Struct
column:
shape: (5, 2)
┌─────────┬───────┐
│ Theatre ┆ count │
│ --- ┆ --- │
│ str ┆ u32 │
╞═════════╪═══════╡
│ NE ┆ 4 │
│ IL ┆ 3 │
│ ME ┆ 1 │
│ ND ┆ 1 │
│ SD ┆ 1 │
└─────────┴───────┘
The function unnest
will turn each field of the Struct
into its own column.
Why value_counts
returns a Struct
Polars expressions always operate on a single series and return another series.
Struct
is the data type that allows us to provide multiple columns as input to an expression, or to output multiple columns from an expression.
Thus, we can use the data type Struct
to specify each value and its count when we use value_counts
.
Inferring the data type Struct
from dictionaries
When building series or dataframes, Polars will convert dictionaries to the data type Struct
:
rating_series = pl.Series(
"ratings",
[
{"Movie": "Cars", "Theatre": "NE", "Avg_Rating": 4.5},
{"Movie": "Toy Story", "Theatre": "ME", "Avg_Rating": 4.9},
],
)
print(rating_series)
// Don't think we can make it the same way in rust, but this works
let rating_series = df!(
"Movie" => &["Cars", "Toy Story"],
"Theatre" => &["NE", "ME"],
"Avg_Rating" => &[4.5, 4.9],
)?
.into_struct("ratings".into())
.into_series();
println!("{}", &rating_series);
shape: (2,)
Series: 'ratings' [struct[3]]
[
{"Cars","NE",4.5}
{"Toy Story","ME",4.9}
]
The number of fields, their names, and their types, are inferred from the first dictionary seen.
Subsequent incongruences can result in null
values or in errors:
null_rating_series = pl.Series(
"ratings",
[
{"Movie": "Cars", "Theatre": "NE", "Avg_Rating": 4.5},
{"Mov": "Toy Story", "Theatre": "ME", "Avg_Rating": 4.9},
{"Movie": "Snow White", "Theatre": "IL", "Avg_Rating": "4.7"},
],
strict=False, # To show the final structs with `null` values.
)
print(null_rating_series)
// Contribute the Rust translation of the Python example by opening a PR.
shape: (3,)
Series: 'ratings' [struct[4]]
[
{"Cars","NE","4.5",null}
{null,"ME","4.9","Toy Story"}
{"Snow White","IL","4.7",null}
]
Extracting individual values of a Struct
Let's say that we needed to obtain just the field "Movie"
from the Struct
in the series that we
created above. We can use the function field
to do so:
result = rating_series.struct.field("Movie")
print(result)
let result = rating_series.struct_()?.field_by_name("Movie")?;
println!("{}", result);
shape: (2,)
Series: 'Movie' [str]
[
"Cars"
"Toy Story"
]
Renaming individual fields of a Struct
What if we need to rename individual fields of a Struct
column? We use the function
rename_fields
:
result = rating_series.struct.rename_fields(["Film", "State", "Value"])
print(result)
// Contribute the Rust translation of the Python example by opening a PR.
shape: (2,)
Series: 'ratings' [struct[3]]
[
{"Cars","NE",4.5}
{"Toy Story","ME",4.9}
]
To be able to actually see that the field names were change, we will create a dataframe where the
only column is the result and then we use the function unnest
so that each field becomes its own
column. The column names will reflect the renaming operation we just did:
print(
result.to_frame().unnest("ratings"),
)
// Contribute the Rust translation of the Python example by opening a PR.
shape: (2, 3)
┌───────────┬───────┬───────┐
│ Film ┆ State ┆ Value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════════╪═══════╪═══════╡
│ Cars ┆ NE ┆ 4.5 │
│ Toy Story ┆ ME ┆ 4.9 │
└───────────┴───────┴───────┘
Practical use-cases of Struct
columns
Identifying duplicate rows
Let's get back to the ratings
data. We want to identify cases where there are duplicates at a
“Movie” and “Theatre” level.
This is where the data type Struct
shines:
result = ratings.filter(pl.struct("Movie", "Theatre").is_duplicated())
print(result)
is_duplicated
· Struct
· Available on feature dtype-struct
// Contribute the Rust translation of the Python example by opening a PR.
shape: (5, 4)
┌───────┬─────────┬────────────┬───────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ i64 │
╞═══════╪═════════╪════════════╪═══════╡
│ Cars ┆ NE ┆ 4.5 ┆ 30 │
│ ET ┆ IL ┆ 4.6 ┆ 26 │
│ Cars ┆ NE ┆ 4.5 ┆ 28 │
│ ET ┆ IL ┆ 4.9 ┆ 26 │
│ Cars ┆ NE ┆ 4.6 ┆ 28 │
└───────┴─────────┴────────────┴───────┘
We can identify the unique cases at this level also with is_unique
!
Multi-column ranking
Suppose, given that we know there are duplicates, we want to choose which rating gets a higher priority. We can say that the column “Count” is the most important, and if there is a tie in the column “Count” then we consider the column “Avg_Rating”.
We can then do:
result = ratings.with_columns(
pl.struct("Count", "Avg_Rating")
.rank("dense", descending=True)
.over("Movie", "Theatre")
.alias("Rank")
).filter(pl.struct("Movie", "Theatre").is_duplicated())
print(result)
is_duplicated
· Struct
· Available on feature dtype-struct
let result = ratings
.clone()
.lazy()
.with_columns([as_struct(vec![col("Count"), col("Avg_Rating")])
.rank(
RankOptions {
method: RankMethod::Dense,
descending: true,
},
None,
)
.over([col("Movie"), col("Theatre")])
.alias("Rank")])
// .filter(as_struct(&[col("Movie"), col("Theatre")]).is_duplicated())
// Error: .is_duplicated() not available if you try that
// https://github.com/pola-rs/polars/issues/3803
.filter(len().over([col("Movie"), col("Theatre")]).gt(lit(1)))
.collect()?;
println!("{}", result);
shape: (5, 5)
┌───────┬─────────┬────────────┬───────┬──────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count ┆ Rank │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ i64 ┆ u32 │
╞═══════╪═════════╪════════════╪═══════╪══════╡
│ Cars ┆ NE ┆ 4.5 ┆ 30 ┆ 1 │
│ ET ┆ IL ┆ 4.6 ┆ 26 ┆ 2 │
│ Cars ┆ NE ┆ 4.5 ┆ 28 ┆ 3 │
│ ET ┆ IL ┆ 4.9 ┆ 26 ┆ 1 │
│ Cars ┆ NE ┆ 4.6 ┆ 28 ┆ 2 │
└───────┴─────────┴────────────┴───────┴──────┘
That's a pretty complex set of requirements done very elegantly in Polars! To learn more about the
function over
, used above, see the user guide section on window functions.
Using multiple columns in a single expression
As mentioned earlier, the data type Struct
is also useful if you need to pass multiple columns as
input to an expression. As an example, suppose we want to compute
the Ackermann function on two columns of a
dataframe. There is no way of composing Polars expressions to compute the Ackermann function1, so
we define a custom function:
def ack(m, n):
if not m:
return n + 1
if not n:
return ack(m - 1, 1)
return ack(m - 1, ack(m, n - 1))
// Contribute the Rust translation of the Python example by opening a PR.
Now, to compute the values of the Ackermann function on those arguments, we start by creating a
Struct
with fields m
and n
and then use the function map_elements
to apply the function
ack
to each value:
values = pl.DataFrame(
{
"m": [0, 0, 0, 1, 1, 1, 2],
"n": [2, 3, 4, 1, 2, 3, 1],
}
)
result = values.with_columns(
pl.struct(["m", "n"])
.map_elements(lambda s: ack(s["m"], s["n"]), return_dtype=pl.Int64)
.alias("ack")
)
print(result)
// Contribute the Rust translation of the Python example by opening a PR.
shape: (7, 3)
┌─────┬─────┬─────┐
│ m ┆ n ┆ ack │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 0 ┆ 2 ┆ 3 │
│ 0 ┆ 3 ┆ 4 │
│ 0 ┆ 4 ┆ 5 │
│ 1 ┆ 1 ┆ 3 │
│ 1 ┆ 2 ┆ 4 │
│ 1 ┆ 3 ┆ 5 │
│ 2 ┆ 1 ┆ 5 │
└─────┴─────┴─────┘
-
To say that something cannot be done is quite a bold claim. If you prove us wrong, please let us know! ↩