The Struct datatype
Polars Structs are the idiomatic way of working with multiple columns. Moving columns into a Struct is also a free operation, i.e. it does not copy any data!
For this section, let's start with a DataFrame that captures the average rating of a few movies across some states in the U.S.:
import polars as pl

ratings = pl.DataFrame(
    {
        "Movie": ["Cars", "IT", "ET", "Cars", "Up", "IT", "Cars", "ET", "Up", "ET"],
        "Theatre": ["NE", "ME", "IL", "ND", "NE", "SD", "NE", "IL", "IL", "SD"],
        "Avg_Rating": [4.5, 4.4, 4.6, 4.3, 4.8, 4.7, 4.7, 4.9, 4.7, 4.6],
        "Count": [30, 27, 26, 29, 31, 28, 28, 26, 33, 26],
    }
)
print(ratings)

use polars::prelude::*;

let ratings = df!(
    "Movie" => &["Cars", "IT", "ET", "Cars", "Up", "IT", "Cars", "ET", "Up", "ET"],
    "Theatre" => &["NE", "ME", "IL", "ND", "NE", "SD", "NE", "IL", "IL", "SD"],
    "Avg_Rating" => &[4.5, 4.4, 4.6, 4.3, 4.8, 4.7, 4.7, 4.9, 4.7, 4.6],
    "Count" => &[30, 27, 26, 29, 31, 28, 28, 26, 33, 26],
)?;
println!("{}", &ratings);
shape: (10, 4)
┌───────┬─────────┬────────────┬───────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ i64 │
╞═══════╪═════════╪════════════╪═══════╡
│ Cars ┆ NE ┆ 4.5 ┆ 30 │
│ IT ┆ ME ┆ 4.4 ┆ 27 │
│ ET ┆ IL ┆ 4.6 ┆ 26 │
│ Cars ┆ ND ┆ 4.3 ┆ 29 │
│ Up ┆ NE ┆ 4.8 ┆ 31 │
│ IT ┆ SD ┆ 4.7 ┆ 28 │
│ Cars ┆ NE ┆ 4.7 ┆ 28 │
│ ET ┆ IL ┆ 4.9 ┆ 26 │
│ Up ┆ IL ┆ 4.7 ┆ 33 │
│ ET ┆ SD ┆ 4.6 ┆ 26 │
└───────┴─────────┴────────────┴───────┘
Encountering the Struct type
A common operation that leads to a Struct column is the ever-popular value_counts function, which is commonly used in exploratory data analysis. To check how many times each state appears in the data:
out = ratings.select(pl.col("Theatre").value_counts(sort=True))
print(out)
value_counts · Available on feature dtype-struct
let out = ratings
    .clone()
    .lazy()
    .select([col("Theatre").value_counts(true, true, "count", false)])
    .collect()?;
println!("{}", &out);
shape: (5, 1)
┌───────────┐
│ Theatre │
│ --- │
│ struct[2] │
╞═══════════╡
│ {"NE",3} │
│ {"IL",3} │
│ {"SD",2} │
│ {"ME",1} │
│ {"ND",1} │
└───────────┘
This output can be quite unexpected, especially when coming from tools that do not have such a data type. There is no cause for alarm, though: to get back to a more familiar output, all we need to do is unnest the Struct column into its constituent columns:
shape: (5, 2)
┌─────────┬───────┐
│ Theatre ┆ count │
│ --- ┆ --- │
│ str ┆ u32 │
╞═════════╪═══════╡
│ NE ┆ 3 │
│ IL ┆ 3 │
│ SD ┆ 2 │
│ ME ┆ 1 │
│ ND ┆ 1 │
└─────────┴───────┘
Why value_counts returns a Struct
Polars expressions always have a Fn(Series) -> Series signature, and Struct is thus the data type that allows us to provide multiple columns as input/output of an expression. In other words, all expressions have to return a Series object, and Struct allows us to stay consistent with that requirement.
Structs as dicts
Polars will interpret a dict sent to the Series constructor as a Struct:
rating_series = pl.Series(
    "ratings",
    [
        {"Movie": "Cars", "Theatre": "NE", "Avg_Rating": 4.5},
        {"Movie": "Toy Story", "Theatre": "ME", "Avg_Rating": 4.9},
    ],
)
print(rating_series)

// There does not seem to be a direct equivalent in Rust, but this works
let rating_series = df!(
    "Movie" => &["Cars", "Toy Story"],
    "Theatre" => &["NE", "ME"],
    "Avg_Rating" => &[4.5, 4.9],
)?
.into_struct("ratings".into())
.into_series();
println!("{}", &rating_series);
shape: (2,)
Series: 'ratings' [struct[3]]
[
{"Cars","NE",4.5}
{"Toy Story","ME",4.9}
]
Constructing Series objects
Note that the Series here was constructed with the name of the series first, followed by the values. Providing the values first is considered an anti-pattern in Polars and must be avoided.
Extracting individual values of a Struct
Let's say that we need to obtain just the Movie value from the Series that we created above. We can use the field method to do so:
out = rating_series.struct.field("Movie")
print(out)
let out = rating_series.struct_()?.field_by_name("Movie")?;
println!("{}", &out);
shape: (2,)
Series: 'Movie' [str]
[
"Cars"
"Toy Story"
]
Renaming individual keys of a Struct
What if we need to rename individual fields of a Struct column? We first convert the rating_series object to a DataFrame so that we can view the changes easily, and then use the rename_fields method:
out = (
    rating_series.to_frame()
    .select(pl.col("ratings").struct.rename_fields(["Film", "State", "Value"]))
    .unnest("ratings")
)
print(out)

let out = DataFrame::new([rating_series.into_column()].into())?
    .lazy()
    .select([col("ratings")
        .struct_()
        .rename_fields(["Film", "State", "Value"].to_vec())])
    .unnest(["ratings"])
    .collect()?;
println!("{}", &out);
shape: (2, 3)
┌───────────┬───────┬───────┐
│ Film ┆ State ┆ Value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════════╪═══════╪═══════╡
│ Cars ┆ NE ┆ 4.5 │
│ Toy Story ┆ ME ┆ 4.9 │
└───────────┴───────┴───────┘
Practical use-cases of Struct columns
Identifying duplicate rows
Let's get back to the ratings data. We want to identify cases where there are duplicates at the Movie and Theatre level. This is where the Struct datatype shines:
out = ratings.filter(pl.struct("Movie", "Theatre").is_duplicated())
print(out)
is_duplicated · Struct · Available on feature dtype-struct
let out = ratings
    .clone()
    .lazy()
    // .filter(as_struct(&[col("Movie"), col("Theatre")]).is_duplicated())
    // Error: .is_duplicated() not available if you try that
    // https://github.com/pola-rs/polars/issues/3803
    .filter(len().over([col("Movie"), col("Theatre")]).gt(lit(1)))
    .collect()?;
println!("{}", &out);
shape: (4, 4)
┌───────┬─────────┬────────────┬───────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ i64 │
╞═══════╪═════════╪════════════╪═══════╡
│ Cars ┆ NE ┆ 4.5 ┆ 30 │
│ ET ┆ IL ┆ 4.6 ┆ 26 │
│ Cars ┆ NE ┆ 4.7 ┆ 28 │
│ ET ┆ IL ┆ 4.9 ┆ 26 │
└───────┴─────────┴────────────┴───────┘
We can also identify the unique cases at this level with is_unique!
Multi-column ranking
Suppose that, given that we know there are duplicates, we want to decide which of the duplicated rows gets a higher priority. We define the Count of ratings to be more important than the Avg_Rating itself, and only use Avg_Rating to break ties. We can then do:
out = ratings.with_columns(
    pl.struct("Count", "Avg_Rating")
    .rank("dense", descending=True)
    .over("Movie", "Theatre")
    .alias("Rank")
).filter(pl.struct("Movie", "Theatre").is_duplicated())
print(out)
is_duplicated · Struct · Available on feature dtype-struct
let out = ratings
    .clone()
    .lazy()
    .with_columns([as_struct(vec![col("Count"), col("Avg_Rating")])
        .rank(
            RankOptions {
                method: RankMethod::Dense,
                // descending, to match the Python example and the output below
                descending: true,
            },
            None,
        )
        .over([col("Movie"), col("Theatre")])
        .alias("Rank")])
    // .filter(as_struct(&[col("Movie"), col("Theatre")]).is_duplicated())
    // Error: .is_duplicated() not available if you try that
    // https://github.com/pola-rs/polars/issues/3803
    .filter(len().over([col("Movie"), col("Theatre")]).gt(lit(1)))
    .collect()?;
println!("{}", &out);
shape: (4, 5)
┌───────┬─────────┬────────────┬───────┬──────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count ┆ Rank │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ i64 ┆ u32 │
╞═══════╪═════════╪════════════╪═══════╪══════╡
│ Cars ┆ NE ┆ 4.5 ┆ 30 ┆ 1 │
│ ET ┆ IL ┆ 4.6 ┆ 26 ┆ 2 │
│ Cars ┆ NE ┆ 4.7 ┆ 28 ┆ 2 │
│ ET ┆ IL ┆ 4.9 ┆ 26 ┆ 1 │
└───────┴─────────┴────────────┴───────┴──────┘
That's a pretty complex set of requirements done very elegantly in Polars!
Using multi-column apply
This was discussed in the previous section on User Defined Functions for the Python case. Here's an example of doing so with both Python and Rust:
df = pl.DataFrame({"keys": ["a", "a", "b"], "values": [10, 7, 1]})
out = df.select(
    pl.struct(["keys", "values"])
    .map_elements(lambda x: len(x["keys"]) + x["values"], return_dtype=pl.Int64)
    .alias("solution_map_elements"),
    (pl.col("keys").str.len_bytes() + pl.col("values")).alias("solution_expr"),
)
print(out)
let df = df!(
    "keys" => &["a", "a", "b"],
    "values" => &[10, 7, 1],
)?;
let out = df
    .lazy()
    .select([
        // pack to struct to get access to multiple fields in a custom `apply/map`
        as_struct(vec![col("keys"), col("values")])
            // we will compute the len(a) + b
            .apply(
                |s| {
                    // downcast to struct
                    let ca = s.struct_()?;
                    // get the fields as Series
                    let s_a = &ca.fields_as_series()[0];
                    let s_b = &ca.fields_as_series()[1];
                    // downcast the `Series` to their known type
                    let ca_a = s_a.str()?;
                    let ca_b = s_b.i32()?;
                    // iterate both `ChunkedArrays`
                    let out: Int32Chunked = ca_a
                        .into_iter()
                        .zip(ca_b)
                        .map(|(opt_a, opt_b)| match (opt_a, opt_b) {
                            (Some(a), Some(b)) => Some(a.len() as i32 + b),
                            _ => None,
                        })
                        .collect();
                    Ok(Some(out.into_column()))
                },
                GetOutput::from_type(DataType::Int32),
            )
            // note: the `'solution_map_elements'` alias is just there to show how you
            // get the same output as in the Python API example.
            .alias("solution_map_elements"),
        // the regex `.` matches every character, so with `literal = false`
        // this counts the characters in `keys`
        (col("keys").str().count_matches(lit("."), false) + col("values"))
            .alias("solution_expr"),
    ])
    .collect()?;
println!("{}", out);
shape: (3, 2)
┌───────────────────────┬───────────────┐
│ solution_map_elements ┆ solution_expr │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═══════════════════════╪═══════════════╡
│ 11 ┆ 11 │
│ 8 ┆ 8 │
│ 2 ┆ 2 │
└───────────────────────┴───────────────┘