The Struct datatype

Polars Structs are the idiomatic way of working with multiple columns. Packing columns into a Struct is also a free operation, i.e. moving columns into a Struct does not copy any data!

For this section, let's start with a DataFrame that captures the average rating of a few movies across some states in the U.S.:

Python Rust

DataFrame

ratings = pl.DataFrame(
    {
        "Movie": ["Cars", "IT", "ET", "Cars", "Up", "IT", "Cars", "ET", "Up", "ET"],
        "Theatre": ["NE", "ME", "IL", "ND", "NE", "SD", "NE", "IL", "IL", "SD"],
        "Avg_Rating": [4.5, 4.4, 4.6, 4.3, 4.8, 4.7, 4.7, 4.9, 4.7, 4.6],
        "Count": [30, 27, 26, 29, 31, 28, 28, 26, 33, 26],
    }
)
print(ratings)

DataFrame

let ratings = df!(
    "Movie" => &["Cars", "IT", "ET", "Cars", "Up", "IT", "Cars", "ET", "Up", "ET"],
    "Theatre" => &["NE", "ME", "IL", "ND", "NE", "SD", "NE", "IL", "IL", "SD"],
    "Avg_Rating" => &[4.5, 4.4, 4.6, 4.3, 4.8, 4.7, 4.7, 4.9, 4.7, 4.6],
    "Count" => &[30, 27, 26, 29, 31, 28, 28, 26, 33, 26],
)?;
println!("{}", &ratings);

shape: (10, 4)
┌───────┬─────────┬────────────┬───────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count │
│ ---   ┆ ---     ┆ ---        ┆ ---   │
│ str   ┆ str     ┆ f64        ┆ i64   │
╞═══════╪═════════╪════════════╪═══════╡
│ Cars  ┆ NE      ┆ 4.5        ┆ 30    │
│ IT    ┆ ME      ┆ 4.4        ┆ 27    │
│ ET    ┆ IL      ┆ 4.6        ┆ 26    │
│ Cars  ┆ ND      ┆ 4.3        ┆ 29    │
│ Up    ┆ NE      ┆ 4.8        ┆ 31    │
│ IT    ┆ SD      ┆ 4.7        ┆ 28    │
│ Cars  ┆ NE      ┆ 4.7        ┆ 28    │
│ ET    ┆ IL      ┆ 4.9        ┆ 26    │
│ Up    ┆ IL      ┆ 4.7        ┆ 33    │
│ ET    ┆ SD      ┆ 4.6        ┆ 26    │
└───────┴─────────┴────────────┴───────┘

Encountering the `Struct` type

A common operation that produces a Struct column is the popular value_counts function, which is often used in exploratory data analysis. Counting the number of times each state appears in the data is done as follows:

Python Rust

value_counts

out = ratings.select(pl.col("Theatre").value_counts(sort=True))
print(out)

value_counts · Available on feature dtype-struct

let out = ratings
    .clone()
    .lazy()
    .select([col("Theatre").value_counts(true, true, "count".to_string(), false)])
    .collect()?;
println!("{}", &out);

shape: (5, 1)
┌───────────┐
│ Theatre   │
│ ---       │
│ struct[2] │
╞═══════════╡
│ {"NE",3}  │
│ {"IL",3}  │
│ {"SD",2}  │
│ {"ME",1}  │
│ {"ND",1}  │
└───────────┘

This output can be quite unexpected, especially when coming from tools that do not have such a data type. There is no cause for alarm, though: to get back to a more familiar output, all we need to do is unnest the Struct column into its constituent columns:

Python Rust

unnest

out = ratings.select(pl.col("Theatre").value_counts(sort=True)).unnest("Theatre")
print(out)

unnest

let out = ratings
    .clone()
    .lazy()
    .select([col("Theatre").value_counts(true, true, "count".to_string(), false)])
    .unnest(["Theatre"])
    .collect()?;
println!("{}", &out);

shape: (5, 2)
┌─────────┬───────┐
│ Theatre ┆ count │
│ ---     ┆ ---   │
│ str     ┆ u32   │
╞═════════╪═══════╡
│ NE      ┆ 3     │
│ IL      ┆ 3     │
│ SD      ┆ 2     │
│ ME      ┆ 1     │
│ ND      ┆ 1     │
└─────────┴───────┘

Why value_counts returns a Struct

Polars expressions always have a Fn(Series) -> Series signature and Struct is thus the data type that allows us to provide multiple columns as input/output of an expression. In other words, all expressions have to return a Series object, and Struct allows us to stay consistent with that requirement.

Structs as `dict`s

Polars will interpret a dict sent to the Series constructor as a Struct:

Python Rust

Series

rating_series = pl.Series(
    "ratings",
    [
        {"Movie": "Cars", "Theatre": "NE", "Avg_Rating": 4.5},
        {"Movie": "Toy Story", "Theatre": "ME", "Avg_Rating": 4.9},
    ],
)
print(rating_series)

Series

// There may be no direct Rust equivalent of the dict-based constructor;
// converting a DataFrame into a struct Series works instead
let rating_series = df!(
    "Movie" => &["Cars", "Toy Story"],
    "Theatre" => &["NE", "ME"],
    "Avg_Rating" => &[4.5, 4.9],
)?
.into_struct("ratings")
.into_series();
println!("{}", &rating_series);

shape: (2,)
Series: 'ratings' [struct[3]]
[
    {"Cars","NE",4.5}
    {"Toy Story","ME",4.9}
]

Constructing Series objects

Note that Series here was constructed with the name of the series in the beginning, followed by the values. Providing the latter first is considered an anti-pattern in Polars, and must be avoided.

Extracting individual values of a `Struct`

Let's say that we needed to obtain just the movie value in the Series that we created above. We can use the field method to do so:

Python Rust

struct.field

out = rating_series.struct.field("Movie")
print(out)

struct.field_by_name

let out = rating_series.struct_()?.field_by_name("Movie")?;
println!("{}", &out);

shape: (2,)
Series: 'Movie' [str]
[
    "Cars"
    "Toy Story"
]

Renaming individual keys of a `Struct`

What if we need to rename individual fields of a Struct column? We first convert the rating_series object to a DataFrame so that we can view the changes easily, and then use the rename_fields method:

Python Rust

struct.rename_fields

out = (
    rating_series.to_frame()
    .select(pl.col("ratings").struct.rename_fields(["Film", "State", "Value"]))
    .unnest("ratings")
)
print(out)

struct.rename_fields

let out = DataFrame::new([rating_series].into())?
    .lazy()
    .select([col("ratings")
        .struct_()
        .rename_fields(["Film".into(), "State".into(), "Value".into()].to_vec())])
    .unnest(["ratings"])
    .collect()?;

println!("{}", &out);

shape: (2, 3)
┌───────────┬───────┬───────┐
│ Film      ┆ State ┆ Value │
│ ---       ┆ ---   ┆ ---   │
│ str       ┆ str   ┆ f64   │
╞═══════════╪═══════╪═══════╡
│ Cars      ┆ NE    ┆ 4.5   │
│ Toy Story ┆ ME    ┆ 4.9   │
└───────────┴───────┴───────┘

Practical use-cases of `Struct` columns

Identifying duplicate rows

Let's get back to the ratings data. We want to identify cases where there are duplicates at a Movie and Theatre level. This is where the Struct datatype shines:

Python Rust

is_duplicated · struct

out = ratings.filter(pl.struct("Movie", "Theatre").is_duplicated())
print(out)

is_duplicated · Struct · Available on feature dtype-struct

let out = ratings
    .clone()
    .lazy()
    // .filter(as_struct(&[col("Movie"), col("Theatre")]).is_duplicated())
    // Error: .is_duplicated() not available if you try that
    // https://github.com/pola-rs/polars/issues/3803
    .filter(len().over([col("Movie"), col("Theatre")]).gt(lit(1)))
    .collect()?;
println!("{}", &out);

shape: (4, 4)
┌───────┬─────────┬────────────┬───────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count │
│ ---   ┆ ---     ┆ ---        ┆ ---   │
│ str   ┆ str     ┆ f64        ┆ i64   │
╞═══════╪═════════╪════════════╪═══════╡
│ Cars  ┆ NE      ┆ 4.5        ┆ 30    │
│ ET    ┆ IL      ┆ 4.6        ┆ 26    │
│ Cars  ┆ NE      ┆ 4.7        ┆ 28    │
│ ET    ┆ IL      ┆ 4.9        ┆ 26    │
└───────┴─────────┴────────────┴───────┘

We can also identify the unique cases at this level with is_unique!

Multi-column ranking

Suppose that, given that we know there are duplicates, we want to decide which row in each duplicated group gets the higher rank. We define the Count of ratings to be more important than the Avg_Rating itself, and use Avg_Rating only to break ties. We can then do:

Python Rust

is_duplicated · struct

out = ratings.with_columns(
    pl.struct("Count", "Avg_Rating")
    .rank("dense", descending=True)
    .over("Movie", "Theatre")
    .alias("Rank")
).filter(pl.struct("Movie", "Theatre").is_duplicated())
print(out)

is_duplicated · Struct · Available on feature dtype-struct

let out = ratings
    .clone()
    .lazy()
    .with_columns([as_struct(vec![col("Count"), col("Avg_Rating")])
        .rank(
            RankOptions {
                method: RankMethod::Dense,
                descending: true,
            },
            None,
        )
        .over([col("Movie"), col("Theatre")])
        .alias("Rank")])
    // .filter(as_struct(&[col("Movie"), col("Theatre")]).is_duplicated())
    // Error: .is_duplicated() not available if you try that
    // https://github.com/pola-rs/polars/issues/3803
    .filter(len().over([col("Movie"), col("Theatre")]).gt(lit(1)))
    .collect()?;
println!("{}", &out);

shape: (4, 5)
┌───────┬─────────┬────────────┬───────┬──────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count ┆ Rank │
│ ---   ┆ ---     ┆ ---        ┆ ---   ┆ ---  │
│ str   ┆ str     ┆ f64        ┆ i64   ┆ u32  │
╞═══════╪═════════╪════════════╪═══════╪══════╡
│ Cars  ┆ NE      ┆ 4.5        ┆ 30    ┆ 1    │
│ ET    ┆ IL      ┆ 4.6        ┆ 26    ┆ 2    │
│ Cars  ┆ NE      ┆ 4.7        ┆ 28    ┆ 2    │
│ ET    ┆ IL      ┆ 4.9        ┆ 26    ┆ 1    │
└───────┴─────────┴────────────┴───────┴──────┘

That's a pretty complex set of requirements done very elegantly in Polars!

Using multi-column apply

This was discussed in the previous section on User Defined Functions for the Python case. Here's an example of doing so with both Python and Rust:

Python Rust

df = pl.DataFrame({"keys": ["a", "a", "b"], "values": [10, 7, 1]})

out = df.select(
    pl.struct(["keys", "values"])
    .map_elements(lambda x: len(x["keys"]) + x["values"], return_dtype=pl.Int64)
    .alias("solution_map_elements"),
    (pl.col("keys").str.len_bytes() + pl.col("values")).alias("solution_expr"),
)
print(out)

let df = df!(
    "keys" => &["a", "a", "b"],
    "values" => &[10, 7, 1],
)?;

let out = df
    .lazy()
    .select([
        // pack to struct to get access to multiple fields in a custom `apply/map`
        as_struct(vec![col("keys"), col("values")])
            // we will compute the len(a) + b
            .apply(
                |s| {
                    // downcast to struct
                    let ca = s.struct_()?;

                    // get the fields as Series
                    let s_a = &ca.fields()[0];
                    let s_b = &ca.fields()[1];

                    // downcast the `Series` to their known type
                    let ca_a = s_a.str()?;
                    let ca_b = s_b.i32()?;

                    // iterate both `ChunkedArrays`
                    let out: Int32Chunked = ca_a
                        .into_iter()
                        .zip(ca_b)
                        .map(|(opt_a, opt_b)| match (opt_a, opt_b) {
                            (Some(a), Some(b)) => Some(a.len() as i32 + b),
                            _ => None,
                        })
                        .collect();

                    Ok(Some(out.into_series()))
                },
                GetOutput::from_type(DataType::Int32),
            )
            // note: the `'solution_map_elements'` alias is just there to show how you
            // get the same output as in the Python API example.
            .alias("solution_map_elements"),
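        // `literal = false` treats "." as a regex that matches any character,
        // so this counts the characters in each key
        (col("keys").str().count_matches(lit("."), false) + col("values"))
            .alias("solution_expr"),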
    ])
    .collect()?;
println!("{}", out);

shape: (3, 2)
┌───────────────────────┬───────────────┐
│ solution_map_elements ┆ solution_expr │
│ ---                   ┆ ---           │
│ i64                   ┆ i64           │
╞═══════════════════════╪═══════════════╡
│ 11                    ┆ 11            │
│ 8                     ┆ 8             │
│ 2                     ┆ 2             │
└───────────────────────┴───────────────┘
