Structs

The data type Struct is a composite data type that can store multiple fields in a single column.

Python analogy

For Python users, the data type Struct is kind of like a Python dictionary. Even better, if you are familiar with Python typing, you can think of the data type Struct as typing.TypedDict.

In this page of the user guide we will see situations in which the data type Struct arises, we will understand why it does arise, and we will see how to work with Struct values.

Let's start with a dataframe that captures the average rating of a few movies across some states in the US:

Python Rust

DataFrame

import polars as pl

ratings = pl.DataFrame(
    {
        "Movie": ["Cars", "IT", "ET", "Cars", "Up", "IT", "Cars", "ET", "Up", "Cars"],
        "Theatre": ["NE", "ME", "IL", "ND", "NE", "SD", "NE", "IL", "IL", "NE"],
        "Avg_Rating": [4.5, 4.4, 4.6, 4.3, 4.8, 4.7, 4.5, 4.9, 4.7, 4.6],
        "Count": [30, 27, 26, 29, 31, 28, 28, 26, 33, 28],
    }
)
print(ratings)

DataFrame

use polars::prelude::*;
let ratings = df!(
        "Movie"=> ["Cars", "IT", "ET", "Cars", "Up", "IT", "Cars", "ET", "Up", "Cars"],
        "Theatre"=> ["NE", "ME", "IL", "ND", "NE", "SD", "NE", "IL", "IL", "NE"],
        "Avg_Rating"=> [4.5, 4.4, 4.6, 4.3, 4.8, 4.7, 4.5, 4.9, 4.7, 4.6],
        "Count"=> [30, 27, 26, 29, 31, 28, 28, 26, 33, 28],

)?;
println!("{}", &ratings);

shape: (10, 4)
┌───────┬─────────┬────────────┬───────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count │
│ ---   ┆ ---     ┆ ---        ┆ ---   │
│ str   ┆ str     ┆ f64        ┆ i64   │
╞═══════╪═════════╪════════════╪═══════╡
│ Cars  ┆ NE      ┆ 4.5        ┆ 30    │
│ IT    ┆ ME      ┆ 4.4        ┆ 27    │
│ ET    ┆ IL      ┆ 4.6        ┆ 26    │
│ Cars  ┆ ND      ┆ 4.3        ┆ 29    │
│ Up    ┆ NE      ┆ 4.8        ┆ 31    │
│ IT    ┆ SD      ┆ 4.7        ┆ 28    │
│ Cars  ┆ NE      ┆ 4.5        ┆ 28    │
│ ET    ┆ IL      ┆ 4.9        ┆ 26    │
│ Up    ┆ IL      ┆ 4.7        ┆ 33    │
│ Cars  ┆ NE      ┆ 4.6        ┆ 28    │
└───────┴─────────┴────────────┴───────┘

Encountering the data type `Struct`

A common operation that will lead to a Struct column is the ever so popular value_counts function that is commonly used in exploratory data analysis. Checking the number of times a state appears in the data is done as so:

Python Rust

value_counts

result = ratings.select(pl.col("Theatre").value_counts(sort=True))
print(result)

value_counts · Available on feature dtype-struct

let result = ratings
    .clone()
    .lazy()
    .select([col("Theatre").value_counts(true, true, "count", false)])
    .collect()?;
println!("{}", result);

shape: (5, 1)
┌───────────┐
│ Theatre   │
│ ---       │
│ struct[2] │
╞═══════════╡
│ {"NE",4}  │
│ {"IL",3}  │
│ {"ME",1}  │
│ {"ND",1}  │
│ {"SD",1}  │
└───────────┘

Quite unexpected an output, especially if coming from tools that do not have such a data type. We're not in peril, though. To get back to a more familiar output, all we need to do is use the function unnest on the Struct column:

Python Rust

unnest

result = ratings.select(pl.col("Theatre").value_counts(sort=True)).unnest("Theatre")
print(result)

unnest

let result = ratings
    .clone()
    .lazy()
    .select([col("Theatre").value_counts(true, true, "count", false)])
    .unnest(["Theatre"])
    .collect()?;
println!("{}", result);

shape: (5, 2)
┌─────────┬───────┐
│ Theatre ┆ count │
│ ---     ┆ ---   │
│ str     ┆ u32   │
╞═════════╪═══════╡
│ NE      ┆ 4     │
│ IL      ┆ 3     │
│ ME      ┆ 1     │
│ ND      ┆ 1     │
│ SD      ┆ 1     │
└─────────┴───────┘

The function unnest will turn each field of the Struct into its own column.

Why value_counts returns a Struct

Polars expressions always operate on a single series and return another series. Struct is the data type that allows us to provide multiple columns as input to an expression, or to output multiple columns from an expression. Thus, we can use the data type Struct to specify each value and its count when we use value_counts.

Inferring the data type `Struct` from dictionaries

When building series or dataframes, Polars will convert dictionaries to the data type Struct:

Python Rust

Series

rating_series = pl.Series(
    "ratings",
    [
        {"Movie": "Cars", "Theatre": "NE", "Avg_Rating": 4.5},
        {"Movie": "Toy Story", "Theatre": "ME", "Avg_Rating": 4.9},
    ],
)
print(rating_series)

Series

// Don't think we can make it the same way in rust, but this works
let rating_series = df!(
    "Movie" => &["Cars", "Toy Story"],
    "Theatre" => &["NE", "ME"],
    "Avg_Rating" => &[4.5, 4.9],
)?
.into_struct("ratings".into())
.into_series();
println!("{}", &rating_series);

shape: (2,)
Series: 'ratings' [struct[3]]
[
    {"Cars","NE",4.5}
    {"Toy Story","ME",4.9}
]

The number of fields, their names, and their types, are inferred from the first dictionary seen. Subsequent incongruences can result in null values or in errors:

Python Rust

Series

null_rating_series = pl.Series(
    "ratings",
    [
        {"Movie": "Cars", "Theatre": "NE", "Avg_Rating": 4.5},
        {"Mov": "Toy Story", "Theatre": "ME", "Avg_Rating": 4.9},
        {"Movie": "Snow White", "Theatre": "IL", "Avg_Rating": "4.7"},
    ],
    strict=False,  # To show the final structs with `null` values.
)
print(null_rating_series)

Series

// Contribute the Rust translation of the Python example by opening a PR.

shape: (3,)
Series: 'ratings' [struct[4]]
[
    {"Cars","NE","4.5",null}
    {null,"ME","4.9","Toy Story"}
    {"Snow White","IL","4.7",null}
]

Extracting individual values of a `Struct`

Let's say that we needed to obtain just the field "Movie" from the Struct in the series that we created above. We can use the function field to do so:

Python Rust

struct.field

result = rating_series.struct.field("Movie")
print(result)

struct.field_by_name

let result = rating_series.struct_()?.field_by_name("Movie")?;
println!("{}", result);

shape: (2,)
Series: 'Movie' [str]
[
    "Cars"
    "Toy Story"
]

Renaming individual fields of a `Struct`

What if we need to rename individual fields of a Struct column? We use the function rename_fields:

Python Rust

struct.rename_fields

result = rating_series.struct.rename_fields(["Film", "State", "Value"])
print(result)

struct.rename_fields

// Contribute the Rust translation of the Python example by opening a PR.

shape: (2,)
Series: 'ratings' [struct[3]]
[
    {"Cars","NE",4.5}
    {"Toy Story","ME",4.9}
]

To be able to actually see that the field names were change, we will create a dataframe where the only column is the result and then we use the function unnest so that each field becomes its own column. The column names will reflect the renaming operation we just did:

Python Rust

struct.rename_fields

print(
    result.to_frame().unnest("ratings"),
)

struct.rename_fields

// Contribute the Rust translation of the Python example by opening a PR.

shape: (2, 3)
┌───────────┬───────┬───────┐
│ Film      ┆ State ┆ Value │
│ ---       ┆ ---   ┆ ---   │
│ str       ┆ str   ┆ f64   │
╞═══════════╪═══════╪═══════╡
│ Cars      ┆ NE    ┆ 4.5   │
│ Toy Story ┆ ME    ┆ 4.9   │
└───────────┴───────┴───────┘

Practical use-cases of `Struct` columns

Identifying duplicate rows

Let's get back to the ratings data. We want to identify cases where there are duplicates at a “Movie” and “Theatre” level.

This is where the data type Struct shines:

Python Rust

is_duplicated · struct

result = ratings.filter(pl.struct("Movie", "Theatre").is_duplicated())
print(result)

is_duplicated · Struct · Available on feature dtype-struct

// Contribute the Rust translation of the Python example by opening a PR.

shape: (5, 4)
┌───────┬─────────┬────────────┬───────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count │
│ ---   ┆ ---     ┆ ---        ┆ ---   │
│ str   ┆ str     ┆ f64        ┆ i64   │
╞═══════╪═════════╪════════════╪═══════╡
│ Cars  ┆ NE      ┆ 4.5        ┆ 30    │
│ ET    ┆ IL      ┆ 4.6        ┆ 26    │
│ Cars  ┆ NE      ┆ 4.5        ┆ 28    │
│ ET    ┆ IL      ┆ 4.9        ┆ 26    │
│ Cars  ┆ NE      ┆ 4.6        ┆ 28    │
└───────┴─────────┴────────────┴───────┘

We can identify the unique cases at this level also with is_unique!

Multi-column ranking

Suppose, given that we know there are duplicates, we want to choose which rating gets a higher priority. We can say that the column “Count” is the most important, and if there is a tie in the column “Count” then we consider the column “Avg_Rating”.

We can then do:

Python Rust

is_duplicated · struct

result = ratings.with_columns(
    pl.struct("Count", "Avg_Rating")
    .rank("dense", descending=True)
    .over("Movie", "Theatre")
    .alias("Rank")
).filter(pl.struct("Movie", "Theatre").is_duplicated())

print(result)

is_duplicated · Struct · Available on feature dtype-struct

let result = ratings
    .clone()
    .lazy()
    .with_columns([as_struct(vec![col("Count"), col("Avg_Rating")])
        .rank(
            RankOptions {
                method: RankMethod::Dense,
                descending: true,
            },
            None,
        )
        .over([col("Movie"), col("Theatre")])
        .alias("Rank")])
    // .filter(as_struct(&[col("Movie"), col("Theatre")]).is_duplicated())
    // Error: .is_duplicated() not available if you try that
    // https://github.com/pola-rs/polars/issues/3803
    .filter(len().over([col("Movie"), col("Theatre")]).gt(lit(1)))
    .collect()?;
println!("{}", result);

shape: (5, 5)
┌───────┬─────────┬────────────┬───────┬──────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count ┆ Rank │
│ ---   ┆ ---     ┆ ---        ┆ ---   ┆ ---  │
│ str   ┆ str     ┆ f64        ┆ i64   ┆ u32  │
╞═══════╪═════════╪════════════╪═══════╪══════╡
│ Cars  ┆ NE      ┆ 4.5        ┆ 30    ┆ 1    │
│ ET    ┆ IL      ┆ 4.6        ┆ 26    ┆ 2    │
│ Cars  ┆ NE      ┆ 4.5        ┆ 28    ┆ 3    │
│ ET    ┆ IL      ┆ 4.9        ┆ 26    ┆ 1    │
│ Cars  ┆ NE      ┆ 4.6        ┆ 28    ┆ 2    │
└───────┴─────────┴────────────┴───────┴──────┘

That's a pretty complex set of requirements done very elegantly in Polars! To learn more about the function over, used above, see the user guide section on window functions.

Using multiple columns in a single expression

As mentioned earlier, the data type Struct is also useful if you need to pass multiple columns as input to an expression. As an example, suppose we want to compute the Ackermann function on two columns of a dataframe. There is no way of composing Polars expressions to compute the Ackermann function¹, so we define a custom function:

Python Rust

def ack(m, n):
    if not m:
        return n + 1
    if not n:
        return ack(m - 1, 1)
    return ack(m - 1, ack(m, n - 1))

// Contribute the Rust translation of the Python example by opening a PR.

Now, to compute the values of the Ackermann function on those arguments, we start by creating a Struct with fields m and n and then use the function map_elements to apply the function ack to each value:

Python Rust

map_elements

values = pl.DataFrame(
    {
        "m": [0, 0, 0, 1, 1, 1, 2],
        "n": [2, 3, 4, 1, 2, 3, 1],
    }
)
result = values.with_columns(
    pl.struct(["m", "n"])
    .map_elements(lambda s: ack(s["m"], s["n"]), return_dtype=pl.Int64)
    .alias("ack")
)

print(result)

// Contribute the Rust translation of the Python example by opening a PR.

shape: (7, 3)
┌─────┬─────┬─────┐
│ m   ┆ n   ┆ ack │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 0   ┆ 2   ┆ 3   │
│ 0   ┆ 3   ┆ 4   │
│ 0   ┆ 4   ┆ 5   │
│ 1   ┆ 1   ┆ 3   │
│ 1   ┆ 2   ┆ 4   │
│ 1   ┆ 3   ┆ 5   │
│ 2   ┆ 1   ┆ 5   │
└─────┴─────┴─────┘

Refer to this section of the user guide to learn more about applying user-defined Python functions to your data.

To say that something cannot be done is quite a bold claim. If you prove us wrong, please let us know! ↩

Structs

Encountering the data type Struct

Inferring the data type Struct from dictionaries

Extracting individual values of a Struct

Renaming individual fields of a Struct

Practical use-cases of Struct columns

Identifying duplicate rows

Multi-column ranking

Using multiple columns in a single expression

Encountering the data type `Struct`

Inferring the data type `Struct` from dictionaries

Extracting individual values of a `Struct`

Renaming individual fields of a `Struct`

Practical use-cases of `Struct` columns