Skip to content

Structs

The data type Struct is a composite data type that can store multiple fields in a single column.

??? tip "Python analogy" For Python users, the data type Struct is kind of like a Python dictionary. Even better, if you are familiar with Python typing, you can think of the data type Struct as typing.TypedDict.

In this page of the user guide we will see situations in which the data type Struct arises, we will understand why it does arise, and we will see how to work with Struct values.

Let's start with a dataframe that captures the average rating of a few movies across some states in the US:

DataFrame

import polars as pl

ratings = pl.DataFrame(
    {
        "Movie": ["Cars", "IT", "ET", "Cars", "Up", "IT", "Cars", "ET", "Up", "Cars"],
        "Theatre": ["NE", "ME", "IL", "ND", "NE", "SD", "NE", "IL", "IL", "NE"],
        "Avg_Rating": [4.5, 4.4, 4.6, 4.3, 4.8, 4.7, 4.5, 4.9, 4.7, 4.6],
        "Count": [30, 27, 26, 29, 31, 28, 28, 26, 33, 28],
    }
)
print(ratings)

DataFrame

use polars::prelude::*;
let ratings = df!(
        "Movie"=> ["Cars", "IT", "ET", "Cars", "Up", "IT", "Cars", "ET", "Up", "Cars"],
        "Theatre"=> ["NE", "ME", "IL", "ND", "NE", "SD", "NE", "IL", "IL", "NE"],
        "Avg_Rating"=> [4.5, 4.4, 4.6, 4.3, 4.8, 4.7, 4.5, 4.9, 4.7, 4.6],
        "Count"=> [30, 27, 26, 29, 31, 28, 28, 26, 33, 28],

)?;
println!("{}", &ratings);

shape: (10, 4)
┌───────┬─────────┬────────────┬───────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count │
│ ---   ┆ ---     ┆ ---        ┆ ---   │
│ str   ┆ str     ┆ f64        ┆ i64   │
╞═══════╪═════════╪════════════╪═══════╡
│ Cars  ┆ NE      ┆ 4.5        ┆ 30    │
│ IT    ┆ ME      ┆ 4.4        ┆ 27    │
│ ET    ┆ IL      ┆ 4.6        ┆ 26    │
│ Cars  ┆ ND      ┆ 4.3        ┆ 29    │
│ Up    ┆ NE      ┆ 4.8        ┆ 31    │
│ IT    ┆ SD      ┆ 4.7        ┆ 28    │
│ Cars  ┆ NE      ┆ 4.5        ┆ 28    │
│ ET    ┆ IL      ┆ 4.9        ┆ 26    │
│ Up    ┆ IL      ┆ 4.7        ┆ 33    │
│ Cars  ┆ NE      ┆ 4.6        ┆ 28    │
└───────┴─────────┴────────────┴───────┘

Encountering the data type Struct

A common operation that will lead to a Struct column is the ever so popular value_counts function that is commonly used in exploratory data analysis. Checking the number of times a state appears in the data is done as so:

value_counts

result = ratings.select(pl.col("Theatre").value_counts(sort=True))
print(result)

value_counts · Available on feature dtype-struct

let result = ratings
    .clone()
    .lazy()
    .select([col("Theatre").value_counts(true, true, "count", false)])
    .collect()?;
println!("{}", result);

shape: (5, 1)
┌───────────┐
│ Theatre   │
│ ---       │
│ struct[2] │
╞═══════════╡
│ {"NE",4}  │
│ {"IL",3}  │
│ {"ME",1}  │
│ {"ND",1}  │
│ {"SD",1}  │
└───────────┘

Quite unexpected an output, especially if coming from tools that do not have such a data type. We're not in peril, though. To get back to a more familiar output, all we need to do is use the function unnest on the Struct column:

unnest

result = ratings.select(pl.col("Theatre").value_counts(sort=True)).unnest("Theatre")
print(result)

unnest

let result = ratings
    .clone()
    .lazy()
    .select([col("Theatre").value_counts(true, true, "count", false)])
    .unnest(["Theatre"])
    .collect()?;
println!("{}", result);

shape: (5, 2)
┌─────────┬───────┐
│ Theatre ┆ count │
│ ---     ┆ ---   │
│ str     ┆ u32   │
╞═════════╪═══════╡
│ NE      ┆ 4     │
│ IL      ┆ 3     │
│ ME      ┆ 1     │
│ ND      ┆ 1     │
│ SD      ┆ 1     │
└─────────┴───────┘

The function unnest will turn each field of the Struct into its own column.

Why value_counts returns a Struct

Polars expressions always operate on a single series and return another series. Struct is the data type that allows us to provide multiple columns as input to an expression, or to output multiple columns from an expression. Thus, we can use the data type Struct to specify each value and its count when we use value_counts.

Inferring the data type Struct from dictionaries

When building series or dataframes, Polars will convert dictionaries to the data type Struct:

Series

rating_series = pl.Series(
    "ratings",
    [
        {"Movie": "Cars", "Theatre": "NE", "Avg_Rating": 4.5},
        {"Movie": "Toy Story", "Theatre": "ME", "Avg_Rating": 4.9},
    ],
)
print(rating_series)

Series

// Don't think we can make it the same way in rust, but this works
let rating_series = df!(
    "Movie" => &["Cars", "Toy Story"],
    "Theatre" => &["NE", "ME"],
    "Avg_Rating" => &[4.5, 4.9],
)?
.into_struct("ratings".into())
.into_series();
println!("{}", &rating_series);

shape: (2,)
Series: 'ratings' [struct[3]]
[
    {"Cars","NE",4.5}
    {"Toy Story","ME",4.9}
]

The number of fields, their names, and their types, are inferred from the first dictionary seen. Subsequent incongruences can result in null values or in errors:

Series

null_rating_series = pl.Series(
    "ratings",
    [
        {"Movie": "Cars", "Theatre": "NE", "Avg_Rating": 4.5},
        {"Mov": "Toy Story", "Theatre": "ME", "Avg_Rating": 4.9},
        {"Movie": "Snow White", "Theatre": "IL", "Avg_Rating": "4.7"},
    ],
    strict=False,  # To show the final structs with `null` values.
)
print(null_rating_series)

Series

// Contribute the Rust translation of the Python example by opening a PR.

shape: (3,)
Series: 'ratings' [struct[4]]
[
    {"Cars","NE","4.5",null}
    {null,"ME","4.9","Toy Story"}
    {"Snow White","IL","4.7",null}
]

Extracting individual values of a Struct

Let's say that we needed to obtain just the field "Movie" from the Struct in the series that we created above. We can use the function field to do so:

struct.field

result = rating_series.struct.field("Movie")
print(result)

struct.field_by_name

let result = rating_series.struct_()?.field_by_name("Movie")?;
println!("{}", result);

shape: (2,)
Series: 'Movie' [str]
[
    "Cars"
    "Toy Story"
]

Renaming individual fields of a Struct

What if we need to rename individual fields of a Struct column? We use the function rename_fields:

struct.rename_fields

result = rating_series.struct.rename_fields(["Film", "State", "Value"])
print(result)

struct.rename_fields

// Contribute the Rust translation of the Python example by opening a PR.

shape: (2,)
Series: 'ratings' [struct[3]]
[
    {"Cars","NE",4.5}
    {"Toy Story","ME",4.9}
]

To be able to actually see that the field names were change, we will create a dataframe where the only column is the result and then we use the function unnest so that each field becomes its own column. The column names will reflect the renaming operation we just did:

struct.rename_fields

print(
    result.to_frame().unnest("ratings"),
)

struct.rename_fields

// Contribute the Rust translation of the Python example by opening a PR.

shape: (2, 3)
┌───────────┬───────┬───────┐
│ Film      ┆ State ┆ Value │
│ ---       ┆ ---   ┆ ---   │
│ str       ┆ str   ┆ f64   │
╞═══════════╪═══════╪═══════╡
│ Cars      ┆ NE    ┆ 4.5   │
│ Toy Story ┆ ME    ┆ 4.9   │
└───────────┴───────┴───────┘

Practical use-cases of Struct columns

Identifying duplicate rows

Let's get back to the ratings data. We want to identify cases where there are duplicates at a “Movie” and “Theatre” level.

This is where the data type Struct shines:

is_duplicated · struct

result = ratings.filter(pl.struct("Movie", "Theatre").is_duplicated())
print(result)

is_duplicated · Struct · Available on feature dtype-struct

// Contribute the Rust translation of the Python example by opening a PR.

shape: (5, 4)
┌───────┬─────────┬────────────┬───────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count │
│ ---   ┆ ---     ┆ ---        ┆ ---   │
│ str   ┆ str     ┆ f64        ┆ i64   │
╞═══════╪═════════╪════════════╪═══════╡
│ Cars  ┆ NE      ┆ 4.5        ┆ 30    │
│ ET    ┆ IL      ┆ 4.6        ┆ 26    │
│ Cars  ┆ NE      ┆ 4.5        ┆ 28    │
│ ET    ┆ IL      ┆ 4.9        ┆ 26    │
│ Cars  ┆ NE      ┆ 4.6        ┆ 28    │
└───────┴─────────┴────────────┴───────┘

We can identify the unique cases at this level also with is_unique!

Multi-column ranking

Suppose, given that we know there are duplicates, we want to choose which rating gets a higher priority. We can say that the column “Count” is the most important, and if there is a tie in the column “Count” then we consider the column “Avg_Rating”.

We can then do:

is_duplicated · struct

result = ratings.with_columns(
    pl.struct("Count", "Avg_Rating")
    .rank("dense", descending=True)
    .over("Movie", "Theatre")
    .alias("Rank")
).filter(pl.struct("Movie", "Theatre").is_duplicated())

print(result)

is_duplicated · Struct · Available on feature dtype-struct

let result = ratings
    .clone()
    .lazy()
    .with_columns([as_struct(vec![col("Count"), col("Avg_Rating")])
        .rank(
            RankOptions {
                method: RankMethod::Dense,
                descending: true,
            },
            None,
        )
        .over([col("Movie"), col("Theatre")])
        .alias("Rank")])
    // .filter(as_struct(&[col("Movie"), col("Theatre")]).is_duplicated())
    // Error: .is_duplicated() not available if you try that
    // https://github.com/pola-rs/polars/issues/3803
    .filter(len().over([col("Movie"), col("Theatre")]).gt(lit(1)))
    .collect()?;
println!("{}", result);

shape: (5, 5)
┌───────┬─────────┬────────────┬───────┬──────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count ┆ Rank │
│ ---   ┆ ---     ┆ ---        ┆ ---   ┆ ---  │
│ str   ┆ str     ┆ f64        ┆ i64   ┆ u32  │
╞═══════╪═════════╪════════════╪═══════╪══════╡
│ Cars  ┆ NE      ┆ 4.5        ┆ 30    ┆ 1    │
│ ET    ┆ IL      ┆ 4.6        ┆ 26    ┆ 2    │
│ Cars  ┆ NE      ┆ 4.5        ┆ 28    ┆ 3    │
│ ET    ┆ IL      ┆ 4.9        ┆ 26    ┆ 1    │
│ Cars  ┆ NE      ┆ 4.6        ┆ 28    ┆ 2    │
└───────┴─────────┴────────────┴───────┴──────┘

That's a pretty complex set of requirements done very elegantly in Polars! To learn more about the function over, used above, see the user guide section on window functions.

Using multiple columns in a single expression

As mentioned earlier, the data type Struct is also useful if you need to pass multiple columns as input to an expression. As an example, suppose we want to compute the Ackermann function on two columns of a dataframe. There is no way of composing Polars expressions to compute the Ackermann function1, so we define a custom function:

def ack(m, n):
    if not m:
        return n + 1
    if not n:
        return ack(m - 1, 1)
    return ack(m - 1, ack(m, n - 1))
// Contribute the Rust translation of the Python example by opening a PR.

Now, to compute the values of the Ackermann function on those arguments, we start by creating a Struct with fields m and n and then use the function map_elements to apply the function ack to each value:

map_elements

values = pl.DataFrame(
    {
        "m": [0, 0, 0, 1, 1, 1, 2],
        "n": [2, 3, 4, 1, 2, 3, 1],
    }
)
result = values.with_columns(
    pl.struct(["m", "n"])
    .map_elements(lambda s: ack(s["m"], s["n"]), return_dtype=pl.Int64)
    .alias("ack")
)

print(result)

// Contribute the Rust translation of the Python example by opening a PR.
shape: (7, 3)
┌─────┬─────┬─────┐
│ m   ┆ n   ┆ ack │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 0   ┆ 2   ┆ 3   │
│ 0   ┆ 3   ┆ 4   │
│ 0   ┆ 4   ┆ 5   │
│ 1   ┆ 1   ┆ 3   │
│ 1   ┆ 2   ┆ 4   │
│ 1   ┆ 3   ┆ 5   │
│ 2   ┆ 1   ┆ 5   │
└─────┴─────┴─────┘

Refer to this section of the user guide to learn more about applying user-defined Python functions to your data.


  1. To say that something cannot be done is quite a bold claim. If you prove us wrong, please let us know!