Skip to content

Lists and arrays

Polars has first-class support for two homogeneous container data types: List and Array. Polars supports many operations with the two data types and their APIs overlap, so this section of the user guide has the objective of clarifying when one data type should be chosen in favour of the other.

Lists vs arrays

The data type List

The data type list is suitable for columns whose values are homogeneous 1D containers of varying lengths.

The dataframe below contains three examples of columns with the data type List:

List

from datetime import datetime
import polars as pl

df = pl.DataFrame(
    {
        "names": [
            ["Anne", "Averill", "Adams"],
            ["Brandon", "Brooke", "Borden", "Branson"],
            ["Camila", "Campbell"],
            ["Dennis", "Doyle"],
        ],
        "children_ages": [
            [5, 7],
            [],
            [],
            [8, 11, 18],
        ],
        "medical_appointments": [
            [],
            [],
            [],
            [datetime(2022, 5, 22, 16, 30)],
        ],
    }
)

print(df)

List

// Contribute the Rust translation of the Python example by opening a PR.

shape: (4, 3)
┌─────────────────────────────────┬───────────────┬───────────────────────┐
│ names                           ┆ children_ages ┆ medical_appointments  │
│ ---                             ┆ ---           ┆ ---                   │
│ list[str]                       ┆ list[i64]     ┆ list[datetime[μs]]    │
╞═════════════════════════════════╪═══════════════╪═══════════════════════╡
│ ["Anne", "Averill", "Adams"]    ┆ [5, 7]        ┆ []                    │
│ ["Brandon", "Brooke", … "Brans… ┆ []            ┆ []                    │
│ ["Camila", "Campbell"]          ┆ []            ┆ []                    │
│ ["Dennis", "Doyle"]             ┆ [8, 11, 18]   ┆ [2022-05-22 16:30:00] │
└─────────────────────────────────┴───────────────┴───────────────────────┘

Note that the data type List is different from Python's type list, where elements can be of any type. If you want to store true Python lists in a column, you can do so with the data type Object and your column will not have the list manipulation features that we're about to discuss.

The data type Array

The data type Array is suitable for columns whose values are homogeneous containers of an arbitrary dimension with a known and fixed shape.

The dataframe below contains two examples of columns with the data type Array.

Array

df = pl.DataFrame(
    {
        "bit_flags": [
            [True, True, True, True, False],
            [False, True, True, True, True],
        ],
        "tic_tac_toe": [
            [
                [" ", "x", "o"],
                [" ", "x", " "],
                ["o", "x", " "],
            ],
            [
                ["o", "x", "x"],
                [" ", "o", "x"],
                [" ", " ", "o"],
            ],
        ],
    },
    schema={
        "bit_flags": pl.Array(pl.Boolean, 5),
        "tic_tac_toe": pl.Array(pl.String, (3, 3)),
    },
)

print(df)

Array · Available on feature dtype-array

// Contribute the Rust translation of the Python example by opening a PR.

shape: (2, 2)
┌───────────────────────┬─────────────────────────────────┐
│ bit_flags             ┆ tic_tac_toe                     │
│ ---                   ┆ ---                             │
│ array[bool, 5]        ┆ array[str, (3, 3)]              │
╞═══════════════════════╪═════════════════════════════════╡
│ [true, true, … false] ┆ [[" ", "x", "o"], [" ", "x", "… │
│ [false, true, … true] ┆ [["o", "x", "x"], [" ", "o", "… │
└───────────────────────┴─────────────────────────────────┘

The example above shows how to specify that the columns “bit_flags” and “tic_tac_toe” have the data type Array, parametrised by the data type of the elements contained within and by the shape of each array.

In general, Polars does not infer that a column has the data type Array for performance reasons, and defaults to the appropriate variant of the data type List. In Python, an exception to this rule is when you provide a NumPy array to build a column. In that case, Polars has the guarantee from NumPy that all subarrays have the same shape, so an array of \(n + 1\) dimensions will generate a column of \(n\) dimensional arrays:

Array

import numpy as np

array = np.arange(0, 120).reshape((5, 2, 3, 4))  # 4D array

print(pl.Series(array).dtype)  # Column with the 3D subarrays

Array · Available on feature dtype-array

// Contribute the Rust translation of the Python example by opening a PR.

Array(Int64, shape=(2, 3, 4))

When to use each

In short, prefer the data type Array over List because it is more memory efficient and more performant. If you cannot use Array, then use List:

  • when the values within a column do not have a fixed shape; or
  • when you need functions that are only available in the list API.

Working with lists

The namespace list

Polars provides many functions to work with values of the data type List and these are grouped inside the namespace list. We will explore this namespace a bit now.

arr then, list now

In previous versions of Polars, the namespace for list operations used to be arr. arr is now the namespace for the data type Array. If you find references to the namespace arr on StackOverflow or other sources, note that those sources may be outdated.

The dataframe weather defined below contains data from different weather stations across a region. When the weather station is unable to get a result, an error code is recorded instead of the actual temperature at that time.

weather = pl.DataFrame(
    {
        "station": [f"Station {idx}" for idx in range(1, 6)],
        "temperatures": [
            "20 5 5 E1 7 13 19 9 6 20",
            "18 8 16 11 23 E2 8 E2 E2 E2 90 70 40",
            "19 24 E9 16 6 12 10 22",
            "E2 E0 15 7 8 10 E1 24 17 13 6",
            "14 8 E0 16 22 24 E1",
        ],
    }
)

print(weather)
// Contribute the Rust translation of the Python example by opening a PR.
shape: (5, 2)
┌───────────┬─────────────────────────────────┐
│ station   ┆ temperatures                    │
│ ---       ┆ ---                             │
│ str       ┆ str                             │
╞═══════════╪═════════════════════════════════╡
│ Station 1 ┆ 20 5 5 E1 7 13 19 9 6 20        │
│ Station 2 ┆ 18 8 16 11 23 E2 8 E2 E2 E2 90… │
│ Station 3 ┆ 19 24 E9 16 6 12 10 22          │
│ Station 4 ┆ E2 E0 15 7 8 10 E1 24 17 13 6   │
│ Station 5 ┆ 14 8 E0 16 22 24 E1             │
└───────────┴─────────────────────────────────┘

Programmatically creating lists

Given the dataframe weather defined previously, it is very likely we need to run some analysis on the temperatures that are captured by each station. To make this happen, we need to first be able to get individual temperature measurements. We can use the namespace str for this:

str.split

weather = weather.with_columns(
    pl.col("temperatures").str.split(" "),
)
print(weather)

str.split

// Contribute the Rust translation of the Python example by opening a PR.

shape: (5, 2)
┌───────────┬──────────────────────┐
│ station   ┆ temperatures         │
│ ---       ┆ ---                  │
│ str       ┆ list[str]            │
╞═══════════╪══════════════════════╡
│ Station 1 ┆ ["20", "5", … "20"]  │
│ Station 2 ┆ ["18", "8", … "40"]  │
│ Station 3 ┆ ["19", "24", … "22"] │
│ Station 4 ┆ ["E2", "E0", … "6"]  │
│ Station 5 ┆ ["14", "8", … "E1"]  │
└───────────┴──────────────────────┘

A natural follow-up would be to explode the list of temperatures so that each measurement is in its own row:

explode

result = weather.explode("temperatures")
print(result)

explode

// Contribute the Rust translation of the Python example by opening a PR.

shape: (49, 2)
┌───────────┬──────────────┐
│ station   ┆ temperatures │
│ ---       ┆ ---          │
│ str       ┆ str          │
╞═══════════╪══════════════╡
│ Station 1 ┆ 20           │
│ Station 1 ┆ 5            │
│ Station 1 ┆ 5            │
│ Station 1 ┆ E1           │
│ Station 1 ┆ 7            │
│ …         ┆ …            │
│ Station 5 ┆ E0           │
│ Station 5 ┆ 16           │
│ Station 5 ┆ 22           │
│ Station 5 ┆ 24           │
│ Station 5 ┆ E1           │
└───────────┴──────────────┘

However, in Polars we often do not need to do this to operate on the list elements.

Operating on lists

Polars provides several standard operations on columns with the List data type. Similar to what you can do with strings, lists can be sliced with the functions head, tail, and slice:

list namespace

result = weather.with_columns(
    pl.col("temperatures").list.head(3).alias("head"),
    pl.col("temperatures").list.tail(3).alias("tail"),
    pl.col("temperatures").list.slice(-3, 2).alias("two_next_to_last"),
)
print(result)

list namespace

// Contribute the Rust translation of the Python example by opening a PR.

shape: (5, 5)
┌───────────┬──────────────────────┬────────────────────┬────────────────────┬──────────────────┐
│ station   ┆ temperatures         ┆ head               ┆ tail               ┆ two_next_to_last │
│ ---       ┆ ---                  ┆ ---                ┆ ---                ┆ ---              │
│ str       ┆ list[str]            ┆ list[str]          ┆ list[str]          ┆ list[str]        │
╞═══════════╪══════════════════════╪════════════════════╪════════════════════╪══════════════════╡
│ Station 1 ┆ ["20", "5", … "20"]  ┆ ["20", "5", "5"]   ┆ ["9", "6", "20"]   ┆ ["9", "6"]       │
│ Station 2 ┆ ["18", "8", … "40"]  ┆ ["18", "8", "16"]  ┆ ["90", "70", "40"] ┆ ["90", "70"]     │
│ Station 3 ┆ ["19", "24", … "22"] ┆ ["19", "24", "E9"] ┆ ["12", "10", "22"] ┆ ["12", "10"]     │
│ Station 4 ┆ ["E2", "E0", … "6"]  ┆ ["E2", "E0", "15"] ┆ ["17", "13", "6"]  ┆ ["17", "13"]     │
│ Station 5 ┆ ["14", "8", … "E1"]  ┆ ["14", "8", "E0"]  ┆ ["22", "24", "E1"] ┆ ["22", "24"]     │
└───────────┴──────────────────────┴────────────────────┴────────────────────┴──────────────────┘

Element-wise computation within lists

If we need to identify the stations that are giving the most number of errors we need to

  1. try to convert the measurements into numbers;
  2. count the number of non-numeric values (i.e., null values) in the list, by row; and
  3. rename this output column as “errors” so that we can easily identify the stations.

To perform these steps, we need to perform a casting operation on each measurement within the list values. The function eval is used as the entry point to perform operations on the elements of the list. Within it, you can use the context element to refer to each single element of the list individually, and then you can use any Polars expression on the element:

element

result = weather.with_columns(
    pl.col("temperatures")
    .list.eval(pl.element().cast(pl.Int64, strict=False).is_null())
    .list.sum()
    .alias("errors"),
)
print(result)

element

// Contribute the Rust translation of the Python example by opening a PR.

shape: (5, 3)
┌───────────┬──────────────────────┬────────┐
│ station   ┆ temperatures         ┆ errors │
│ ---       ┆ ---                  ┆ ---    │
│ str       ┆ list[str]            ┆ u32    │
╞═══════════╪══════════════════════╪════════╡
│ Station 1 ┆ ["20", "5", … "20"]  ┆ 1      │
│ Station 2 ┆ ["18", "8", … "40"]  ┆ 4      │
│ Station 3 ┆ ["19", "24", … "22"] ┆ 1      │
│ Station 4 ┆ ["E2", "E0", … "6"]  ┆ 3      │
│ Station 5 ┆ ["14", "8", … "E1"]  ┆ 2      │
└───────────┴──────────────────────┴────────┘

Another alternative would be to use a regular expression to check if a measurement starts with a letter:

element

result2 = weather.with_columns(
    pl.col("temperatures")
    .list.eval(pl.element().str.contains("(?i)[a-z]"))
    .list.sum()
    .alias("errors"),
)
print(result.equals(result2))

element

// Contribute the Rust translation of the Python example by opening a PR.

True

If you are unfamiliar with the namespace str or the notation (?i) in the regex, now is a good time to look at how to work with strings and regular expressions in Polars.

Row-wise computations

The function eval gives us access to the list elements and pl.element refers to each individual element, but we can also use pl.all() to refer to all of the elements of the list.

To show this in action, we will start by creating another dataframe with some more weather data:

weather_by_day = pl.DataFrame(
    {
        "station": [f"Station {idx}" for idx in range(1, 11)],
        "day_1": [17, 11, 8, 22, 9, 21, 20, 8, 8, 17],
        "day_2": [15, 11, 10, 8, 7, 14, 18, 21, 15, 13],
        "day_3": [16, 15, 24, 24, 8, 23, 19, 23, 16, 10],
    }
)
print(weather_by_day)
// Contribute the Rust translation of the Python example by opening a PR.
shape: (10, 4)
┌────────────┬───────┬───────┬───────┐
│ station    ┆ day_1 ┆ day_2 ┆ day_3 │
│ ---        ┆ ---   ┆ ---   ┆ ---   │
│ str        ┆ i64   ┆ i64   ┆ i64   │
╞════════════╪═══════╪═══════╪═══════╡
│ Station 1  ┆ 17    ┆ 15    ┆ 16    │
│ Station 2  ┆ 11    ┆ 11    ┆ 15    │
│ Station 3  ┆ 8     ┆ 10    ┆ 24    │
│ Station 4  ┆ 22    ┆ 8     ┆ 24    │
│ Station 5  ┆ 9     ┆ 7     ┆ 8     │
│ Station 6  ┆ 21    ┆ 14    ┆ 23    │
│ Station 7  ┆ 20    ┆ 18    ┆ 19    │
│ Station 8  ┆ 8     ┆ 21    ┆ 23    │
│ Station 9  ┆ 8     ┆ 15    ┆ 16    │
│ Station 10 ┆ 17    ┆ 13    ┆ 10    │
└────────────┴───────┴───────┴───────┘

Now, we will calculate the percentage rank of the temperatures by day, measured across stations. Polars does not provide a function to do this directly, but because expressions are so versatile we can create our own percentage rank expression for highest temperature. Let's try that:

element · rank

rank_pct = (pl.element().rank(descending=True) / pl.all().count()).round(2)

result = weather_by_day.with_columns(
    # create the list of homogeneous data
    pl.concat_list(pl.all().exclude("station")).alias("all_temps")
).select(
    # select all columns except the intermediate list
    pl.all().exclude("all_temps"),
    # compute the rank by calling `list.eval`
    pl.col("all_temps").list.eval(rank_pct, parallel=True).alias("temps_rank"),
)

print(result)

element · rank

// Contribute the Rust translation of the Python example by opening a PR.

shape: (10, 5)
┌────────────┬───────┬───────┬───────┬────────────────────┐
│ station    ┆ day_1 ┆ day_2 ┆ day_3 ┆ temps_rank         │
│ ---        ┆ ---   ┆ ---   ┆ ---   ┆ ---                │
│ str        ┆ i64   ┆ i64   ┆ i64   ┆ list[f64]          │
╞════════════╪═══════╪═══════╪═══════╪════════════════════╡
│ Station 1  ┆ 17    ┆ 15    ┆ 16    ┆ [0.33, 1.0, 0.67]  │
│ Station 2  ┆ 11    ┆ 11    ┆ 15    ┆ [0.83, 0.83, 0.33] │
│ Station 3  ┆ 8     ┆ 10    ┆ 24    ┆ [1.0, 0.67, 0.33]  │
│ Station 4  ┆ 22    ┆ 8     ┆ 24    ┆ [0.67, 1.0, 0.33]  │
│ Station 5  ┆ 9     ┆ 7     ┆ 8     ┆ [0.33, 1.0, 0.67]  │
│ Station 6  ┆ 21    ┆ 14    ┆ 23    ┆ [0.67, 1.0, 0.33]  │
│ Station 7  ┆ 20    ┆ 18    ┆ 19    ┆ [0.33, 1.0, 0.67]  │
│ Station 8  ┆ 8     ┆ 21    ┆ 23    ┆ [1.0, 0.67, 0.33]  │
│ Station 9  ┆ 8     ┆ 15    ┆ 16    ┆ [1.0, 0.67, 0.33]  │
│ Station 10 ┆ 17    ┆ 13    ┆ 10    ┆ [0.33, 0.67, 1.0]  │
└────────────┴───────┴───────┴───────┴────────────────────┘

Working with arrays

Creating an array column

As we have seen above, Polars usually does not infer the data type Array automatically. You have to specify the data type Array when creating a series/dataframe or cast a column explicitly unless you create the column out of a NumPy array.

The namespace arr

The data type Array was recently introduced and is still pretty nascent in features that it offers. Even so, the namespace arr aggregates several functions that you can use to work with arrays.

arr then, list now

In previous versions of Polars, the namespace for list operations used to be arr. arr is now the namespace for the data type Array. If you find references to the namespace arr on StackOverflow or other sources, note that those sources may be outdated.

The API documentation should give you a good overview of the functions in the namespace arr, of which we present a couple:

arr namespace

df = pl.DataFrame(
    {
        "first_last": [
            ["Anne", "Adams"],
            ["Brandon", "Branson"],
            ["Camila", "Campbell"],
            ["Dennis", "Doyle"],
        ],
        "fav_numbers": [
            [42, 0, 1],
            [2, 3, 5],
            [13, 21, 34],
            [73, 3, 7],
        ],
    },
    schema={
        "first_last": pl.Array(pl.String, 2),
        "fav_numbers": pl.Array(pl.Int32, 3),
    },
)

result = df.select(
    pl.col("first_last").arr.join(" ").alias("name"),
    pl.col("fav_numbers").arr.sort(),
    pl.col("fav_numbers").arr.max().alias("largest_fav"),
    pl.col("fav_numbers").arr.sum().alias("summed"),
    pl.col("fav_numbers").arr.contains(3).alias("likes_3"),
)
print(result)

`arr namespace` · Available on feature dtype-array

// Contribute the Rust translation of the Python example by opening a PR.

shape: (4, 5)
┌─────────────────┬───────────────┬─────────────┬────────┬─────────┐
│ name            ┆ fav_numbers   ┆ largest_fav ┆ summed ┆ likes_3 │
│ ---             ┆ ---           ┆ ---         ┆ ---    ┆ ---     │
│ str             ┆ array[i32, 3] ┆ i32         ┆ i32    ┆ bool    │
╞═════════════════╪═══════════════╪═════════════╪════════╪═════════╡
│ Anne Adams      ┆ [0, 1, 42]    ┆ 42          ┆ 43     ┆ false   │
│ Brandon Branson ┆ [2, 3, 5]     ┆ 5           ┆ 10     ┆ true    │
│ Camila Campbell ┆ [13, 21, 34]  ┆ 34          ┆ 68     ┆ false   │
│ Dennis Doyle    ┆ [3, 7, 73]    ┆ 73          ┆ 83     ┆ true    │
└─────────────────┴───────────────┴─────────────┴────────┴─────────┘