Lists and arrays
Polars has first-class support for two homogeneous container data types: List
and Array
. Polars
supports many operations with the two data types and their APIs overlap, so this section of the user
guide has the objective of clarifying when one data type should be chosen in favour of the other.
Lists vs arrays
The data type List
The data type list is suitable for columns whose values are homogeneous 1D containers of varying lengths.
The dataframe below contains three examples of columns with the data type List
:
from datetime import datetime
import polars as pl
df = pl.DataFrame(
{
"names": [
["Anne", "Averill", "Adams"],
["Brandon", "Brooke", "Borden", "Branson"],
["Camila", "Campbell"],
["Dennis", "Doyle"],
],
"children_ages": [
[5, 7],
[],
[],
[8, 11, 18],
],
"medical_appointments": [
[],
[],
[],
[datetime(2022, 5, 22, 16, 30)],
],
}
)
print(df)
// Contribute the Rust translation of the Python example by opening a PR.
shape: (4, 3)
┌─────────────────────────────────┬───────────────┬───────────────────────┐
│ names ┆ children_ages ┆ medical_appointments │
│ --- ┆ --- ┆ --- │
│ list[str] ┆ list[i64] ┆ list[datetime[μs]] │
╞═════════════════════════════════╪═══════════════╪═══════════════════════╡
│ ["Anne", "Averill", "Adams"] ┆ [5, 7] ┆ [] │
│ ["Brandon", "Brooke", … "Brans… ┆ [] ┆ [] │
│ ["Camila", "Campbell"] ┆ [] ┆ [] │
│ ["Dennis", "Doyle"] ┆ [8, 11, 18] ┆ [2022-05-22 16:30:00] │
└─────────────────────────────────┴───────────────┴───────────────────────┘
Note that the data type List
is different from Python's type list
, where elements can be of any
type. If you want to store true Python lists in a column, you can do so with the data type Object
and your column will not have the list manipulation features that we're about to discuss.
The data type Array
The data type Array
is suitable for columns whose values are homogeneous containers of an
arbitrary dimension with a known and fixed shape.
The dataframe below contains two examples of columns with the data type Array
.
df = pl.DataFrame(
{
"bit_flags": [
[True, True, True, True, False],
[False, True, True, True, True],
],
"tic_tac_toe": [
[
[" ", "x", "o"],
[" ", "x", " "],
["o", "x", " "],
],
[
["o", "x", "x"],
[" ", "o", "x"],
[" ", " ", "o"],
],
],
},
schema={
"bit_flags": pl.Array(pl.Boolean, 5),
"tic_tac_toe": pl.Array(pl.String, (3, 3)),
},
)
print(df)
Array
· Available on feature dtype-array
// Contribute the Rust translation of the Python example by opening a PR.
shape: (2, 2)
┌───────────────────────┬─────────────────────────────────┐
│ bit_flags ┆ tic_tac_toe │
│ --- ┆ --- │
│ array[bool, 5] ┆ array[str, (3, 3)] │
╞═══════════════════════╪═════════════════════════════════╡
│ [true, true, … false] ┆ [[" ", "x", "o"], [" ", "x", "… │
│ [false, true, … true] ┆ [["o", "x", "x"], [" ", "o", "… │
└───────────────────────┴─────────────────────────────────┘
The example above shows how to specify that the columns “bit_flags” and “tic_tac_toe” have the data
type Array
, parametrised by the data type of the elements contained within and by the shape of
each array.
In general, Polars does not infer that a column has the data type Array
for performance reasons,
and defaults to the appropriate variant of the data type List
. In Python, an exception to this
rule is when you provide a NumPy array to build a column. In that case, Polars has the guarantee
from NumPy that all subarrays have the same shape, so an array of \(n + 1\) dimensions will generate a
column of \(n\) dimensional arrays:
import numpy as np
array = np.arange(0, 120).reshape((5, 2, 3, 4)) # 4D array
print(pl.Series(array).dtype) # Column with the 3D subarrays
Array
· Available on feature dtype-array
// Contribute the Rust translation of the Python example by opening a PR.
Array(Int64, shape=(2, 3, 4))
When to use each
In short, prefer the data type Array
over List
because it is more memory efficient and more
performant. If you cannot use Array
, then use List
:
- when the values within a column do not have a fixed shape; or
- when you need functions that are only available in the list API.
Working with lists
The namespace list
Polars provides many functions to work with values of the data type List
and these are grouped
inside the namespace list
. We will explore this namespace a bit now.
arr
then, list
now
In previous versions of Polars, the namespace for list operations used to be arr
.
arr
is now the namespace for the data type Array
.
If you find references to the namespace arr
on StackOverflow or other sources, note that those sources may be outdated.
The dataframe weather
defined below contains data from different weather stations across a region.
When the weather station is unable to get a result, an error code is recorded instead of the actual
temperature at that time.
weather = pl.DataFrame(
{
"station": [f"Station {idx}" for idx in range(1, 6)],
"temperatures": [
"20 5 5 E1 7 13 19 9 6 20",
"18 8 16 11 23 E2 8 E2 E2 E2 90 70 40",
"19 24 E9 16 6 12 10 22",
"E2 E0 15 7 8 10 E1 24 17 13 6",
"14 8 E0 16 22 24 E1",
],
}
)
print(weather)
// Contribute the Rust translation of the Python example by opening a PR.
shape: (5, 2)
┌───────────┬─────────────────────────────────┐
│ station ┆ temperatures │
│ --- ┆ --- │
│ str ┆ str │
╞═══════════╪═════════════════════════════════╡
│ Station 1 ┆ 20 5 5 E1 7 13 19 9 6 20 │
│ Station 2 ┆ 18 8 16 11 23 E2 8 E2 E2 E2 90… │
│ Station 3 ┆ 19 24 E9 16 6 12 10 22 │
│ Station 4 ┆ E2 E0 15 7 8 10 E1 24 17 13 6 │
│ Station 5 ┆ 14 8 E0 16 22 24 E1 │
└───────────┴─────────────────────────────────┘
Programmatically creating lists
Given the dataframe weather
defined previously, it is very likely we need to run some analysis on
the temperatures that are captured by each station. To make this happen, we need to first be able to
get individual temperature measurements. We
can use the namespace str
for this:
shape: (5, 2)
┌───────────┬──────────────────────┐
│ station ┆ temperatures │
│ --- ┆ --- │
│ str ┆ list[str] │
╞═══════════╪══════════════════════╡
│ Station 1 ┆ ["20", "5", … "20"] │
│ Station 2 ┆ ["18", "8", … "40"] │
│ Station 3 ┆ ["19", "24", … "22"] │
│ Station 4 ┆ ["E2", "E0", … "6"] │
│ Station 5 ┆ ["14", "8", … "E1"] │
└───────────┴──────────────────────┘
A natural follow-up would be to explode the list of temperatures so that each measurement is in its own row:
shape: (49, 2)
┌───────────┬──────────────┐
│ station ┆ temperatures │
│ --- ┆ --- │
│ str ┆ str │
╞═══════════╪══════════════╡
│ Station 1 ┆ 20 │
│ Station 1 ┆ 5 │
│ Station 1 ┆ 5 │
│ Station 1 ┆ E1 │
│ Station 1 ┆ 7 │
│ … ┆ … │
│ Station 5 ┆ E0 │
│ Station 5 ┆ 16 │
│ Station 5 ┆ 22 │
│ Station 5 ┆ 24 │
│ Station 5 ┆ E1 │
└───────────┴──────────────┘
However, in Polars we often do not need to do this to operate on the list elements.
Operating on lists
Polars provides several standard operations on columns with the List
data type.
Similar to what you can do with strings, lists can be sliced with the
functions head
, tail
, and slice
:
result = weather.with_columns(
pl.col("temperatures").list.head(3).alias("head"),
pl.col("temperatures").list.tail(3).alias("tail"),
pl.col("temperatures").list.slice(-3, 2).alias("two_next_to_last"),
)
print(result)
// Contribute the Rust translation of the Python example by opening a PR.
shape: (5, 5)
┌───────────┬──────────────────────┬────────────────────┬────────────────────┬──────────────────┐
│ station ┆ temperatures ┆ head ┆ tail ┆ two_next_to_last │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ list[str] ┆ list[str] ┆ list[str] ┆ list[str] │
╞═══════════╪══════════════════════╪════════════════════╪════════════════════╪══════════════════╡
│ Station 1 ┆ ["20", "5", … "20"] ┆ ["20", "5", "5"] ┆ ["9", "6", "20"] ┆ ["9", "6"] │
│ Station 2 ┆ ["18", "8", … "40"] ┆ ["18", "8", "16"] ┆ ["90", "70", "40"] ┆ ["90", "70"] │
│ Station 3 ┆ ["19", "24", … "22"] ┆ ["19", "24", "E9"] ┆ ["12", "10", "22"] ┆ ["12", "10"] │
│ Station 4 ┆ ["E2", "E0", … "6"] ┆ ["E2", "E0", "15"] ┆ ["17", "13", "6"] ┆ ["17", "13"] │
│ Station 5 ┆ ["14", "8", … "E1"] ┆ ["14", "8", "E0"] ┆ ["22", "24", "E1"] ┆ ["22", "24"] │
└───────────┴──────────────────────┴────────────────────┴────────────────────┴──────────────────┘
Element-wise computation within lists
If we need to identify the stations that are giving the most number of errors we need to
- try to convert the measurements into numbers;
- count the number of non-numeric values (i.e.,
null
values) in the list, by row; and - rename this output column as “errors” so that we can easily identify the stations.
To perform these steps, we need to perform a casting operation on each measurement within the list
values. The function eval
is used as the entry point to perform operations on the elements of the
list. Within it, you can use the context element
to refer to each single element of the list
individually, and then you can use any Polars expression on the element:
shape: (5, 3)
┌───────────┬──────────────────────┬────────┐
│ station ┆ temperatures ┆ errors │
│ --- ┆ --- ┆ --- │
│ str ┆ list[str] ┆ u32 │
╞═══════════╪══════════════════════╪════════╡
│ Station 1 ┆ ["20", "5", … "20"] ┆ 1 │
│ Station 2 ┆ ["18", "8", … "40"] ┆ 4 │
│ Station 3 ┆ ["19", "24", … "22"] ┆ 1 │
│ Station 4 ┆ ["E2", "E0", … "6"] ┆ 3 │
│ Station 5 ┆ ["14", "8", … "E1"] ┆ 2 │
└───────────┴──────────────────────┴────────┘
Another alternative would be to use a regular expression to check if a measurement starts with a letter:
True
If you are unfamiliar with the namespace str
or the notation (?i)
in the regex, now is a good
time to
look at how to work with strings and regular expressions in Polars.
Row-wise computations
The function eval
gives us access to the list elements and pl.element
refers to each individual
element, but we can also use pl.all()
to refer to all of the elements of the list.
To show this in action, we will start by creating another dataframe with some more weather data:
weather_by_day = pl.DataFrame(
{
"station": [f"Station {idx}" for idx in range(1, 11)],
"day_1": [17, 11, 8, 22, 9, 21, 20, 8, 8, 17],
"day_2": [15, 11, 10, 8, 7, 14, 18, 21, 15, 13],
"day_3": [16, 15, 24, 24, 8, 23, 19, 23, 16, 10],
}
)
print(weather_by_day)
// Contribute the Rust translation of the Python example by opening a PR.
shape: (10, 4)
┌────────────┬───────┬───────┬───────┐
│ station ┆ day_1 ┆ day_2 ┆ day_3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═══════╪═══════╪═══════╡
│ Station 1 ┆ 17 ┆ 15 ┆ 16 │
│ Station 2 ┆ 11 ┆ 11 ┆ 15 │
│ Station 3 ┆ 8 ┆ 10 ┆ 24 │
│ Station 4 ┆ 22 ┆ 8 ┆ 24 │
│ Station 5 ┆ 9 ┆ 7 ┆ 8 │
│ Station 6 ┆ 21 ┆ 14 ┆ 23 │
│ Station 7 ┆ 20 ┆ 18 ┆ 19 │
│ Station 8 ┆ 8 ┆ 21 ┆ 23 │
│ Station 9 ┆ 8 ┆ 15 ┆ 16 │
│ Station 10 ┆ 17 ┆ 13 ┆ 10 │
└────────────┴───────┴───────┴───────┘
Now, we will calculate the percentage rank of the temperatures by day, measured across stations. Polars does not provide a function to do this directly, but because expressions are so versatile we can create our own percentage rank expression for highest temperature. Let's try that:
rank_pct = (pl.element().rank(descending=True) / pl.all().count()).round(2)
result = weather_by_day.with_columns(
# create the list of homogeneous data
pl.concat_list(pl.all().exclude("station")).alias("all_temps")
).select(
# select all columns except the intermediate list
pl.all().exclude("all_temps"),
# compute the rank by calling `list.eval`
pl.col("all_temps").list.eval(rank_pct, parallel=True).alias("temps_rank"),
)
print(result)
shape: (10, 5)
┌────────────┬───────┬───────┬───────┬────────────────────┐
│ station ┆ day_1 ┆ day_2 ┆ day_3 ┆ temps_rank │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ list[f64] │
╞════════════╪═══════╪═══════╪═══════╪════════════════════╡
│ Station 1 ┆ 17 ┆ 15 ┆ 16 ┆ [0.33, 1.0, 0.67] │
│ Station 2 ┆ 11 ┆ 11 ┆ 15 ┆ [0.83, 0.83, 0.33] │
│ Station 3 ┆ 8 ┆ 10 ┆ 24 ┆ [1.0, 0.67, 0.33] │
│ Station 4 ┆ 22 ┆ 8 ┆ 24 ┆ [0.67, 1.0, 0.33] │
│ Station 5 ┆ 9 ┆ 7 ┆ 8 ┆ [0.33, 1.0, 0.67] │
│ Station 6 ┆ 21 ┆ 14 ┆ 23 ┆ [0.67, 1.0, 0.33] │
│ Station 7 ┆ 20 ┆ 18 ┆ 19 ┆ [0.33, 1.0, 0.67] │
│ Station 8 ┆ 8 ┆ 21 ┆ 23 ┆ [1.0, 0.67, 0.33] │
│ Station 9 ┆ 8 ┆ 15 ┆ 16 ┆ [1.0, 0.67, 0.33] │
│ Station 10 ┆ 17 ┆ 13 ┆ 10 ┆ [0.33, 0.67, 1.0] │
└────────────┴───────┴───────┴───────┴────────────────────┘
Working with arrays
Creating an array column
As we have seen above, Polars usually does not infer the data type Array
automatically. You have to specify the data type Array
when creating a series/dataframe or
cast a column explicitly unless you create the column out of a NumPy array.
The namespace arr
The data type Array
was recently introduced and is still pretty nascent in features that it
offers. Even so, the namespace arr
aggregates several functions that you can use to work with
arrays.
arr
then, list
now
In previous versions of Polars, the namespace for list operations used to be arr
.
arr
is now the namespace for the data type Array
.
If you find references to the namespace arr
on StackOverflow or other sources, note that those sources may be outdated.
The API documentation should give you a good overview of the functions in the namespace arr
, of
which we present a couple:
df = pl.DataFrame(
{
"first_last": [
["Anne", "Adams"],
["Brandon", "Branson"],
["Camila", "Campbell"],
["Dennis", "Doyle"],
],
"fav_numbers": [
[42, 0, 1],
[2, 3, 5],
[13, 21, 34],
[73, 3, 7],
],
},
schema={
"first_last": pl.Array(pl.String, 2),
"fav_numbers": pl.Array(pl.Int32, 3),
},
)
result = df.select(
pl.col("first_last").arr.join(" ").alias("name"),
pl.col("fav_numbers").arr.sort(),
pl.col("fav_numbers").arr.max().alias("largest_fav"),
pl.col("fav_numbers").arr.sum().alias("summed"),
pl.col("fav_numbers").arr.contains(3).alias("likes_3"),
)
print(result)
`arr
namespace` · Available on feature dtype-array
// Contribute the Rust translation of the Python example by opening a PR.
shape: (4, 5)
┌─────────────────┬───────────────┬─────────────┬────────┬─────────┐
│ name ┆ fav_numbers ┆ largest_fav ┆ summed ┆ likes_3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ array[i32, 3] ┆ i32 ┆ i32 ┆ bool │
╞═════════════════╪═══════════════╪═════════════╪════════╪═════════╡
│ Anne Adams ┆ [0, 1, 42] ┆ 42 ┆ 43 ┆ false │
│ Brandon Branson ┆ [2, 3, 5] ┆ 5 ┆ 10 ┆ true │
│ Camila Campbell ┆ [13, 21, 34] ┆ 34 ┆ 68 ┆ false │
│ Dennis Doyle ┆ [3, 7, 73] ┆ 73 ┆ 83 ┆ true │
└─────────────────┴───────────────┴─────────────┴────────┴─────────┘