Data types and structures

Data types

Polars supports a variety of data types that fall broadly under the following categories:

  • Numeric data types: signed integers, unsigned integers, floating point numbers, and decimals.
  • Nested data types: lists, structs, and arrays.
  • Temporal: dates, datetimes, times, and time deltas.
  • Miscellaneous: strings, binary data, Booleans, categoricals, enums, and objects.

All types support missing values represented by the special value null. This is not to be conflated with the special value NaN in floating point data types; see the section about floating point numbers for more information.
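
To make the distinction concrete, here is a minimal Python sketch (the series name and values are arbitrary) showing that None becomes a null while float("nan") stays a NaN:

import polars as pl

s = pl.Series("values", [1.0, None, float("nan")])

# None is stored as a missing value (null); NaN remains a regular float value.
print(s.is_null())  # [false, true, false]
print(s.is_nan())   # [false, null, true]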

You can also find a full table of all supported data types in the appendix, with notes on when to use each one and links to the relevant parts of the documentation.

Series

The core data structures provided by Polars are series and dataframes. A series is a 1-dimensional homogeneous data structure. By “homogeneous” we mean that all elements inside a series have the same data type. The snippet below shows how to create a named series:

Series

import polars as pl

s = pl.Series("ints", [1, 2, 3, 4, 5])
print(s)

Series

use polars::prelude::*;

let s = Series::new("ints".into(), &[1, 2, 3, 4, 5]);

println!("{}", s);

shape: (5,)
Series: 'ints' [i64]
[
    1
    2
    3
    4
    5
]

When creating a series, Polars will infer the data type from the values you provide. You can specify a concrete data type to override the inference mechanism:

Series

s1 = pl.Series("ints", [1, 2, 3, 4, 5])
s2 = pl.Series("uints", [1, 2, 3, 4, 5], dtype=pl.UInt64)
print(s1.dtype, s2.dtype)

Series

let s1 = Series::new("ints".into(), &[1, 2, 3, 4, 5]);
let s2 = Series::new("uints".into(), &[1, 2, 3, 4, 5])
    .cast(&DataType::UInt64) // Here, we actually cast after inference.
    .unwrap();
println!("{} {}", s1.dtype(), s2.dtype()); // i32 u64

Int64 UInt64

Dataframe

A dataframe is a 2-dimensional heterogeneous data structure that contains uniquely named series. Holding your data in a dataframe lets you use the Polars API to write queries that manipulate your data, through the contexts and expressions provided by Polars that we will talk about next.

The snippet below shows how to create a dataframe from a dictionary of lists:

DataFrame

from datetime import date

df = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate": [
            date(1997, 1, 10),
            date(1985, 2, 15),
            date(1983, 3, 22),
            date(1981, 4, 30),
        ],
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)
    }
)

print(df)

DataFrame

use chrono::prelude::*;

let df: DataFrame = df!(
    "name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
    "birthdate" => [
        NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
        NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
        NaiveDate::from_ymd_opt(1983, 3, 22).unwrap(),
        NaiveDate::from_ymd_opt(1981, 4, 30).unwrap(),
    ],
    "weight" => [57.9, 72.5, 53.6, 83.1],  // (kg)
    "height" => [1.56, 1.77, 1.65, 1.75],  // (m)
)
.unwrap();
println!("{}", df);

shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘

Inspecting a dataframe

In this subsection we will show some useful methods to quickly inspect a dataframe. We will use the dataframe we created earlier as a starting point.

The function head shows the first rows of a dataframe. By default, you get the first 5 rows but you can also specify the number of rows you want:

head

print(df.head(3))

head

let df_head = df.head(Some(3));

println!("{}", df_head);

shape: (3, 4)
┌──────────────┬────────────┬────────┬────────┐
│ name         ┆ birthdate  ┆ weight ┆ height │
│ ---          ┆ ---        ┆ ---    ┆ ---    │
│ str          ┆ date       ┆ f64    ┆ f64    │
╞══════════════╪════════════╪════════╪════════╡
│ Alice Archer ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown    ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
└──────────────┴────────────┴────────┴────────┘

Glimpse

The function glimpse also shows the values of the first few rows of a dataframe, but formats the output differently from head. Here, each line of the output corresponds to a single column, making it easier to inspect wider dataframes:

glimpse

print(df.glimpse(return_as_string=True))

Rows: 4
Columns: 4
$ name       <str> 'Alice Archer', 'Ben Brown', 'Chloe Cooper', 'Daniel Donovan'
$ birthdate <date> 1997-01-10, 1985-02-15, 1983-03-22, 1981-04-30
$ weight     <f64> 57.9, 72.5, 53.6, 83.1
$ height     <f64> 1.56, 1.77, 1.65, 1.75

Info

glimpse is only available for Python users.

Tail

The function tail shows the last rows of a dataframe. By default, you get the last 5 rows but you can also specify the number of rows you want, similar to how head works:

tail

print(df.tail(3))

tail

let df_tail = df.tail(Some(3));

println!("{}", df_tail);

shape: (3, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘

Sample

If you think the first or last rows of your dataframe are not representative of your data, you can use sample to get an arbitrary number of randomly selected rows from the dataframe. Note that the rows are not necessarily returned in the same order as they appear in the dataframe:

sample

import random

random.seed(42)  # For reproducibility.

print(df.sample(2))

sample_n

let n = Series::new("".into(), &[2]);
let sampled_df = df.sample_n(&n, false, false, None).unwrap();

println!("{}", sampled_df);

shape: (2, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
└────────────────┴────────────┴────────┴────────┘

Describe

You can also use describe to compute summary statistics for all columns of your dataframe:

describe

print(df.describe())

describe · Available on feature describe

// Not available in Rust

shape: (9, 5)
┌────────────┬────────────────┬─────────────────────┬───────────┬──────────┐
│ statistic  ┆ name           ┆ birthdate           ┆ weight    ┆ height   │
│ ---        ┆ ---            ┆ ---                 ┆ ---       ┆ ---      │
│ str        ┆ str            ┆ str                 ┆ f64       ┆ f64      │
╞════════════╪════════════════╪═════════════════════╪═══════════╪══════════╡
│ count      ┆ 4              ┆ 4                   ┆ 4.0       ┆ 4.0      │
│ null_count ┆ 0              ┆ 0                   ┆ 0.0       ┆ 0.0      │
│ mean       ┆ null           ┆ 1986-09-04 00:00:00 ┆ 66.775    ┆ 1.6825   │
│ std        ┆ null           ┆ null                ┆ 13.560082 ┆ 0.097082 │
│ min        ┆ Alice Archer   ┆ 1981-04-30          ┆ 53.6      ┆ 1.56     │
│ 25%        ┆ null           ┆ 1983-03-22          ┆ 57.9      ┆ 1.65     │
│ 50%        ┆ null           ┆ 1985-02-15          ┆ 72.5      ┆ 1.75     │
│ 75%        ┆ null           ┆ 1985-02-15          ┆ 72.5      ┆ 1.75     │
│ max        ┆ Daniel Donovan ┆ 1997-01-10          ┆ 83.1      ┆ 1.77     │
└────────────┴────────────────┴─────────────────────┴───────────┴──────────┘

Schema

When talking about data (in a dataframe or otherwise) we can refer to its schema. The schema is a mapping of column or series names to the data types of those same columns or series.

You can check the schema of a dataframe with schema:

print(df.schema)

println!("{:?}", df.schema());

Schema({'name': String, 'birthdate': Date, 'weight': Float64, 'height': Float64})

Much like with series, Polars will infer the schema of a dataframe when you create it but you can override the inference system if needed.

In Python, you can specify an explicit schema by using a dictionary to map column names to data types. You can use the value None if you do not wish to override inference for a given column:

df = pl.DataFrame(
    {
        "name": ["Alice", "Ben", "Chloe", "Daniel"],
        "age": [27, 39, 41, 43],
    },
    schema={"name": None, "age": pl.UInt8},
)

print(df)
shape: (4, 2)
┌────────┬─────┐
│ name   ┆ age │
│ ---    ┆ --- │
│ str    ┆ u8  │
╞════════╪═════╡
│ Alice  ┆ 27  │
│ Ben    ┆ 39  │
│ Chloe  ┆ 41  │
│ Daniel ┆ 43  │
└────────┴─────┘

If you only need to override the inference of some columns, the parameter schema_overrides tends to be more convenient because it lets you omit columns for which you do not want to override the inference:

df = pl.DataFrame(
    {
        "name": ["Alice", "Ben", "Chloe", "Daniel"],
        "age": [27, 39, 41, 43],
    },
    schema_overrides={"age": pl.UInt8},
)

print(df)
shape: (4, 2)
┌────────┬─────┐
│ name   ┆ age │
│ ---    ┆ --- │
│ str    ┆ u8  │
╞════════╪═════╡
│ Alice  ┆ 27  │
│ Ben    ┆ 39  │
│ Chloe  ┆ 41  │
│ Daniel ┆ 43  │
└────────┴─────┘

Data types internals

Polars utilizes the Arrow Columnar Format for its data orientation. Following this specification allows Polars to transfer data to/from other tools that also use the Arrow specification with little to no overhead.
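
As a small illustration of that interoperability in Python, the sketch below round-trips a dataframe through pyarrow; it assumes the pyarrow package is installed and the data is made up:

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Hand the data to pyarrow; both libraries use the Arrow memory layout,
# so this conversion is typically very cheap.
arrow_table = df.to_arrow()

# And build a Polars dataframe back from the Arrow table.
df_roundtrip = pl.from_arrow(arrow_table)
print(df_roundtrip.equals(df))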

Polars gets most of its performance from its query engine, the optimizations it performs on your query plans, and from the parallelization that it employs when running your expressions.
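
Those query plans are easiest to see through the lazy API, which later sections of the user guide cover in detail; the sketch below is only a small, self-contained peek (the data is made up):

import polars as pl

df = pl.DataFrame({"name": ["Alice", "Ben"], "height": [1.56, 1.77]})

# Nothing is computed yet; Polars only builds a query plan.
lazy_query = df.lazy().filter(pl.col("height") > 1.7).select("name")

# explain() prints the plan that the engine will run, after optimization.
print(lazy_query.explain())
print(lazy_query.collect())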

Floating point numbers

Polars generally follows the IEEE 754 floating point standard for Float32 and Float64, with some exceptions:

  • Any NaN compares equal to any other NaN, and greater than any non-NaN value.
  • Operations do not guarantee any particular behavior on the sign of zero or NaN, nor on the payload of NaN values. This is not just limited to arithmetic operations, e.g. a sort or group by operation may canonicalize all zeroes to +0 and all NaNs to a positive NaN without payload for efficient equality checks.

Polars always attempts to provide reasonably accurate results for floating point computations but does not provide guarantees on the error unless mentioned otherwise. Generally speaking, 100% accurate results are infeasibly expensive to achieve (requiring much larger internal representations than 64-bit floats), and thus some error is always to be expected.
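
For example, the NaN ordering described in the bullets above can be observed directly when sorting or comparing (a minimal sketch with made-up values):

import polars as pl

nan = float("nan")
s = pl.Series("values", [2.0, nan, 1.0])

# NaN sorts as greater than any non-NaN value.
print(s.sort())  # [1.0, 2.0, NaN]

# Per the rule above, NaN compares equal to NaN.
print(pl.Series([nan]) == pl.Series([nan]))  # [true]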

Appendix: full data types table

Type(s) Details
Boolean Boolean type that is bit packed efficiently.
Int8, Int16, Int32, Int64 Varying-precision signed integer types.
UInt8, UInt16, UInt32, UInt64 Varying-precision unsigned integer types.
Float32, Float64 Varying-precision signed floating point numbers.
Decimal Decimal 128-bit type with optional precision and non-negative scale. Use this if you need fine-grained control over the precision of your floats and the operations you make on them. See Python's decimal.Decimal for documentation on what a decimal data type is.
String Variable length UTF-8 encoded string data, typically human-readable.
Binary Stores arbitrary, varying length raw binary data.
Date Represents a calendar date.
Time Represents a time of day.
Datetime Represents a calendar date and time of day.
Duration Represents a time duration.
Array Arrays with a known, fixed shape per series; akin to numpy arrays. Learn more about how arrays and lists differ and how to work with both.
List Homogeneous 1D container with variable length. Learn more about how arrays and lists differ and how to work with both.
Object Wraps arbitrary Python objects.
Categorical Efficient encoding of string data where the categories are inferred at runtime. Learn more about how categoricals and enums differ and how to work with both.
Enum Efficient ordered encoding of a set of predetermined string categories. Learn more about how categoricals and enums differ and how to work with both.
Struct Composite product type that can store multiple fields. Learn more about the data type Struct in its dedicated documentation section.
Null Represents null values.
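
As a small, non-exhaustive illustration, the Python sketch below creates series with a few of the nested and categorical types from the table (the names and values are arbitrary):

import polars as pl

# List: homogeneous 1D containers with variable length per element.
lists = pl.Series("lists", [[1, 2], [3], []])

# Array: every element has the same fixed shape.
arrays = pl.Series("arrays", [[1, 2], [3, 4]], dtype=pl.Array(pl.Int64, 2))

# Struct: composite type with named fields.
structs = pl.Series("structs", [{"x": 1, "y": "a"}, {"x": 2, "y": "b"}])

# Enum: a predetermined, ordered set of string categories.
enums = pl.Series("enums", ["low", "high"], dtype=pl.Enum(["low", "high"]))

print(lists.dtype, arrays.dtype, structs.dtype, enums.dtype)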