Data types and structures
Data types
Polars supports a variety of data types that fall broadly under the following categories:
- Numeric data types: signed integers, unsigned integers, floating point numbers, and decimals.
- Nested data types: lists, structs, and arrays.
- Temporal: dates, datetimes, times, and time deltas.
- Miscellaneous: strings, binary data, Booleans, categoricals, enums, and objects.
All types support missing values represented by the special value null
. This is not to be
conflated with the special value NaN
in floating number data types; see the
section about floating point numbers for more information.
You can also find a full table with all data types supported in the appendix with notes on when to use each data type and with links to relevant parts of the documentation.
Series
The core base data structures provided by Polars are series and dataframes. A series is a 1-dimensional homogeneous data structure. By “homogeneous” we mean that all elements inside a series have the same data type. The snippet below shows how to create a named series:
shape: (5,)
Series: 'ints' [i64]
[
1
2
3
4
5
]
When creating a series, Polars will infer the data type from the values you provide. You can specify a concrete data type to override the inference mechanism:
s1 = pl.Series("ints", [1, 2, 3, 4, 5])
s2 = pl.Series("uints", [1, 2, 3, 4, 5], dtype=pl.UInt64)
print(s1.dtype, s2.dtype)
let s1 = Series::new("ints".into(), &[1, 2, 3, 4, 5]);
let s2 = Series::new("uints".into(), &[1, 2, 3, 4, 5])
.cast(&DataType::UInt64) // Here, we actually cast after inference.
.unwrap();
println!("{} {}", s1.dtype(), s2.dtype()); // i32 u64
Int64 UInt64
Dataframe
A dataframe is a 2-dimensional heterogeneous data structure that contains uniquely named series. By holding your data in a dataframe you will be able to use the Polars API to write queries that manipulate your data. You will be able to do this by using the contexts and expressions provided by Polars that we will talk about next.
The snippet below shows how to create a dataframe from a dictionary of lists:
from datetime import date
df = pl.DataFrame(
{
"name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
"birthdate": [
date(1997, 1, 10),
date(1985, 2, 15),
date(1983, 3, 22),
date(1981, 4, 30),
],
"weight": [57.9, 72.5, 53.6, 83.1], # (kg)
"height": [1.56, 1.77, 1.65, 1.75], # (m)
}
)
print(df)
use chrono::prelude::*;
let df: DataFrame = df!(
"name" => ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
"birthdate" => [
NaiveDate::from_ymd_opt(1997, 1, 10).unwrap(),
NaiveDate::from_ymd_opt(1985, 2, 15).unwrap(),
NaiveDate::from_ymd_opt(1983, 3, 22).unwrap(),
NaiveDate::from_ymd_opt(1981, 4, 30).unwrap(),
],
"weight" => [57.9, 72.5, 53.6, 83.1], // (kg)
"height" => [1.56, 1.77, 1.65, 1.75], // (m)
)
.unwrap();
println!("{}", df);
shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name ┆ birthdate ┆ weight ┆ height │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ f64 ┆ f64 │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer ┆ 1997-01-10 ┆ 57.9 ┆ 1.56 │
│ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 │
│ Chloe Cooper ┆ 1983-03-22 ┆ 53.6 ┆ 1.65 │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1 ┆ 1.75 │
└────────────────┴────────────┴────────┴────────┘
Inspecting a dataframe
In this subsection we will show some useful methods to quickly inspect a dataframe. We will use the dataframe we created earlier as a starting point.
Head
The function head
shows the first rows of a dataframe. By default, you get the first 5 rows but
you can also specify the number of rows you want:
shape: (3, 4)
┌──────────────┬────────────┬────────┬────────┐
│ name ┆ birthdate ┆ weight ┆ height │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ f64 ┆ f64 │
╞══════════════╪════════════╪════════╪════════╡
│ Alice Archer ┆ 1997-01-10 ┆ 57.9 ┆ 1.56 │
│ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 │
│ Chloe Cooper ┆ 1983-03-22 ┆ 53.6 ┆ 1.65 │
└──────────────┴────────────┴────────┴────────┘
Glimpse
The function glimpse
is another function that shows the values of the first few rows of a
dataframe, but formats the output differently from head
. Here, each line of the output corresponds
to a single column, making it easier to take inspect wider dataframes:
print(df.glimpse(return_as_string=True))
Rows: 4
Columns: 4
$ name <str> 'Alice Archer', 'Ben Brown', 'Chloe Cooper', 'Daniel Donovan'
$ birthdate <date> 1997-01-10, 1985-02-15, 1983-03-22, 1981-04-30
$ weight <f64> 57.9, 72.5, 53.6, 83.1
$ height <f64> 1.56, 1.77, 1.65, 1.75
Info
glimpse
is only available for Python users.
Tail
The function tail
shows the last rows of a dataframe. By default, you get the last 5 rows but you
can also specify the number of rows you want, similar to how head
works:
shape: (3, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name ┆ birthdate ┆ weight ┆ height │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ f64 ┆ f64 │
╞════════════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 │
│ Chloe Cooper ┆ 1983-03-22 ┆ 53.6 ┆ 1.65 │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1 ┆ 1.75 │
└────────────────┴────────────┴────────┴────────┘
Sample
If you think the first or last rows of your dataframe are not representative of your data, you can
use sample
to get an arbitrary number of randomly selected rows from the DataFrame. Note that the
rows are not necessarily returned in the same order as they appear in the dataframe:
shape: (2, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name ┆ birthdate ┆ weight ┆ height │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ f64 ┆ f64 │
╞════════════════╪════════════╪════════╪════════╡
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1 ┆ 1.75 │
│ Chloe Cooper ┆ 1983-03-22 ┆ 53.6 ┆ 1.65 │
└────────────────┴────────────┴────────┴────────┘
Describe
You can also use describe
to compute summary statistics for all columns of your dataframe:
print(df.describe())
describe
· Available on feature describe
// Not available in Rust
shape: (9, 5)
┌────────────┬────────────────┬─────────────────────┬───────────┬──────────┐
│ statistic ┆ name ┆ birthdate ┆ weight ┆ height │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ f64 │
╞════════════╪════════════════╪═════════════════════╪═══════════╪══════════╡
│ count ┆ 4 ┆ 4 ┆ 4.0 ┆ 4.0 │
│ null_count ┆ 0 ┆ 0 ┆ 0.0 ┆ 0.0 │
│ mean ┆ null ┆ 1986-09-04 00:00:00 ┆ 66.775 ┆ 1.6825 │
│ std ┆ null ┆ null ┆ 13.560082 ┆ 0.097082 │
│ min ┆ Alice Archer ┆ 1981-04-30 ┆ 53.6 ┆ 1.56 │
│ 25% ┆ null ┆ 1983-03-22 ┆ 57.9 ┆ 1.65 │
│ 50% ┆ null ┆ 1985-02-15 ┆ 72.5 ┆ 1.75 │
│ 75% ┆ null ┆ 1985-02-15 ┆ 72.5 ┆ 1.75 │
│ max ┆ Daniel Donovan ┆ 1997-01-10 ┆ 83.1 ┆ 1.77 │
└────────────┴────────────────┴─────────────────────┴───────────┴──────────┘
Schema
When talking about data (in a dataframe or otherwise) we can refer to its schema. The schema is a mapping of column or series names to the data types of those same columns or series.
You can check the schema of a dataframe with schema
:
print(df.schema)
println!("{:?}", df.schema());
Schema({'name': String, 'birthdate': Date, 'weight': Float64, 'height': Float64})
Much like with series, Polars will infer the schema of a dataframe when you create it but you can override the inference system if needed.
In Python, you can specify an explicit schema by using a dictionary to map column names to data
types. You can use the value None
if you do not wish to override inference for a given column:
df = pl.DataFrame(
{
"name": ["Alice", "Ben", "Chloe", "Daniel"],
"age": [27, 39, 41, 43],
},
schema={"name": None, "age": pl.UInt8},
)
print(df)
shape: (4, 2)
┌────────┬─────┐
│ name ┆ age │
│ --- ┆ --- │
│ str ┆ u8 │
╞════════╪═════╡
│ Alice ┆ 27 │
│ Ben ┆ 39 │
│ Chloe ┆ 41 │
│ Daniel ┆ 43 │
└────────┴─────┘
If you only need to override the inference of some columns, the parameter schema_overrides
tends
to be more convenient because it lets you omit columns for which you do not want to override the
inference:
df = pl.DataFrame(
{
"name": ["Alice", "Ben", "Chloe", "Daniel"],
"age": [27, 39, 41, 43],
},
schema_overrides={"age": pl.UInt8},
)
print(df)
shape: (4, 2)
┌────────┬─────┐
│ name ┆ age │
│ --- ┆ --- │
│ str ┆ u8 │
╞════════╪═════╡
│ Alice ┆ 27 │
│ Ben ┆ 39 │
│ Chloe ┆ 41 │
│ Daniel ┆ 43 │
└────────┴─────┘
Data types internals
Polars utilizes the Arrow Columnar Format for its data orientation. Following this specification allows Polars to transfer data to/from other tools that also use the Arrow specification with little to no overhead.
Polars gets most of its performance from its query engine, the optimizations it performs on your query plans, and from the parallelization that it employs when running your expressions.
Floating point numbers
Polars generally follows the IEEE 754 floating point standard for Float32
and Float64
, with some
exceptions:
- Any
NaN
compares equal to any otherNaN
, and greater than any non-NaN
value. - Operations do not guarantee any particular behavior on the sign of zero or
NaN
, nor on the payload ofNaN
values. This is not just limited to arithmetic operations, e.g. a sort or group by operation may canonicalize all zeroes to +0 and allNaN
s to a positiveNaN
without payload for efficient equality checks.
Polars always attempts to provide reasonably accurate results for floating point computations but does not provide guarantees on the error unless mentioned otherwise. Generally speaking 100% accurate results are infeasibly expensive to achieve (requiring much larger internal representations than 64-bit floats), and thus some error is always to be expected.
Appendix: full data types table
Type(s) | Details |
---|---|
Boolean |
Boolean type that is bit packed efficiently. |
Int8 , Int16 , Int32 , Int64 |
Varying-precision signed integer types. |
UInt8 , UInt16 , UInt32 , UInt64 |
Varying-precision unsigned integer types. |
Float32 , Float64 |
Varying-precision signed floating point numbers. |
Decimal |
Decimal 128-bit type with optional precision and non-negative scale. Use this if you need fine-grained control over the precision of your floats and the operations you make on them. See Python's decimal.Decimal for documentation on what a decimal data type is. |
String |
Variable length UTF-8 encoded string data, typically Human-readable. |
Binary |
Stores arbitrary, varying length raw binary data. |
Date |
Represents a calendar date. |
Time |
Represents a time of day. |
Datetime |
Represents a calendar date and time of day. |
Duration |
Represents a time duration. |
Array |
Arrays with a known, fixed shape per series; akin to numpy arrays. Learn more about how arrays and lists differ and how to work with both. |
List |
Homogeneous 1D container with variable length. Learn more about how arrays and lists differ and how to work with both. |
Object |
Wraps arbitrary Python objects. |
Categorical |
Efficient encoding of string data where the categories are inferred at runtime. Learn more about how categoricals and enums differ and how to work with both. |
Enum |
Efficient ordered encoding of a set of predetermined string categories. Learn more about how categoricals and enums differ and how to work with both. |
Struct |
Composite product type that can store multiple fields. Learn more about the data type Struct in its dedicated documentation section.. |
Null |
Represents null values. |