Casting
Casting converts the underlying data type of a column to
a new one. Casting is available through the function cast
.
The function cast
includes a parameter strict
that determines how Polars behaves when it
encounters a value that cannot be converted from the source data type to the target data type. The
default behaviour is strict=True
, which means that Polars will thrown an error to notify the user
of the failed conversion while also providing details on the values that couldn't be cast. On the
other hand, if strict=False
, any values that cannot be converted to the target data type will be
quietly converted to null
.
Basic example
Let's take a look at the following dataframe which contains both integers and floating point numbers:
import polars as pl
df = pl.DataFrame(
{
"integers": [1, 2, 3],
"big_integers": [10000002, 2, 30000003],
"floats": [4.0, 5.8, -6.3],
}
)
print(df)
use polars::prelude::*;
let df = df! (
"integers"=> [1, 2, 3],
"big_integers"=> [10000002, 2, 30000003],
"floats"=> [4.0, 5.8, -6.3],
)?;
println!("{}", df);
shape: (3, 3)
┌──────────┬──────────────┬────────┐
│ integers ┆ big_integers ┆ floats │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 │
╞══════════╪══════════════╪════════╡
│ 1 ┆ 10000002 ┆ 4.0 │
│ 2 ┆ 2 ┆ 5.8 │
│ 3 ┆ 30000003 ┆ -6.3 │
└──────────┴──────────────┴────────┘
To perform casting operations between floats and integers, or vice versa, we use the function
cast
:
result = df.select(
pl.col("integers").cast(pl.Float32).alias("integers_as_floats"),
pl.col("floats").cast(pl.Int32).alias("floats_as_integers"),
)
print(result)
let result = df
.clone()
.lazy()
.select([
col("integers")
.cast(DataType::Float32)
.alias("integers_as_floats"),
col("floats")
.cast(DataType::Int32)
.alias("floats_as_integers"),
])
.collect()?;
println!("{}", result);
shape: (3, 2)
┌────────────────────┬────────────────────┐
│ integers_as_floats ┆ floats_as_integers │
│ --- ┆ --- │
│ f32 ┆ i32 │
╞════════════════════╪════════════════════╡
│ 1.0 ┆ 4 │
│ 2.0 ┆ 5 │
│ 3.0 ┆ -6 │
└────────────────────┴────────────────────┘
Note that floating point numbers are truncated when casting to an integer data type.
Downcasting numerical data types
You can reduce the memory footprint of a column by changing the precision associated with its
numeric data type. As an illustration, the code below demonstrates how casting from Int64
to
Int16
and from Float64
to Float32
can be used to lower memory usage:
print(f"Before downcasting: {df.estimated_size()} bytes")
result = df.with_columns(
pl.col("integers").cast(pl.Int16),
pl.col("floats").cast(pl.Float32),
)
print(f"After downcasting: {result.estimated_size()} bytes")
println!("Before downcasting: {} bytes", df.estimated_size());
let result = df
.clone()
.lazy()
.with_columns([
col("integers").cast(DataType::Int16),
col("floats").cast(DataType::Float32),
])
.collect()?;
println!("After downcasting: {} bytes", result.estimated_size());
Before downcasting: 72 bytes
After downcasting: 42 bytes
When performing downcasting it is crucial to ensure that the chosen number of bits (such as 64, 32,
or 16) is sufficient to accommodate the largest and smallest numbers in the column. For example, a
32-bit signed integer (Int32
) represents integers between -2147483648 and 2147483647, inclusive,
while an 8-bit signed integer only represents integers between -128 and 127, inclusive. Attempting
to downcast to a data type with insufficient precision results in an error thrown by Polars:
from polars.exceptions import InvalidOperationError
try:
result = df.select(pl.col("big_integers").cast(pl.Int8))
print(result)
except InvalidOperationError as err:
print(err)
let result = df
.clone()
.lazy()
.select([col("big_integers").strict_cast(DataType::Int8)])
.collect();
if let Err(e) = result {
println!("{}", e)
};
conversion from `i64` to `i8` failed in column 'big_integers' for 2 out of 3 values: [10000002, 30000003]
If you set the parameter strict
to False
the overflowing/underflowing values are converted to
null
:
shape: (3, 1)
┌──────────────┐
│ big_integers │
│ --- │
│ i8 │
╞══════════════╡
│ null │
│ 2 │
│ null │
└──────────────┘
Converting strings to numeric data types
Strings that represent numbers can be converted to the appropriate data types via casting. The opposite conversion is also possible:
df = pl.DataFrame(
{
"integers_as_strings": ["1", "2", "3"],
"floats_as_strings": ["4.0", "5.8", "-6.3"],
"floats": [4.0, 5.8, -6.3],
}
)
result = df.select(
pl.col("integers_as_strings").cast(pl.Int32),
pl.col("floats_as_strings").cast(pl.Float64),
pl.col("floats").cast(pl.String),
)
print(result)
let df = df! (
"integers_as_strings" => ["1", "2", "3"],
"floats_as_strings" => ["4.0", "5.8", "-6.3"],
"floats" => [4.0, 5.8, -6.3],
)?;
let result = df
.clone()
.lazy()
.select([
col("integers_as_strings").cast(DataType::Int32),
col("floats_as_strings").cast(DataType::Float64),
col("floats").cast(DataType::String),
])
.collect()?;
println!("{}", result);
shape: (3, 3)
┌─────────────────────┬───────────────────┬────────┐
│ integers_as_strings ┆ floats_as_strings ┆ floats │
│ --- ┆ --- ┆ --- │
│ i32 ┆ f64 ┆ str │
╞═════════════════════╪═══════════════════╪════════╡
│ 1 ┆ 4.0 ┆ 4.0 │
│ 2 ┆ 5.8 ┆ 5.8 │
│ 3 ┆ -6.3 ┆ -6.3 │
└─────────────────────┴───────────────────┴────────┘
In case the column contains a non-numerical value, or a poorly formatted one, Polars will throw an
error with details on the conversion error. You can set strict=False
to circumvent the error and
get a null
value instead.
df = pl.DataFrame(
{
"floats": ["4.0", "5.8", "- 6 . 3"],
}
)
try:
result = df.select(pl.col("floats").cast(pl.Float64))
except InvalidOperationError as err:
print(err)
let df = df! ("floats" => ["4.0", "5.8", "- 6 . 3"])?;
let result = df
.clone()
.lazy()
.select([col("floats").strict_cast(DataType::Float64)])
.collect();
if let Err(e) = result {
println!("{}", e)
};
conversion from `str` to `f64` failed in column 'floats' for 1 out of 3 values: ["- 6 . 3"]
Booleans
Booleans can be expressed as either 1 (True
) or 0 (False
). It's possible to perform casting
operations between a numerical data type and a Boolean, and vice versa.
When converting numbers to Booleans, the number 0 is converted to False
and all other numbers are
converted to True
, in alignment with Python's Truthy and Falsy values for numbers:
df = pl.DataFrame(
{
"integers": [-1, 0, 2, 3, 4],
"floats": [0.0, 1.0, 2.0, 3.0, 4.0],
"bools": [True, False, True, False, True],
}
)
result = df.select(
pl.col("integers").cast(pl.Boolean),
pl.col("floats").cast(pl.Boolean),
pl.col("bools").cast(pl.Int8),
)
print(result)
let df = df! (
"integers"=> [-1, 0, 2, 3, 4],
"floats"=> [0.0, 1.0, 2.0, 3.0, 4.0],
"bools"=> [true, false, true, false, true],
)?;
let result = df
.clone()
.lazy()
.select([
col("integers").cast(DataType::Boolean),
col("floats").cast(DataType::Boolean),
col("bools").cast(DataType::UInt8),
])
.collect()?;
println!("{}", result);
shape: (5, 3)
┌──────────┬────────┬───────┐
│ integers ┆ floats ┆ bools │
│ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ i8 │
╞══════════╪════════╪═══════╡
│ true ┆ false ┆ 1 │
│ false ┆ true ┆ 0 │
│ true ┆ true ┆ 1 │
│ true ┆ true ┆ 0 │
│ true ┆ true ┆ 1 │
└──────────┴────────┴───────┘
Parsing / formatting temporal data types
All temporal data types are represented internally as the number of time units elapsed since a
reference moment, usually referred to as the epoch. For example, values of the data type Date
are
stored as the number of days since the epoch. For the data type Datetime
the time unit is the
microsecond (us) and for Time
the time unit is the nanosecond (ns).
Casting between numerical types and temporal data types is allowed and exposes this relationship:
from datetime import date, datetime, time
df = pl.DataFrame(
{
"date": [
date(1970, 1, 1), # epoch
date(1970, 1, 10), # 9 days later
],
"datetime": [
datetime(1970, 1, 1, 0, 0, 0), # epoch
datetime(1970, 1, 1, 0, 1, 0), # 1 minute later
],
"time": [
time(0, 0, 0), # reference time
time(0, 0, 1), # 1 second later
],
}
)
result = df.select(
pl.col("date").cast(pl.Int64).alias("days_since_epoch"),
pl.col("datetime").cast(pl.Int64).alias("us_since_epoch"),
pl.col("time").cast(pl.Int64).alias("ns_since_midnight"),
)
print(result)
use chrono::prelude::*;
let df = df!(
"date" => [
NaiveDate::from_ymd_opt(1970, 1, 1).unwrap(), // epoch
NaiveDate::from_ymd_opt(1970, 1, 10).unwrap(), // 9 days later
],
"datetime" => [
NaiveDate::from_ymd_opt(1970, 1, 1).unwrap().and_hms_opt(0, 0, 0).unwrap(), // epoch
NaiveDate::from_ymd_opt(1970, 1, 1).unwrap().and_hms_opt(0, 1, 0).unwrap(), // 1 minute later
],
"time" => [
NaiveTime::from_hms_opt(0, 0, 0).unwrap(), // reference time
NaiveTime::from_hms_opt(0, 0, 1).unwrap(), // 1 second later
]
)
.unwrap()
.lazy()
// Make the time unit match that of Python's for the same results.
.with_column(col("datetime").cast(DataType::Datetime(TimeUnit::Microseconds, None)))
.collect()?;
let result = df
.clone()
.lazy()
.select([
col("date").cast(DataType::Int64).alias("days_since_epoch"),
col("datetime")
.cast(DataType::Int64)
.alias("us_since_epoch"),
col("time").cast(DataType::Int64).alias("ns_since_midnight"),
])
.collect()?;
println!("{}", result);
shape: (2, 3)
┌──────────────────┬────────────────┬───────────────────┐
│ days_since_epoch ┆ us_since_epoch ┆ ns_since_midnight │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞══════════════════╪════════════════╪═══════════════════╡
│ 0 ┆ 0 ┆ 0 │
│ 9 ┆ 60000000 ┆ 1000000000 │
└──────────────────┴────────────────┴───────────────────┘
To format temporal data types as strings we can use the function dt.to_string
and to parse
temporal data types from strings we can use the function str.to_datetime
. Both functions adopt the
chrono format syntax for
formatting.
df = pl.DataFrame(
{
"date": [date(2022, 1, 1), date(2022, 1, 2)],
"string": ["2022-01-01", "2022-01-02"],
}
)
result = df.select(
pl.col("date").dt.to_string("%Y-%m-%d"),
pl.col("string").str.to_datetime("%Y-%m-%d"),
)
print(result)
dt.to_string
· str.replace_all
· Available on feature temporal · Available on feature dtype-date
let df = df! (
"date" => [
NaiveDate::from_ymd_opt(2022, 1, 1).unwrap(),
NaiveDate::from_ymd_opt(2022, 1, 2).unwrap(),
],
"string" => [
"2022-01-01",
"2022-01-02",
],
)?;
let result = df
.clone()
.lazy()
.select([
col("date").dt().to_string("%Y-%m-%d"),
col("string").str().to_datetime(
Some(TimeUnit::Microseconds),
None,
StrptimeOptions::default(),
lit("raise"),
),
])
.collect()?;
println!("{}", result);
shape: (2, 2)
┌────────────┬─────────────────────┐
│ date ┆ string │
│ --- ┆ --- │
│ str ┆ datetime[μs] │
╞════════════╪═════════════════════╡
│ 2022-01-01 ┆ 2022-01-01 00:00:00 │
│ 2022-01-02 ┆ 2022-01-02 00:00:00 │
└────────────┴─────────────────────┘
It's worth noting that str.to_datetime
features additional options that support timezone
functionality. Refer to the API documentation for further information.