Skip to content

Casting

Casting converts the underlying data type of a column to a new one. Casting is available through the function cast.

The function cast includes a parameter strict that determines how Polars behaves when it encounters a value that cannot be converted from the source data type to the target data type. The default behaviour is strict=True, which means that Polars will thrown an error to notify the user of the failed conversion while also providing details on the values that couldn't be cast. On the other hand, if strict=False, any values that cannot be converted to the target data type will be quietly converted to null.

Basic example

Let's take a look at the following dataframe which contains both integers and floating point numbers:

import polars as pl

df = pl.DataFrame(
    {
        "integers": [1, 2, 3],
        "big_integers": [10000002, 2, 30000003],
        "floats": [4.0, 5.8, -6.3],
    }
)

print(df)
use polars::prelude::*;

let df = df! (
    "integers"=> [1, 2, 3],
    "big_integers"=> [10000002, 2, 30000003],
    "floats"=> [4.0, 5.8, -6.3],
)?;

println!("{}", df);
shape: (3, 3)
┌──────────┬──────────────┬────────┐
│ integers ┆ big_integers ┆ floats │
│ ---      ┆ ---          ┆ ---    │
│ i64      ┆ i64          ┆ f64    │
╞══════════╪══════════════╪════════╡
│ 1        ┆ 10000002     ┆ 4.0    │
│ 2        ┆ 2            ┆ 5.8    │
│ 3        ┆ 30000003     ┆ -6.3   │
└──────────┴──────────────┴────────┘

To perform casting operations between floats and integers, or vice versa, we use the function cast:

cast

result = df.select(
    pl.col("integers").cast(pl.Float32).alias("integers_as_floats"),
    pl.col("floats").cast(pl.Int32).alias("floats_as_integers"),
)
print(result)

cast

let result = df
    .clone()
    .lazy()
    .select([
        col("integers")
            .cast(DataType::Float32)
            .alias("integers_as_floats"),
        col("floats")
            .cast(DataType::Int32)
            .alias("floats_as_integers"),
    ])
    .collect()?;
println!("{}", result);

shape: (3, 2)
┌────────────────────┬────────────────────┐
│ integers_as_floats ┆ floats_as_integers │
│ ---                ┆ ---                │
│ f32                ┆ i32                │
╞════════════════════╪════════════════════╡
│ 1.0                ┆ 4                  │
│ 2.0                ┆ 5                  │
│ 3.0                ┆ -6                 │
└────────────────────┴────────────────────┘

Note that floating point numbers are truncated when casting to an integer data type.

Downcasting numerical data types

You can reduce the memory footprint of a column by changing the precision associated with its numeric data type. As an illustration, the code below demonstrates how casting from Int64 to Int16 and from Float64 to Float32 can be used to lower memory usage:

cast · estimated_size

print(f"Before downcasting: {df.estimated_size()} bytes")
result = df.with_columns(
    pl.col("integers").cast(pl.Int16),
    pl.col("floats").cast(pl.Float32),
)
print(f"After downcasting: {result.estimated_size()} bytes")

cast · estimated_size

println!("Before downcasting: {} bytes", df.estimated_size());
let result = df
    .clone()
    .lazy()
    .with_columns([
        col("integers").cast(DataType::Int16),
        col("floats").cast(DataType::Float32),
    ])
    .collect()?;
println!("After downcasting: {} bytes", result.estimated_size());

Before downcasting: 72 bytes
After downcasting: 42 bytes

When performing downcasting it is crucial to ensure that the chosen number of bits (such as 64, 32, or 16) is sufficient to accommodate the largest and smallest numbers in the column. For example, a 32-bit signed integer (Int32) represents integers between -2147483648 and 2147483647, inclusive, while an 8-bit signed integer only represents integers between -128 and 127, inclusive. Attempting to downcast to a data type with insufficient precision results in an error thrown by Polars:

cast

from polars.exceptions import InvalidOperationError

try:
    result = df.select(pl.col("big_integers").cast(pl.Int8))
    print(result)
except InvalidOperationError as err:
    print(err)

cast

let result = df
    .clone()
    .lazy()
    .select([col("big_integers").strict_cast(DataType::Int8)])
    .collect();
if let Err(e) = result {
    println!("{}", e)
};

conversion from `i64` to `i8` failed in column 'big_integers' for 2 out of 3 values: [10000002, 30000003]

If you set the parameter strict to False the overflowing/underflowing values are converted to null:

cast

result = df.select(pl.col("big_integers").cast(pl.Int8, strict=False))
print(result)

cast

let result = df
    .clone()
    .lazy()
    .select([col("big_integers").cast(DataType::Int8)])
    .collect()?;
println!("{}", result);

shape: (3, 1)
┌──────────────┐
│ big_integers │
│ ---          │
│ i8           │
╞══════════════╡
│ null         │
│ 2            │
│ null         │
└──────────────┘

Converting strings to numeric data types

Strings that represent numbers can be converted to the appropriate data types via casting. The opposite conversion is also possible:

cast

df = pl.DataFrame(
    {
        "integers_as_strings": ["1", "2", "3"],
        "floats_as_strings": ["4.0", "5.8", "-6.3"],
        "floats": [4.0, 5.8, -6.3],
    }
)

result = df.select(
    pl.col("integers_as_strings").cast(pl.Int32),
    pl.col("floats_as_strings").cast(pl.Float64),
    pl.col("floats").cast(pl.String),
)
print(result)

cast

let df = df! (
    "integers_as_strings" => ["1", "2", "3"],
    "floats_as_strings" => ["4.0", "5.8", "-6.3"],
    "floats" => [4.0, 5.8, -6.3],
)?;

let result = df
    .clone()
    .lazy()
    .select([
        col("integers_as_strings").cast(DataType::Int32),
        col("floats_as_strings").cast(DataType::Float64),
        col("floats").cast(DataType::String),
    ])
    .collect()?;
println!("{}", result);

shape: (3, 3)
┌─────────────────────┬───────────────────┬────────┐
│ integers_as_strings ┆ floats_as_strings ┆ floats │
│ ---                 ┆ ---               ┆ ---    │
│ i32                 ┆ f64               ┆ str    │
╞═════════════════════╪═══════════════════╪════════╡
│ 1                   ┆ 4.0               ┆ 4.0    │
│ 2                   ┆ 5.8               ┆ 5.8    │
│ 3                   ┆ -6.3              ┆ -6.3   │
└─────────────────────┴───────────────────┴────────┘

In case the column contains a non-numerical value, or a poorly formatted one, Polars will throw an error with details on the conversion error. You can set strict=False to circumvent the error and get a null value instead.

cast

df = pl.DataFrame(
    {
        "floats": ["4.0", "5.8", "- 6 . 3"],
    }
)
try:
    result = df.select(pl.col("floats").cast(pl.Float64))
except InvalidOperationError as err:
    print(err)

cast

let df = df! ("floats" => ["4.0", "5.8", "- 6 . 3"])?;

let result = df
    .clone()
    .lazy()
    .select([col("floats").strict_cast(DataType::Float64)])
    .collect();
if let Err(e) = result {
    println!("{}", e)
};

conversion from `str` to `f64` failed in column 'floats' for 1 out of 3 values: ["- 6 . 3"]

Booleans

Booleans can be expressed as either 1 (True) or 0 (False). It's possible to perform casting operations between a numerical data type and a Boolean, and vice versa.

When converting numbers to Booleans, the number 0 is converted to False and all other numbers are converted to True, in alignment with Python's Truthy and Falsy values for numbers:

cast

df = pl.DataFrame(
    {
        "integers": [-1, 0, 2, 3, 4],
        "floats": [0.0, 1.0, 2.0, 3.0, 4.0],
        "bools": [True, False, True, False, True],
    }
)

result = df.select(
    pl.col("integers").cast(pl.Boolean),
    pl.col("floats").cast(pl.Boolean),
    pl.col("bools").cast(pl.Int8),
)
print(result)

cast

let df = df! (
        "integers"=> [-1, 0, 2, 3, 4],
        "floats"=> [0.0, 1.0, 2.0, 3.0, 4.0],
        "bools"=> [true, false, true, false, true],
)?;

let result = df
    .clone()
    .lazy()
    .select([
        col("integers").cast(DataType::Boolean),
        col("floats").cast(DataType::Boolean),
        col("bools").cast(DataType::UInt8),
    ])
    .collect()?;
println!("{}", result);

shape: (5, 3)
┌──────────┬────────┬───────┐
│ integers ┆ floats ┆ bools │
│ ---      ┆ ---    ┆ ---   │
│ bool     ┆ bool   ┆ i8    │
╞══════════╪════════╪═══════╡
│ true     ┆ false  ┆ 1     │
│ false    ┆ true   ┆ 0     │
│ true     ┆ true   ┆ 1     │
│ true     ┆ true   ┆ 0     │
│ true     ┆ true   ┆ 1     │
└──────────┴────────┴───────┘

Parsing / formatting temporal data types

All temporal data types are represented internally as the number of time units elapsed since a reference moment, usually referred to as the epoch. For example, values of the data type Date are stored as the number of days since the epoch. For the data type Datetime the time unit is the microsecond (us) and for Time the time unit is the nanosecond (ns).

Casting between numerical types and temporal data types is allowed and exposes this relationship:

cast

from datetime import date, datetime, time

df = pl.DataFrame(
    {
        "date": [
            date(1970, 1, 1),  # epoch
            date(1970, 1, 10),  # 9 days later
        ],
        "datetime": [
            datetime(1970, 1, 1, 0, 0, 0),  # epoch
            datetime(1970, 1, 1, 0, 1, 0),  # 1 minute later
        ],
        "time": [
            time(0, 0, 0),  # reference time
            time(0, 0, 1),  # 1 second later
        ],
    }
)

result = df.select(
    pl.col("date").cast(pl.Int64).alias("days_since_epoch"),
    pl.col("datetime").cast(pl.Int64).alias("us_since_epoch"),
    pl.col("time").cast(pl.Int64).alias("ns_since_midnight"),
)
print(result)

cast

use chrono::prelude::*;

let df = df!(
    "date" => [
        NaiveDate::from_ymd_opt(1970, 1, 1).unwrap(),  // epoch
        NaiveDate::from_ymd_opt(1970, 1, 10).unwrap(),  // 9 days later
    ],
    "datetime" => [
        NaiveDate::from_ymd_opt(1970, 1, 1).unwrap().and_hms_opt(0, 0, 0).unwrap(),  // epoch
        NaiveDate::from_ymd_opt(1970, 1, 1).unwrap().and_hms_opt(0, 1, 0).unwrap(),  // 1 minute later
    ],
    "time" => [
        NaiveTime::from_hms_opt(0, 0, 0).unwrap(),  // reference time
        NaiveTime::from_hms_opt(0, 0, 1).unwrap(),  // 1 second later
    ]
)
.unwrap()
.lazy()
// Make the time unit match that of Python's for the same results.
.with_column(col("datetime").cast(DataType::Datetime(TimeUnit::Microseconds, None)))
.collect()?;

let result = df
    .clone()
    .lazy()
    .select([
        col("date").cast(DataType::Int64).alias("days_since_epoch"),
        col("datetime")
            .cast(DataType::Int64)
            .alias("us_since_epoch"),
        col("time").cast(DataType::Int64).alias("ns_since_midnight"),
    ])
    .collect()?;
println!("{}", result);

shape: (2, 3)
┌──────────────────┬────────────────┬───────────────────┐
│ days_since_epoch ┆ us_since_epoch ┆ ns_since_midnight │
│ ---              ┆ ---            ┆ ---               │
│ i64              ┆ i64            ┆ i64               │
╞══════════════════╪════════════════╪═══════════════════╡
│ 0                ┆ 0              ┆ 0                 │
│ 9                ┆ 60000000       ┆ 1000000000        │
└──────────────────┴────────────────┴───────────────────┘

To format temporal data types as strings we can use the function dt.to_string and to parse temporal data types from strings we can use the function str.to_datetime. Both functions adopt the chrono format syntax for formatting.

dt.to_string · str.to_date

df = pl.DataFrame(
    {
        "date": [date(2022, 1, 1), date(2022, 1, 2)],
        "string": ["2022-01-01", "2022-01-02"],
    }
)

result = df.select(
    pl.col("date").dt.to_string("%Y-%m-%d"),
    pl.col("string").str.to_datetime("%Y-%m-%d"),
)
print(result)

dt.to_string · str.replace_all · Available on feature temporal · Available on feature dtype-date

let df = df! (
        "date" => [
            NaiveDate::from_ymd_opt(2022, 1, 1).unwrap(),
            NaiveDate::from_ymd_opt(2022, 1, 2).unwrap(),
        ],
        "string" => [
            "2022-01-01",
            "2022-01-02",
        ],
)?;

let result = df
    .clone()
    .lazy()
    .select([
        col("date").dt().to_string("%Y-%m-%d"),
        col("string").str().to_datetime(
            Some(TimeUnit::Microseconds),
            None,
            StrptimeOptions::default(),
            lit("raise"),
        ),
    ])
    .collect()?;
println!("{}", result);

shape: (2, 2)
┌────────────┬─────────────────────┐
│ date       ┆ string              │
│ ---        ┆ ---                 │
│ str        ┆ datetime[μs]        │
╞════════════╪═════════════════════╡
│ 2022-01-01 ┆ 2022-01-01 00:00:00 │
│ 2022-01-02 ┆ 2022-01-02 00:00:00 │
└────────────┴─────────────────────┘

It's worth noting that str.to_datetime features additional options that support timezone functionality. Refer to the API documentation for further information.