Skip to content

Missing data

This section of the user guide teaches how to work with missing data in Polars.

null and NaN values

In Polars, missing data is represented by the value null. This missing value null is used for all data types, including numerical types.

Polars also supports the value NaN (“Not a Number”) for columns with floating point numbers. The value NaN is considered to be a valid floating point value, which is different from missing data. We discuss the value NaN separately below.

When creating a series or a dataframe, you can set a value to null by using the appropriate construct for your language:

DataFrame

import polars as pl

df = pl.DataFrame(
    {
        "value": [1, None],
    },
)
print(df)

DataFrame

use polars::prelude::*;
let df = df! (
    "value" => &[Some(1), None],
)?;

println!("{}", df);

shape: (2, 1)
┌───────┐
│ value │
│ ---   │
│ i64   │
╞═══════╡
│ 1     │
│ null  │
└───────┘

Difference from pandas

In pandas, the value used to represent missing data depends on the data type of the column. In Polars, missing data is always represented by the value null.

Missing data metadata

Polars keeps track of some metadata regarding the missing data of each series. This metadata allows Polars to answer some basic queries about missing values in a very efficient way, namely how many values are missing and which ones are missing.

To determine how many values are missing from a column you can use the function null_count:

null_count

null_count_df = df.null_count()
print(null_count_df)

null_count

let null_count_df = df.null_count();
println!("{}", null_count_df);

shape: (1, 1)
┌───────┐
│ value │
│ ---   │
│ u32   │
╞═══════╡
│ 1     │
└───────┘

The function null_count can be called on a dataframe, a column from a dataframe, or on a series directly. The function null_count is a cheap operation because the result is already known.

Polars uses something called a “validity bitmap” to know which values are missing in a series. The validity bitmap is memory efficient as it is bit encoded. If a series has length \(n\), then its validity bitmap will cost \(n / 8\) bytes. The function is_null uses the validity bitmap to efficiently report which values are null and which are not:

is_null

is_null_series = df.select(
    pl.col("value").is_null(),
)
print(is_null_series)

is_null

let is_null_series = df
    .clone()
    .lazy()
    .select([col("value").is_null()])
    .collect()?;
println!("{}", is_null_series);

shape: (2, 1)
┌───────┐
│ value │
│ ---   │
│ bool  │
╞═══════╡
│ false │
│ true  │
└───────┘

The function is_null can be used on a column of a dataframe or on a series directly. Again, this is a cheap operation because the result is already known by Polars.

Why does Polars waste memory on a validity bitmap?

It all comes down to a tradeoff. By using a bit more memory per column, Polars can be much more efficient when performing most operations on your columns. If the validity bitmap wasn't known, every time you wanted to compute something you would have to check each position of the series to see if a legal value was present or not. With the validity bitmap, Polars knows automatically the positions where your operations can be applied.

Filling missing data

Missing data in a series can be filled with the function fill_null. You can specify how missing data is effectively filled in a couple of different ways:

  • a literal of the correct data type;
  • a Polars expression, such as replacing with values computed from another column;
  • a strategy based on neighbouring values, such as filling forwards or backwards; and
  • interpolation.

To illustrate how each of these methods work we start by defining a simple dataframe with two missing values in the second column:

DataFrame

df = pl.DataFrame(
    {
        "col1": [0.5, 1, 1.5, 2, 2.5],
        "col2": [1, None, 3, None, 5],
    },
)
print(df)

DataFrame

let df = df! (
    "col1" => [0.5, 1.0, 1.5, 2.0, 2.5],
    "col2" => [Some(1), None, Some(3), None, Some(5)],
)?;

println!("{}", df);

shape: (5, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f64  ┆ i64  │
╞══════╪══════╡
│ 0.5  ┆ 1    │
│ 1.0  ┆ null │
│ 1.5  ┆ 3    │
│ 2.0  ┆ null │
│ 2.5  ┆ 5    │
└──────┴──────┘

Fill with a specified literal value

You can fill the missing data with a specified literal value. This literal value will replace all of the occurrences of the value null:

fill_null

fill_literal_df = df.with_columns(
    pl.col("col2").fill_null(3),
)
print(fill_literal_df)

fill_null

let fill_literal_df = df
    .clone()
    .lazy()
    .with_column(col("col2").fill_null(3))
    .collect()?;

println!("{}", fill_literal_df);

shape: (5, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f64  ┆ i64  │
╞══════╪══════╡
│ 0.5  ┆ 1    │
│ 1.0  ┆ 3    │
│ 1.5  ┆ 3    │
│ 2.0  ┆ 3    │
│ 2.5  ┆ 5    │
└──────┴──────┘

However, this is actually just a special case of the general case where the function fill_null replaces missing values with the corresponding values from the result of a Polars expression, as seen next.

Fill with an expression

In the general case, the missing data can be filled by extracting the corresponding values from the result of a general Polars expression. For example, we can fill the second column with values taken from the double of the first column:

fill_null

fill_expression_df = df.with_columns(
    pl.col("col2").fill_null((2 * pl.col("col1")).cast(pl.Int64)),
)
print(fill_expression_df)

fill_null

let fill_expression_df = df
    .clone()
    .lazy()
    .with_column(col("col2").fill_null((lit(2) * col("col1")).cast(DataType::Int64)))
    .collect()?;

println!("{}", fill_expression_df);

shape: (5, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f64  ┆ i64  │
╞══════╪══════╡
│ 0.5  ┆ 1    │
│ 1.0  ┆ 2    │
│ 1.5  ┆ 3    │
│ 2.0  ┆ 4    │
│ 2.5  ┆ 5    │
└──────┴──────┘

Fill with a strategy based on neighbouring values

You can also fill the missing data by following a fill strategy based on the neighbouring values. The two simpler strategies look for the first non-null value that comes immediately before or immediately after the value null that is being filled:

fill_null

fill_forward_df = df.with_columns(
    pl.col("col2").fill_null(strategy="forward").alias("forward"),
    pl.col("col2").fill_null(strategy="backward").alias("backward"),
)
print(fill_forward_df)

fill_null

let fill_literal_df = df
    .clone()
    .lazy()
    .with_columns([
        col("col2")
            .fill_null_with_strategy(FillNullStrategy::Forward(None))
            .alias("forward"),
        col("col2")
            .fill_null_with_strategy(FillNullStrategy::Backward(None))
            .alias("backward"),
    ])
    .collect()?;

println!("{}", fill_literal_df);

shape: (5, 4)
┌──────┬──────┬─────────┬──────────┐
│ col1 ┆ col2 ┆ forward ┆ backward │
│ ---  ┆ ---  ┆ ---     ┆ ---      │
│ f64  ┆ i64  ┆ i64     ┆ i64      │
╞══════╪══════╪═════════╪══════════╡
│ 0.5  ┆ 1    ┆ 1       ┆ 1        │
│ 1.0  ┆ null ┆ 1       ┆ 3        │
│ 1.5  ┆ 3    ┆ 3       ┆ 3        │
│ 2.0  ┆ null ┆ 3       ┆ 5        │
│ 2.5  ┆ 5    ┆ 5       ┆ 5        │
└──────┴──────┴─────────┴──────────┘

You can find other fill strategies in the API docs.

Fill with interpolation

Additionally, you can fill missing data with interpolation by using the function interpolate instead of the function fill_null:

interpolate

fill_interpolation_df = df.with_columns(
    pl.col("col2").interpolate(),
)
print(fill_interpolation_df)

interpolate

let fill_interpolation_df = df
    .clone()
    .lazy()
    .with_column(col("col2").interpolate(InterpolationMethod::Linear))
    .collect()?;

println!("{}", fill_interpolation_df);

shape: (5, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f64  ┆ f64  │
╞══════╪══════╡
│ 0.5  ┆ 1.0  │
│ 1.0  ┆ 2.0  │
│ 1.5  ┆ 3.0  │
│ 2.0  ┆ 4.0  │
│ 2.5  ┆ 5.0  │
└──────┴──────┘

Not a Number, or NaN values

Missing data in a series is represented by the value null, regardless of the data type of the series. However, in columns that have a floating point data type, the value NaN can be used. These values can be created directly:

DataFrame

import numpy as np

nan_df = pl.DataFrame(
    {
        "value": [1.0, np.nan, float("nan"), 3.0],
    },
)
print(nan_df)

DataFrame

let nan_df = df!(
    "value" => [1.0, f64::NAN, f64::NAN, 3.0],
)?;
println!("{}", nan_df);

shape: (4, 1)
┌───────┐
│ value │
│ ---   │
│ f64   │
╞═══════╡
│ 1.0   │
│ NaN   │
│ NaN   │
│ 3.0   │
└───────┘

The special value NaN might also arise as the result of a computation:

df = pl.DataFrame(
    {
        "dividend": [1, 0, -1],
        "divisor": [1, 0, -1],
    }
)
result = df.select(pl.col("dividend") / pl.col("divisor"))
print(result)
let df = df!(
    "dividend" => [1.0, 0.0, -1.0],
    "divisor" => [1.0, 0.0, -1.0],
)?;

let result = df
    .clone()
    .lazy()
    .select([col("dividend") / col("divisor")])
    .collect()?;

println!("{}", result);
shape: (3, 1)
┌──────────┐
│ dividend │
│ ---      │
│ f64      │
╞══════════╡
│ 1.0      │
│ NaN      │
│ 1.0      │
└──────────┘

Info

By default, a NaN value in an integer column causes the column to be cast to a float data type in pandas. This does not happen in Polars; instead, an exception is raised.

NaN values are considered to be a type of floating point data and are not considered to be missing data in Polars. This means:

  • NaN values are not counted with the function null_count; and
  • NaN values are filled when you use the specialised function fill_nan method but are not filled with the function fill_null.

Polars has the functions is_nan and fill_nan, which work in a similar way to the functions is_null and fill_null. Unlike with missing data, Polars does not hold any metadata regarding the NaN values, so the function is_nan entails actual computation.

One further difference between the values null and NaN is that numerical aggregating functions, like mean and sum, skip the missing values when computing the result, whereas the value NaN is considered for the computation and typically propagates into the result. If desirable, this behavior can be avoided by replacing the occurrences of the value NaN with the value null:

fill_nan

mean_nan_df = nan_df.with_columns(
    pl.col("value").fill_nan(None).alias("replaced"),
).select(
    pl.all().mean().name.suffix("_mean"),
    pl.all().sum().name.suffix("_sum"),
)
print(mean_nan_df)

fill_nan

let mean_nan_df = nan_df
    .clone()
    .lazy()
    .with_column(col("value").fill_nan(Null {}.lit()).alias("replaced"))
    .select([
        col("*").mean().name().suffix("_mean"),
        col("*").sum().name().suffix("_sum"),
    ])
    .collect()?;

println!("{}", mean_nan_df);

shape: (1, 4)
┌────────────┬───────────────┬───────────┬──────────────┐
│ value_mean ┆ replaced_mean ┆ value_sum ┆ replaced_sum │
│ ---        ┆ ---           ┆ ---       ┆ ---          │
│ f64        ┆ f64           ┆ f64       ┆ f64          │
╞════════════╪═══════════════╪═══════════╪══════════════╡
│ NaN        ┆ 2.0           ┆ NaN       ┆ 4.0          │
└────────────┴───────────────┴───────────┴──────────────┘