Missing data
This section of the user guide explains how to work with missing data in Polars.
`null` and `NaN` values
In Polars, missing data is represented by the value `null`. This is true for all data types, including numerical ones.
Polars also supports the value `NaN` (“Not a Number”) for columns with floating point numbers. The value `NaN` is considered to be a valid floating point value, which is different from missing data. We discuss the value `NaN` separately below.
When creating a series or a dataframe, you can set a value to `null` by using the appropriate construct for your language:
shape: (2, 1)
┌───────┐
│ value │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
│ null │
└───────┘
Difference from pandas
In pandas, the value used to represent missing data depends on the data type of the column.
In Polars, missing data is always represented by the value `null`.
Missing data metadata
Polars keeps track of some metadata regarding the missing data of each series. This metadata allows Polars to answer some basic queries about missing values in a very efficient way, namely how many values are missing and which ones are missing.
To determine how many values are missing from a column you can use the function `null_count`:
null_count_df = df.null_count()
print(null_count_df)
let null_count_df = df.null_count();
println!("{}", null_count_df);
shape: (1, 1)
┌───────┐
│ value │
│ --- │
│ u32 │
╞═══════╡
│ 1 │
└───────┘
The function `null_count` can be called on a dataframe, a column of a dataframe, or directly on a series. It is a cheap operation because the result is already known.
Polars uses something called a “validity bitmap” to know which values are missing in a series. The validity bitmap is memory efficient as it is bit encoded. If a series has length \(n\), then its validity bitmap will cost \(n / 8\) bytes. The function `is_null` uses the validity bitmap to efficiently report which values are `null` and which are not:
shape: (2, 1)
┌───────┐
│ value │
│ --- │
│ bool │
╞═══════╡
│ false │
│ true │
└───────┘
The function `is_null` can be used on a column of a dataframe or directly on a series. Again, this is a cheap operation because the result is already known by Polars.
Why does Polars waste memory on a validity bitmap?
It all comes down to a tradeoff. By using a bit more memory per column, Polars can be much more efficient when performing most operations on your columns. Without the validity bitmap, every computation would have to check each position of the series to see whether a legal value is present. With the validity bitmap, Polars knows upfront the positions where your operations can be applied.
Filling missing data
Missing data in a series can be filled with the function `fill_null`. You can specify how missing data is filled in a couple of different ways:
- a literal of the correct data type;
- a Polars expression, such as replacing with values computed from another column;
- a strategy based on neighbouring values, such as filling forwards or backwards; and
- interpolation.
To illustrate how each of these methods work we start by defining a simple dataframe with two missing values in the second column:
shape: (5, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ f64 ┆ i64 │
╞══════╪══════╡
│ 0.5 ┆ 1 │
│ 1.0 ┆ null │
│ 1.5 ┆ 3 │
│ 2.0 ┆ null │
│ 2.5 ┆ 5 │
└──────┴──────┘
Fill with a specified literal value
You can fill the missing data with a specified literal value. This literal value replaces all of the occurrences of the value `null`:
shape: (5, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ f64 ┆ i64 │
╞══════╪══════╡
│ 0.5 ┆ 1 │
│ 1.0 ┆ 3 │
│ 1.5 ┆ 3 │
│ 2.0 ┆ 3 │
│ 2.5 ┆ 5 │
└──────┴──────┘
However, this is actually just a special case of the general mechanism whereby the function `fill_null` replaces missing values with the corresponding values from the result of a Polars expression, as seen next.
Fill with an expression
In the general case, the missing data can be filled by extracting the corresponding values from the result of a general Polars expression. For example, we can fill the second column with values taken from the double of the first column:
fill_expression_df = df.with_columns(
    pl.col("col2").fill_null((2 * pl.col("col1")).cast(pl.Int64)),
)
print(fill_expression_df)
let fill_expression_df = df
    .clone()
    .lazy()
    .with_column(col("col2").fill_null((lit(2) * col("col1")).cast(DataType::Int64)))
    .collect()?;
println!("{}", fill_expression_df);
shape: (5, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ f64 ┆ i64 │
╞══════╪══════╡
│ 0.5 ┆ 1 │
│ 1.0 ┆ 2 │
│ 1.5 ┆ 3 │
│ 2.0 ┆ 4 │
│ 2.5 ┆ 5 │
└──────┴──────┘
Fill with a strategy based on neighbouring values
You can also fill the missing data by following a fill strategy based on the neighbouring values. The two simplest strategies look for the first non-`null` value that comes immediately before or immediately after the `null` value being filled:
fill_forward_df = df.with_columns(
    pl.col("col2").fill_null(strategy="forward").alias("forward"),
    pl.col("col2").fill_null(strategy="backward").alias("backward"),
)
print(fill_forward_df)
let fill_forward_df = df
    .clone()
    .lazy()
    .with_columns([
        col("col2")
            .fill_null_with_strategy(FillNullStrategy::Forward(None))
            .alias("forward"),
        col("col2")
            .fill_null_with_strategy(FillNullStrategy::Backward(None))
            .alias("backward"),
    ])
    .collect()?;
println!("{}", fill_forward_df);
shape: (5, 4)
┌──────┬──────┬─────────┬──────────┐
│ col1 ┆ col2 ┆ forward ┆ backward │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪═════════╪══════════╡
│ 0.5 ┆ 1 ┆ 1 ┆ 1 │
│ 1.0 ┆ null ┆ 1 ┆ 3 │
│ 1.5 ┆ 3 ┆ 3 ┆ 3 │
│ 2.0 ┆ null ┆ 3 ┆ 5 │
│ 2.5 ┆ 5 ┆ 5 ┆ 5 │
└──────┴──────┴─────────┴──────────┘
You can find other fill strategies in the API docs.
Fill with interpolation
Additionally, you can fill missing data with interpolation by using the function `interpolate` instead of the function `fill_null`:
fill_interpolation_df = df.with_columns(
    pl.col("col2").interpolate(),
)
print(fill_interpolation_df)
let fill_interpolation_df = df
    .clone()
    .lazy()
    .with_column(col("col2").interpolate(InterpolationMethod::Linear))
    .collect()?;
println!("{}", fill_interpolation_df);
shape: (5, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞══════╪══════╡
│ 0.5 ┆ 1.0 │
│ 1.0 ┆ 2.0 │
│ 1.5 ┆ 3.0 │
│ 2.0 ┆ 4.0 │
│ 2.5 ┆ 5.0 │
└──────┴──────┘
Not a Number, or `NaN` values
Missing data in a series is represented by the value `null`, regardless of the data type of the series. However, in columns that have a floating point data type, the value `NaN` can be used. These values can be created directly:
shape: (4, 1)
┌───────┐
│ value │
│ --- │
│ f64 │
╞═══════╡
│ 1.0 │
│ NaN │
│ NaN │
│ 3.0 │
└───────┘
The special value `NaN` might also arise as the result of a computation:
df = pl.DataFrame(
    {
        "dividend": [1, 0, -1],
        "divisor": [1, 0, -1],
    }
)
result = df.select(pl.col("dividend") / pl.col("divisor"))
print(result)
let df = df!(
    "dividend" => [1.0, 0.0, -1.0],
    "divisor" => [1.0, 0.0, -1.0],
)?;
let result = df
    .clone()
    .lazy()
    .select([col("dividend") / col("divisor")])
    .collect()?;
println!("{}", result);
shape: (3, 1)
┌──────────┐
│ dividend │
│ --- │
│ f64 │
╞══════════╡
│ 1.0 │
│ NaN │
│ 1.0 │
└──────────┘
Info
By default, a `NaN` value in an integer column causes the column to be cast to a float data type in pandas. This does not happen in Polars; instead, an exception is raised.
`NaN` values are considered to be a type of floating point data and are not considered to be missing data in Polars. This means:

- `NaN` values are not counted with the function `null_count`; and
- `NaN` values are filled when you use the specialised function `fill_nan` but are not filled with the function `fill_null`.
Polars has the functions `is_nan` and `fill_nan`, which work in a similar way to the functions `is_null` and `fill_null`. Unlike with missing data, Polars does not hold any metadata regarding the `NaN` values, so the function `is_nan` entails actual computation.
One further difference between the values `null` and `NaN` is that numerical aggregation functions, like `mean` and `sum`, skip the missing values when computing the result, whereas the value `NaN` is considered for the computation and typically propagates into the result. If desirable, this behavior can be avoided by replacing the occurrences of the value `NaN` with the value `null`:
mean_nan_df = nan_df.with_columns(
    pl.col("value").fill_nan(None).alias("replaced"),
).select(
    pl.all().mean().name.suffix("_mean"),
    pl.all().sum().name.suffix("_sum"),
)
print(mean_nan_df)
let mean_nan_df = nan_df
    .clone()
    .lazy()
    .with_column(col("value").fill_nan(Null {}.lit()).alias("replaced"))
    .select([
        col("*").mean().name().suffix("_mean"),
        col("*").sum().name().suffix("_sum"),
    ])
    .collect()?;
println!("{}", mean_nan_df);
shape: (1, 4)
┌────────────┬───────────────┬───────────┬──────────────┐
│ value_mean ┆ replaced_mean ┆ value_sum ┆ replaced_sum │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════╪═══════════════╪═══════════╪══════════════╡
│ NaN ┆ 2.0 ┆ NaN ┆ 4.0 │
└────────────┴───────────────┴───────────┴──────────────┘