Concatenation
There are a number of ways to concatenate data from separate DataFrames:
- two dataframes with the same columns can be vertically concatenated to make a longer dataframe
- two dataframes with non-overlapping columns can be horizontally concatenated to make a wider dataframe
- two dataframes with different numbers of rows and columns can be diagonally concatenated to make a dataframe which might be longer and/ or wider. Where column names overlap values will be vertically concatenated. Where column names do not overlap new rows and columns will be added. Missing values will be set as
null
Vertical concatenation - getting longer
In a vertical concatenation you combine all of the rows from a list of DataFrames
into a single longer DataFrame
.
df_v1 = pl.DataFrame(
{
"a": [1],
"b": [3],
}
)
df_v2 = pl.DataFrame(
{
"a": [2],
"b": [4],
}
)
df_vertical_concat = pl.concat(
[
df_v1,
df_v2,
],
how="vertical",
)
print(df_vertical_concat)
let df_v1 = df!(
"a"=> &[1],
"b"=> &[3],
)?;
let df_v2 = df!(
"a"=> &[2],
"b"=> &[4],
)?;
let df_vertical_concat = concat(
[df_v1.clone().lazy(), df_v2.clone().lazy()],
UnionArgs::default(),
)?
.collect()?;
println!("{}", &df_vertical_concat);
shape: (2, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 3 │
│ 2 ┆ 4 │
└─────┴─────┘
Vertical concatenation fails when the dataframes do not have the same column names.
Horizontal concatenation - getting wider
In a horizontal concatenation you combine all of the columns from a list of DataFrames
into a single wider DataFrame
.
df_h1 = pl.DataFrame(
{
"l1": [1, 2],
"l2": [3, 4],
}
)
df_h2 = pl.DataFrame(
{
"r1": [5, 6],
"r2": [7, 8],
"r3": [9, 10],
}
)
df_horizontal_concat = pl.concat(
[
df_h1,
df_h2,
],
how="horizontal",
)
print(df_horizontal_concat)
let df_h1 = df!(
"l1"=> &[1, 2],
"l2"=> &[3, 4],
)?;
let df_h2 = df!(
"r1"=> &[5, 6],
"r2"=> &[7, 8],
"r3"=> &[9, 10],
)?;
let df_horizontal_concat = polars::functions::concat_df_horizontal(&[df_h1, df_h2], true)?;
println!("{}", &df_horizontal_concat);
shape: (2, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ l1 ┆ l2 ┆ r1 ┆ r2 ┆ r3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1 ┆ 3 ┆ 5 ┆ 7 ┆ 9 │
│ 2 ┆ 4 ┆ 6 ┆ 8 ┆ 10 │
└─────┴─────┴─────┴─────┴─────┘
Horizontal concatenation fails when dataframes have overlapping columns.
When dataframes have different numbers of rows,
columns will be padded with null
values at the end up to the maximum length.
df_h1 = pl.DataFrame(
{
"l1": [1, 2],
"l2": [3, 4],
}
)
df_h2 = pl.DataFrame(
{
"r1": [5, 6, 7],
"r2": [8, 9, 10],
}
)
df_horizontal_concat = pl.concat(
[
df_h1,
df_h2,
],
how="horizontal",
)
print(df_horizontal_concat)
let df_h1 = df!(
"l1"=> &[1, 2],
"l2"=> &[3, 4],
)?;
let df_h2 = df!(
"r1"=> &[5, 6, 7],
"r2"=> &[8, 9, 10],
)?;
let df_horizontal_concat = polars::functions::concat_df_horizontal(&[df_h1, df_h2], true)?;
println!("{}", &df_horizontal_concat);
shape: (3, 4)
┌──────┬──────┬─────┬─────┐
│ l1 ┆ l2 ┆ r1 ┆ r2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪═════╪═════╡
│ 1 ┆ 3 ┆ 5 ┆ 8 │
│ 2 ┆ 4 ┆ 6 ┆ 9 │
│ null ┆ null ┆ 7 ┆ 10 │
└──────┴──────┴─────┴─────┘
Diagonal concatenation - getting longer, wider and null
ier
In a diagonal concatenation you combine all of the row and columns from a list of DataFrames
into a single longer and/or wider DataFrame
.
df_d1 = pl.DataFrame(
{
"a": [1],
"b": [3],
}
)
df_d2 = pl.DataFrame(
{
"a": [2],
"d": [4],
}
)
df_diagonal_concat = pl.concat(
[
df_d1,
df_d2,
],
how="diagonal",
)
print(df_diagonal_concat)
let df_d1 = df!(
"a"=> &[1],
"b"=> &[3],
)?;
let df_d2 = df!(
"a"=> &[2],
"d"=> &[4],)?;
let df_diagonal_concat = polars::functions::concat_df_diagonal(&[df_d1, df_d2])?;
println!("{}", &df_diagonal_concat);
shape: (2, 3)
┌─────┬──────┬──────┐
│ a ┆ b ┆ d │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪══════╪══════╡
│ 1 ┆ 3 ┆ null │
│ 2 ┆ null ┆ 4 │
└─────┴──────┴──────┘
Diagonal concatenation generates nulls when the column names do not overlap.
When the dataframe shapes do not match and we have an overlapping semantic key then we can join the dataframes instead of concatenating them.
Rechunking
Before a concatenation we have two dataframes df1
and df2
. Each column in df1
and df2
is in one or more chunks in memory. By default, during concatenation the chunks in each column are not made contiguous. This makes the concat operation faster and consume less memory but it may slow down future operations that would benefit from having the data be in contiguous memory. The process of copying the fragmented chunks into a single new chunk is known as rechunking. Rechunking is an expensive operation. Prior to version 0.20.26, the default was to perform a rechunk but in new versions, the default is not to.
If you do want Polars to rechunk the concatenated DataFrame
you specify rechunk = True
when doing the concatenation.