Joins
Join strategies
Polars supports the following join strategies by specifying the how
argument:
Strategy | Description |
---|---|
inner |
Returns row with matching keys in both frames. Non-matching rows in either the left or right frame are discarded. |
left |
Returns all rows in the left dataframe, whether or not a match in the right-frame is found. Non-matching rows have their right columns null-filled. |
right |
Returns all rows in the right dataframe, whether or not a match in the left-frame is found. Non-matching rows have their left columns null-filled. |
full |
Returns all rows from both the left and right dataframe. If no match is found in one frame, columns from the other frame are null-filled. |
cross |
Returns the Cartesian product of all rows from the left frame with all rows from the right frame. Duplicates rows are retained; the table length of A cross-joined with B is always len(A) × len(B) . |
semi |
Returns all rows from the left frame in which the join key is also present in the right frame. |
anti |
Returns all rows from the left frame in which the join key is not present in the right frame. |
A separate coalesce
parameter determines whether to merge key columns with the same name from the left and right
frames.
Inner join
An inner
join produces a DataFrame
that contains only the rows where the join key exists in both DataFrames
. Let's
take for example the following two DataFrames
:
shape: (3, 2)
┌─────────────┬─────────┐
│ customer_id ┆ name │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════════════╪═════════╡
│ 1 ┆ Alice │
│ 2 ┆ Bob │
│ 3 ┆ Charlie │
└─────────────┴─────────┘
shape: (3, 3)
┌──────────┬─────────────┬────────┐
│ order_id ┆ customer_id ┆ amount │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞══════════╪═════════════╪════════╡
│ a ┆ 1 ┆ 100 │
│ b ┆ 2 ┆ 200 │
│ c ┆ 2 ┆ 300 │
└──────────┴─────────────┴────────┘
To get a DataFrame
with the orders and their associated customer we can do an inner
join on the customer_id
column:
df_inner_customer_join = df_customers.join(df_orders, on="customer_id", how="inner")
print(df_inner_customer_join)
let df_inner_customer_join = df_customers
.clone()
.lazy()
.join(
df_orders.clone().lazy(),
[col("customer_id")],
[col("customer_id")],
JoinArgs::new(JoinType::Inner),
)
.collect()?;
println!("{}", &df_inner_customer_join);
shape: (3, 4)
┌─────────────┬───────┬──────────┬────────┐
│ customer_id ┆ name ┆ order_id ┆ amount │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════════════╪═══════╪══════════╪════════╡
│ 1 ┆ Alice ┆ a ┆ 100 │
│ 2 ┆ Bob ┆ b ┆ 200 │
│ 2 ┆ Bob ┆ c ┆ 300 │
└─────────────┴───────┴──────────┴────────┘
Left join
The left
outer join produces a DataFrame
that contains all the rows from the left DataFrame
and only the rows from
the right DataFrame
where the join key exists in the left DataFrame
. If we now take the example from above and want
to have a DataFrame
with all the customers and their associated orders (regardless of whether they have placed an
order or not) we can do a left
join:
df_left_join = df_customers.join(df_orders, on="customer_id", how="left")
print(df_left_join)
let df_left_join = df_customers
.clone()
.lazy()
.join(
df_orders.clone().lazy(),
[col("customer_id")],
[col("customer_id")],
JoinArgs::new(JoinType::Left),
)
.collect()?;
println!("{}", &df_left_join);
shape: (4, 4)
┌─────────────┬─────────┬──────────┬────────┐
│ customer_id ┆ name ┆ order_id ┆ amount │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════════════╪═════════╪══════════╪════════╡
│ 1 ┆ Alice ┆ a ┆ 100 │
│ 2 ┆ Bob ┆ b ┆ 200 │
│ 2 ┆ Bob ┆ c ┆ 300 │
│ 3 ┆ Charlie ┆ null ┆ null │
└─────────────┴─────────┴──────────┴────────┘
Notice, that the fields for the customer with the customer_id
of 3
are null, as there are no orders for this
customer.
Right join
The right
outer join produces a DataFrame
that contains all the rows from the right DataFrame
and only the rows from
the left DataFrame
where the join key exists in the right DataFrame
. If we now take the example from above and want
to have a DataFrame
with all the customers and their associated orders (regardless of whether they have placed an
order or not) we can do a right
join:
df_right_join = df_orders.join(df_customers, on="customer_id", how="right")
print(df_right_join)
let df_right_join = df_orders
.clone()
.lazy()
.join(
df_customers.clone().lazy(),
[col("customer_id")],
[col("customer_id")],
JoinArgs::new(JoinType::Right),
)
.collect()?;
println!("{}", &df_right_join);
shape: (4, 4)
┌──────────┬────────┬─────────────┬─────────┐
│ order_id ┆ amount ┆ customer_id ┆ name │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ str │
╞══════════╪════════╪═════════════╪═════════╡
│ a ┆ 100 ┆ 1 ┆ Alice │
│ b ┆ 200 ┆ 2 ┆ Bob │
│ c ┆ 300 ┆ 2 ┆ Bob │
│ null ┆ null ┆ 3 ┆ Charlie │
└──────────┴────────┴─────────────┴─────────┘
Notice, that the fields for the customer with the customer_id
of 3
are null, as there are no orders for this
customer.
Outer join
The full
outer join produces a DataFrame
that contains all the rows from both DataFrames
. Columns are null, if the
join key does not exist in the source DataFrame
. Doing a full
outer join on the two DataFrames
from above produces
a similar DataFrame
to the left
join:
df_outer_join = df_customers.join(df_orders, on="customer_id", how="full")
print(df_outer_join)
let df_full_join = df_customers
.clone()
.lazy()
.join(
df_orders.clone().lazy(),
[col("customer_id")],
[col("customer_id")],
JoinArgs::new(JoinType::Full),
)
.collect()?;
println!("{}", &df_full_join);
shape: (4, 5)
┌─────────────┬─────────┬──────────┬───────────────────┬────────┐
│ customer_id ┆ name ┆ order_id ┆ customer_id_right ┆ amount │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 ┆ i64 │
╞═════════════╪═════════╪══════════╪═══════════════════╪════════╡
│ 1 ┆ Alice ┆ a ┆ 1 ┆ 100 │
│ 2 ┆ Bob ┆ b ┆ 2 ┆ 200 │
│ 2 ┆ Bob ┆ c ┆ 2 ┆ 300 │
│ 3 ┆ Charlie ┆ null ┆ null ┆ null │
└─────────────┴─────────┴──────────┴───────────────────┴────────┘
df_outer_coalesce_join = df_customers.join(
df_orders, on="customer_id", how="full", coalesce=True
)
print(df_outer_coalesce_join)
let df_full_join = df_customers
.clone()
.lazy()
.join(
df_orders.clone().lazy(),
[col("customer_id")],
[col("customer_id")],
JoinArgs::new(JoinType::Full).with_coalesce(JoinCoalesce::CoalesceColumns),
)
.collect()?;
println!("{}", &df_full_join);
shape: (4, 4)
┌─────────────┬─────────┬──────────┬────────┐
│ customer_id ┆ name ┆ order_id ┆ amount │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════════════╪═════════╪══════════╪════════╡
│ 1 ┆ Alice ┆ a ┆ 100 │
│ 2 ┆ Bob ┆ b ┆ 200 │
│ 2 ┆ Bob ┆ c ┆ 300 │
│ 3 ┆ Charlie ┆ null ┆ null │
└─────────────┴─────────┴──────────┴────────┘
Cross join
A cross
join is a Cartesian product of the two DataFrames
. This means that every row in the left DataFrame
is
joined with every row in the right DataFrame
. The cross
join is useful for creating a DataFrame
with all possible
combinations of the columns in two DataFrames
. Let's take for example the following two DataFrames
.
shape: (3, 1)
┌───────┐
│ color │
│ --- │
│ str │
╞═══════╡
│ red │
│ blue │
│ green │
└───────┘
shape: (3, 1)
┌──────┐
│ size │
│ --- │
│ str │
╞══════╡
│ S │
│ M │
│ L │
└──────┘
We can now create a DataFrame
containing all possible combinations of the colors and sizes with a cross
join:
shape: (9, 2)
┌───────┬──────┐
│ color ┆ size │
│ --- ┆ --- │
│ str ┆ str │
╞═══════╪══════╡
│ red ┆ S │
│ red ┆ M │
│ red ┆ L │
│ blue ┆ S │
│ blue ┆ M │
│ blue ┆ L │
│ green ┆ S │
│ green ┆ M │
│ green ┆ L │
└───────┴──────┘
The inner
, left
, right
, full
and cross
join strategies are standard amongst dataframe libraries. We provide more
details on the less familiar semi
, anti
and asof
join strategies below.
Semi join
The semi
join returns all rows from the left frame in which the join key is also present in the right frame. Consider
the following scenario: a car rental company has a DataFrame
showing the cars that it owns with each car having a
unique id
.
shape: (3, 2)
┌─────┬────────┐
│ id ┆ make │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪════════╡
│ a ┆ ford │
│ b ┆ toyota │
│ c ┆ bmw │
└─────┴────────┘
The company has another DataFrame
showing each repair job carried out on a vehicle.
shape: (2, 2)
┌─────┬──────┐
│ id ┆ cost │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ c ┆ 100 │
│ c ┆ 200 │
└─────┴──────┘
You want to answer this question: which of the cars have had repairs carried out?
An inner join does not answer this question directly as it produces a DataFrame
with multiple rows for each car that
has had multiple repair jobs:
shape: (2, 3)
┌─────┬──────┬──────┐
│ id ┆ make ┆ cost │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════╪══════╪══════╡
│ c ┆ bmw ┆ 100 │
│ c ┆ bmw ┆ 200 │
└─────┴──────┴──────┘
However, a semi join produces a single row for each car that has had a repair job carried out.
shape: (1, 2)
┌─────┬──────┐
│ id ┆ make │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪══════╡
│ c ┆ bmw │
└─────┴──────┘
Anti join
Continuing this example, an alternative question might be: which of the cars have not had a repair job carried out?
An anti join produces a DataFrame
showing all the cars from df_cars
where the id
is not present in
the df_repairs
DataFrame
.
shape: (2, 2)
┌─────┬────────┐
│ id ┆ make │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪════════╡
│ a ┆ ford │
│ b ┆ toyota │
└─────┴────────┘
Asof join
An asof
join is like a left join except that we match on nearest key rather than equal keys.
In Polars we can do an asof join with the join_asof
method.
Consider the following scenario: a stock market broker has a DataFrame
called df_trades
showing transactions it has
made for different stocks.
df_trades = pl.DataFrame(
{
"time": [
datetime(2020, 1, 1, 9, 1, 0),
datetime(2020, 1, 1, 9, 1, 0),
datetime(2020, 1, 1, 9, 3, 0),
datetime(2020, 1, 1, 9, 6, 0),
],
"stock": ["A", "B", "B", "C"],
"trade": [101, 299, 301, 500],
}
)
print(df_trades)
use chrono::prelude::*;
let df_trades = df!(
"time"=> &[
NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 1, 0).unwrap(),
NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 1, 0).unwrap(),
NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 3, 0).unwrap(),
NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 6, 0).unwrap(),
],
"stock"=> &["A", "B", "B", "C"],
"trade"=> &[101, 299, 301, 500],
)?;
println!("{}", &df_trades);
shape: (4, 3)
┌─────────────────────┬───────┬───────┐
│ time ┆ stock ┆ trade │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 │
╞═════════════════════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A ┆ 101 │
│ 2020-01-01 09:01:00 ┆ B ┆ 299 │
│ 2020-01-01 09:03:00 ┆ B ┆ 301 │
│ 2020-01-01 09:06:00 ┆ C ┆ 500 │
└─────────────────────┴───────┴───────┘
The broker has another DataFrame
called df_quotes
showing prices it has quoted for these stocks.
df_quotes = pl.DataFrame(
{
"time": [
datetime(2020, 1, 1, 9, 0, 0),
datetime(2020, 1, 1, 9, 2, 0),
datetime(2020, 1, 1, 9, 4, 0),
datetime(2020, 1, 1, 9, 6, 0),
],
"stock": ["A", "B", "C", "A"],
"quote": [100, 300, 501, 102],
}
)
print(df_quotes)
let df_quotes = df!(
"time"=> &[
NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 0, 0).unwrap(),
NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 2, 0).unwrap(),
NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 4, 0).unwrap(),
NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 6, 0).unwrap(),
],
"stock"=> &["A", "B", "C", "A"],
"quote"=> &[100, 300, 501, 102],
)?;
println!("{}", &df_quotes);
shape: (4, 3)
┌─────────────────────┬───────┬───────┐
│ time ┆ stock ┆ quote │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 │
╞═════════════════════╪═══════╪═══════╡
│ 2020-01-01 09:00:00 ┆ A ┆ 100 │
│ 2020-01-01 09:02:00 ┆ B ┆ 300 │
│ 2020-01-01 09:04:00 ┆ C ┆ 501 │
│ 2020-01-01 09:06:00 ┆ A ┆ 102 │
└─────────────────────┴───────┴───────┘
You want to produce a DataFrame
showing for each trade the most recent quote provided before the trade. You do this
with join_asof
(using the default strategy = "backward"
).
To avoid joining between trades on one stock with a quote on another you must specify an exact preliminary join on the
stock column with by="stock"
.
shape: (4, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ time ┆ stock ┆ trade ┆ quote │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 ┆ i64 │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A ┆ 101 ┆ 100 │
│ 2020-01-01 09:01:00 ┆ B ┆ 299 ┆ null │
│ 2020-01-01 09:03:00 ┆ B ┆ 301 ┆ 300 │
│ 2020-01-01 09:06:00 ┆ C ┆ 500 ┆ 501 │
└─────────────────────┴───────┴───────┴───────┘
If you want to make sure that only quotes within a certain time range are joined to the trades you can specify
the tolerance
argument. In this case we want to make sure that the last preceding quote is within 1 minute of the
trade so we set tolerance = "1m"
.
df_asof_tolerance_join = df_trades.join_asof(
df_quotes, on="time", by="stock", tolerance="1m"
)
print(df_asof_tolerance_join)
shape: (4, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ time ┆ stock ┆ trade ┆ quote │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 ┆ i64 │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A ┆ 101 ┆ 100 │
│ 2020-01-01 09:01:00 ┆ B ┆ 299 ┆ null │
│ 2020-01-01 09:03:00 ┆ B ┆ 301 ┆ 300 │
│ 2020-01-01 09:06:00 ┆ C ┆ 500 ┆ null │
└─────────────────────┴───────┴───────┴───────┘