polars.DataFrame.join_asof

Perform an asof join.

This is similar to a left-join except that we match on nearest key rather than equal keys.

Both DataFrames must be sorted by the asof join key.

For each row in the left DataFrame:

A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.

A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.

A “nearest” search selects the last row in the right DataFrame whose ‘on’ key is nearest to the left’s key.

The default is “backward”.

Parameters:

other

DataFrame to join with.

left_on

Join column of the left DataFrame.

right_on

Join column of the right DataFrame.

on

Join column of both DataFrames. If set, left_on and right_on should be None.

by

Join on these columns before performing the asof join.

by_left

Join on these columns of the left DataFrame before performing the asof join.

by_right

Join on these columns of the right DataFrame before performing the asof join.

strategy {‘backward’, ‘forward’, ‘nearest’}

Join strategy.

suffix

Suffix to append to columns with a duplicate name.

tolerance

Numeric tolerance. If set, a row is matched only when the nearest key is within this distance. For an asof join on columns of dtype “Date”, “Datetime”, “Duration” or “Time”, use the following string language:

1ns (1 nanosecond)

1us (1 microsecond)

1ms (1 millisecond)

1s (1 second)

1m (1 minute)

1h (1 hour)

1d (1 calendar day)

1w (1 calendar week)

1mo (1 calendar month)

1q (1 calendar quarter)

1y (1 calendar year)

1i (1 index count)

Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds

Suffix with “_saturating” to indicate that dates too large for their month should saturate at the largest date (e.g. 2022-02-29 -> 2022-02-28) instead of erroring.

By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight saving time). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.

allow_parallel

Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.

force_parallel

Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

Examples

>>> from datetime import datetime
>>> import polars as pl
>>> gdp = pl.DataFrame(
...     {
...         "date": [
...             datetime(2016, 1, 1),
...             datetime(2017, 1, 1),
...             datetime(2018, 1, 1),
...             datetime(2019, 1, 1),
...         ],  # note record date: Jan 1st (sorted!)
...         "gdp": [4164, 4411, 4566, 4696],
...     }
... ).set_sorted("date")
>>> population = pl.DataFrame(
...     {
...         "date": [
...             datetime(2016, 5, 12),
...             datetime(2017, 5, 12),
...             datetime(2018, 5, 12),
...             datetime(2019, 5, 12),
...         ],  # note record date: May 12th (sorted!)
...         "population": [82.19, 82.66, 83.12, 83.52],
...     }
... ).set_sorted("date")
>>> population.join_asof(gdp, on="date", strategy="backward")
shape: (4, 3)
┌─────────────────────┬────────────┬──────┐
│ date                ┆ population ┆ gdp  │
│ ---                 ┆ ---        ┆ ---  │
│ datetime[μs]        ┆ f64        ┆ i64  │
╞═════════════════════╪════════════╪══════╡
│ 2016-05-12 00:00:00 ┆ 82.19      ┆ 4164 │
│ 2017-05-12 00:00:00 ┆ 82.66      ┆ 4411 │
│ 2018-05-12 00:00:00 ┆ 83.12      ┆ 4566 │
│ 2019-05-12 00:00:00 ┆ 83.52      ┆ 4696 │
└─────────────────────┴────────────┴──────┘