polars.DataFrame.group_by_dynamic#
- DataFrame.group_by_dynamic(
- index_column: IntoExpr,
- *,
- every: str | timedelta,
- period: str | timedelta | None = None,
- offset: str | timedelta | None = None,
- truncate: bool | None = None,
- include_boundaries: bool = False,
- closed: ClosedInterval = 'left',
- label: Label = 'left',
- by: IntoExpr | Iterable[IntoExpr] | None = None,
- start_by: StartBy = 'window',
- check_sorted: bool = True,
Group based on a time value (or index value of type Int32, Int64).
Time windows are calculated and rows are assigned to windows. Different from a normal group by is that a row can be member of multiple groups. By default, the windows look like:
[start, start + period)
[start + every, start + every + period)
[start + 2*every, start + 2*every + period)
…
where
start
is determined bystart_by
,offset
, andevery
(see parameter descriptions below).Warning
The index column must be sorted in ascending order. If
by
is passed, then the index column must be sorted in ascending order within each group.- Parameters:
- index_column
Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if
by
is specified, then it must be sorted in ascending order within each group).In case of a dynamic group by on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.
- every
interval of the window
- period
length of the window, if None it will equal ‘every’
- offset
offset of the window, only takes effect if
start_by
is'window'
. Defaults to negativeevery
.- truncate
truncate the time value to the window lower bound
Deprecated since version 0.19.4: Use
label
instead.- include_boundaries
Add the lower and upper bound of the window to the “_lower_boundary” and “_upper_boundary” columns. This will impact performance because it’s harder to parallelize
- closed{‘left’, ‘right’, ‘both’, ‘none’}
Define which sides of the temporal interval are closed (inclusive).
- label{‘left’, ‘right’, ‘datapoint’}
Define which label to use for the window:
‘left’: lower boundary of the window
‘right’: upper boundary of the window
‘datapoint’: the first value of the index column in the given window. If you don’t need the label to be at one of the boundaries, choose this option for maximum performance
- by
Also group by this column/these columns
- start_by{‘window’, ‘datapoint’, ‘monday’, ‘tuesday’, ‘wednesday’, ‘thursday’, ‘friday’, ‘saturday’, ‘sunday’}
The strategy to determine the start of the first window by.
‘window’: Start by taking the earliest timestamp, truncating it with
every
, and then addingoffset
. Note that weekly windows start on Monday.‘datapoint’: Start from the first encountered data point.
a day of the week (only takes effect if
every
contains'w'
):‘monday’: Start the window on the Monday before the first data point.
‘tuesday’: Start the window on the Tuesday before the first data point.
…
‘sunday’: Start the window on the Sunday before the first data point.
- check_sorted
When the
by
argument is given, polars can not check sortedness by the metadata and has to do a full scan on the index column to verify data is sorted. This is expensive. If you are sure the data within the by groups is sorted, you can set this toFalse
. Doing so incorrectly will lead to incorrect output
- Returns:
- DynamicGroupBy
Object you can call
.agg
on to aggregate by groups, the result of which will be sorted byindex_column
(but note that ifby
columns are passed, it will only be sorted within eachby
group).
See also
Notes
If you’re coming from pandas, then
# polars df.group_by_dynamic("ts", every="1d").agg(pl.col("value").sum())
is equivalent to
# pandas df.set_index("ts").resample("D")["value"].sum().reset_index()
though note that, unlike pandas, polars doesn’t add extra rows for empty windows. If you need
index_column
to be evenly spaced, then please combine withDataFrame.upsample()
.The
every
,period
andoffset
arguments are created with the following string language:1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds
By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.
In case of a group_by_dynamic on an integer column, the windows are defined by:
“1i” # length 1
“10i” # length 10
Examples
>>> from datetime import datetime >>> df = pl.DataFrame( ... { ... "time": pl.datetime_range( ... start=datetime(2021, 12, 16), ... end=datetime(2021, 12, 16, 3), ... interval="30m", ... eager=True, ... ), ... "n": range(7), ... } ... ) >>> df shape: (7, 2) ┌─────────────────────┬─────┐ │ time ┆ n │ │ --- ┆ --- │ │ datetime[μs] ┆ i64 │ ╞═════════════════════╪═════╡ │ 2021-12-16 00:00:00 ┆ 0 │ │ 2021-12-16 00:30:00 ┆ 1 │ │ 2021-12-16 01:00:00 ┆ 2 │ │ 2021-12-16 01:30:00 ┆ 3 │ │ 2021-12-16 02:00:00 ┆ 4 │ │ 2021-12-16 02:30:00 ┆ 5 │ │ 2021-12-16 03:00:00 ┆ 6 │ └─────────────────────┴─────┘
Group by windows of 1 hour starting at 2021-12-16 00:00:00.
>>> df.group_by_dynamic("time", every="1h", closed="right").agg(pl.col("n")) shape: (4, 2) ┌─────────────────────┬───────────┐ │ time ┆ n │ │ --- ┆ --- │ │ datetime[μs] ┆ list[i64] │ ╞═════════════════════╪═══════════╡ │ 2021-12-15 23:00:00 ┆ [0] │ │ 2021-12-16 00:00:00 ┆ [1, 2] │ │ 2021-12-16 01:00:00 ┆ [3, 4] │ │ 2021-12-16 02:00:00 ┆ [5, 6] │ └─────────────────────┴───────────┘
The window boundaries can also be added to the aggregation result
>>> df.group_by_dynamic( ... "time", every="1h", include_boundaries=True, closed="right" ... ).agg(pl.col("n").mean()) shape: (4, 4) ┌─────────────────────┬─────────────────────┬─────────────────────┬─────┐ │ _lower_boundary ┆ _upper_boundary ┆ time ┆ n │ │ --- ┆ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ f64 │ ╞═════════════════════╪═════════════════════╪═════════════════════╪═════╡ │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 0.0 │ │ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 1.5 │ │ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 3.5 │ │ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 5.5 │ └─────────────────────┴─────────────────────┴─────────────────────┴─────┘
When closed=”left”, the window excludes the right end of interval: [lower_bound, upper_bound)
>>> df.group_by_dynamic("time", every="1h", closed="left").agg(pl.col("n")) shape: (4, 2) ┌─────────────────────┬───────────┐ │ time ┆ n │ │ --- ┆ --- │ │ datetime[μs] ┆ list[i64] │ ╞═════════════════════╪═══════════╡ │ 2021-12-16 00:00:00 ┆ [0, 1] │ │ 2021-12-16 01:00:00 ┆ [2, 3] │ │ 2021-12-16 02:00:00 ┆ [4, 5] │ │ 2021-12-16 03:00:00 ┆ [6] │ └─────────────────────┴───────────┘
When closed=”both” the time values at the window boundaries belong to 2 groups.
>>> df.group_by_dynamic("time", every="1h", closed="both").agg(pl.col("n")) shape: (5, 2) ┌─────────────────────┬───────────┐ │ time ┆ n │ │ --- ┆ --- │ │ datetime[μs] ┆ list[i64] │ ╞═════════════════════╪═══════════╡ │ 2021-12-15 23:00:00 ┆ [0] │ │ 2021-12-16 00:00:00 ┆ [0, 1, 2] │ │ 2021-12-16 01:00:00 ┆ [2, 3, 4] │ │ 2021-12-16 02:00:00 ┆ [4, 5, 6] │ │ 2021-12-16 03:00:00 ┆ [6] │ └─────────────────────┴───────────┘
Dynamic group bys can also be combined with grouping on normal keys
>>> df = df.with_columns(groups=pl.Series(["a", "a", "a", "b", "b", "a", "a"])) >>> df shape: (7, 3) ┌─────────────────────┬─────┬────────┐ │ time ┆ n ┆ groups │ │ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ i64 ┆ str │ ╞═════════════════════╪═════╪════════╡ │ 2021-12-16 00:00:00 ┆ 0 ┆ a │ │ 2021-12-16 00:30:00 ┆ 1 ┆ a │ │ 2021-12-16 01:00:00 ┆ 2 ┆ a │ │ 2021-12-16 01:30:00 ┆ 3 ┆ b │ │ 2021-12-16 02:00:00 ┆ 4 ┆ b │ │ 2021-12-16 02:30:00 ┆ 5 ┆ a │ │ 2021-12-16 03:00:00 ┆ 6 ┆ a │ └─────────────────────┴─────┴────────┘ >>> df.group_by_dynamic( ... "time", ... every="1h", ... closed="both", ... by="groups", ... include_boundaries=True, ... ).agg(pl.col("n")) shape: (7, 5) ┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────────┐ │ groups ┆ _lower_boundary ┆ _upper_boundary ┆ time ┆ n │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ list[i64] │ ╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪═══════════╡ │ a ┆ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ [0] │ │ a ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ [0, 1, 2] │ │ a ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ [2] │ │ a ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ [5, 6] │ │ a ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ [6] │ │ b ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ [3, 4] │ │ b ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ [4] │ └────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────────┘
Dynamic group by on an index column
>>> df = pl.DataFrame( ... { ... "idx": pl.int_range(0, 6, eager=True), ... "A": ["A", "A", "B", "B", "B", "C"], ... } ... ) >>> ( ... df.group_by_dynamic( ... "idx", ... every="2i", ... period="3i", ... include_boundaries=True, ... closed="right", ... ).agg(pl.col("A").alias("A_agg_list")) ... ) shape: (4, 4) ┌─────────────────┬─────────────────┬─────┬─────────────────┐ │ _lower_boundary ┆ _upper_boundary ┆ idx ┆ A_agg_list │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 ┆ list[str] │ ╞═════════════════╪═════════════════╪═════╪═════════════════╡ │ -2 ┆ 1 ┆ -2 ┆ ["A", "A"] │ │ 0 ┆ 3 ┆ 0 ┆ ["A", "B", "B"] │ │ 2 ┆ 5 ┆ 2 ┆ ["B", "B", "C"] │ │ 4 ┆ 7 ┆ 4 ┆ ["C"] │ └─────────────────┴─────────────────┴─────┴─────────────────┘