LazyFrame
This page gives an overview of all public LazyFrame methods.
- class polars.LazyFrame(
 - data: FrameInitTypes | None = None,
 - schema: SchemaDefinition | None = None,
 - *,
 - schema_overrides: SchemaDict | None = None,
 - orient: Orientation | None = None,
 - infer_schema_length: int | None = 100,
 - nan_to_null: bool = False,
 Representation of a Lazy computation graph/query against a DataFrame.
This allows for whole-query optimisation in addition to parallelism, and is the preferred (and highest-performance) mode of operation for polars.
- Parameters:
 - data : dict, Sequence, ndarray, Series, or pandas.DataFrame
 Two-dimensional data in various forms; dict input must contain Sequences, Generators, or a range. Sequence may contain Series or other Sequences.
- schema : Sequence of str, (str,DataType) pairs, or a {str:DataType,} dict
 The DataFrame schema may be declared in several ways:
As a dict of {name:type} pairs; if type is None, it will be auto-inferred.
As a list of column names; in this case types are automatically inferred.
As a list of (name,type) pairs; this is equivalent to the dictionary form.
If you supply a list of column names that does not match the names in the underlying data, the names given here will overwrite them. The number of names given in the schema should match the underlying data dimensions.
- schema_overrides : dict, default None
 Support type specification or override of one or more columns; note that any dtypes inferred from the schema param will be overridden.
The number of entries in the schema should match the underlying data dimensions, unless a sequence of dictionaries is being passed, in which case a partial schema can be declared to prevent specific fields from being loaded.
- orient : {‘col’, ‘row’}, default None
 Whether to interpret two-dimensional data as columns or as rows. If None, the orientation is inferred by matching the columns and data dimensions. If this does not yield conclusive results, column orientation is used.
- infer_schema_length : int, default 100
 Maximum number of rows to read for schema inference; only applies if the input data is a sequence or generator of rows; other input is read as-is.
- nan_to_null : bool, default False
 If the data comes from one or more numpy arrays, can optionally convert input data np.nan values to null instead. This is a no-op for all other input data.
Notes
Initialising LazyFrame(...) directly is equivalent to DataFrame(...).lazy().
Examples
Constructing a LazyFrame directly from a dictionary:
>>> data = {"a": [1, 2], "b": [3, 4]}
>>> lf = pl.LazyFrame(data)
>>> lf.collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘
Notice that the dtypes are automatically inferred as polars Int64:
>>> lf.dtypes
[Int64, Int64]
To specify a more detailed/specific frame schema you can supply the schema parameter with a dictionary of (name,dtype) pairs…
>>> data = {"col1": [0, 2], "col2": [3, 7]}
>>> lf2 = pl.LazyFrame(data, schema={"col1": pl.Float32, "col2": pl.Int64})
>>> lf2.collect()
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f32  ┆ i64  │
╞══════╪══════╡
│ 0.0  ┆ 3    │
│ 2.0  ┆ 7    │
└──────┴──────┘
…a sequence of (name,dtype) pairs…
>>> data = {"col1": [1, 2], "col2": [3, 4]}
>>> lf3 = pl.LazyFrame(data, schema=[("col1", pl.Float32), ("col2", pl.Int64)])
>>> lf3.collect()
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f32  ┆ i64  │
╞══════╪══════╡
│ 1.0  ┆ 3    │
│ 2.0  ┆ 4    │
└──────┴──────┘
…or a list of typed Series.
>>> data = [
...     pl.Series("col1", [1, 2], dtype=pl.Float32),
...     pl.Series("col2", [3, 4], dtype=pl.Int64),
... ]
>>> lf4 = pl.LazyFrame(data)
>>> lf4.collect()
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f32  ┆ i64  │
╞══════╪══════╡
│ 1.0  ┆ 3    │
│ 2.0  ┆ 4    │
└──────┴──────┘
Constructing a LazyFrame from a numpy ndarray, specifying column names:
>>> import numpy as np
>>> data = np.array([(1, 2), (3, 4)], dtype=np.int64)
>>> lf5 = pl.LazyFrame(data, schema=["a", "b"], orient="col")
>>> lf5.collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘
Constructing a LazyFrame from a list of lists, row orientation inferred:
>>> data = [[1, 2, 3], [4, 5, 6]]
>>> lf6 = pl.LazyFrame(data, schema=["a", "b", "c"])
>>> lf6.collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
│ 4   ┆ 5   ┆ 6   │
└─────┴─────┴─────┘
Methods:
Approximate count of unique values.
Approximate count of unique values.
Return the k smallest elements.
Cache the result once the execution of the physical plan hits this node.
Cast LazyFrame column(s) to the specified dtype(s).
Create an empty copy of the current LazyFrame, with zero to 'n' rows.
Create a copy of this LazyFrame.
Materialize this LazyFrame into a DataFrame.
Collect DataFrame asynchronously in thread pool.
Read a logical plan from a JSON file to construct a LazyFrame.
Remove columns from the DataFrame.
Drop all rows that contain null values.
Create a string representation of the query plan.
Explode the DataFrame to long format by exploding the given columns.
Collect a small number of rows for debugging purposes.
Fill floating point NaN values.
Fill null values using the specified value or strategy.
Filter the rows in the LazyFrame based on a predicate expression.
Get the first row of the DataFrame.
Read a logical plan from a JSON string to construct a LazyFrame.
Take every nth row in the LazyFrame and return as a new LazyFrame.
Start a group by operation.
Group based on a time value (or index value of type Int32, Int64).
Create rolling groups based on a time, Int32, or Int64 column.
Start a group by operation.
Group based on a time value (or index value of type Int32, Int64).
Create rolling groups based on a time, Int32, or Int64 column.
Get the first n rows.
Inspect a node in the computation graph.
Interpolate intermediate values.
Add a join operation to the Logical Plan.
Perform an asof join.
Get the last row of the DataFrame.
Return lazy representation, i.e. itself.
Get the first n rows.
Apply a custom function.
Apply a custom function.
Aggregate the columns in the LazyFrame to their maximum value.
Aggregate the columns in the LazyFrame to their mean value.
Aggregate the columns in the LazyFrame to their median value.
Unpivot a DataFrame from wide to long format.
Take two sorted DataFrames and merge them by the sorted key.
Aggregate the columns in the LazyFrame to their minimum value.
Aggregate the columns in the LazyFrame as the sum of their null value count.
Offers a structured way to apply a sequence of user-defined functions (UDFs).
Profile a LazyFrame.
Aggregate the columns in the LazyFrame to their quantile value.
Read a logical plan from a JSON file to construct a LazyFrame.
Rename column names.
Reverse the DataFrame.
Create rolling groups based on a time, Int32, or Int64 column.
Select columns from this LazyFrame.
Select columns from this LazyFrame.
Serialize the logical plan of this LazyFrame to a file or string in JSON format.
Indicate that one or multiple columns are sorted.
Shift values by the given number of indices.
Shift values by the given number of places and fill the resulting null values.
Show a plot of the query plan.
Evaluate the query in streaming mode and write to a CSV file.
Evaluate the query in streaming mode and write to an IPC file.
sink_ndjson: Persists a LazyFrame at the provided path.
Evaluate the query in streaming mode and write to a Parquet file.
Get a slice of this DataFrame.
Sort the DataFrame by the given columns.
Aggregate the columns in the LazyFrame to their standard deviation value.
Aggregate the columns in the LazyFrame to their sum value.
Get the last n rows.
Take every nth row in the LazyFrame and return as a new LazyFrame.
Return the k largest elements.
Drop duplicate rows from this DataFrame.
Decompose struct columns into separate columns for each of their fields.
Update the values in this LazyFrame with the non-null values in other.
Aggregate the columns in the LazyFrame to their variance value.
Add columns to this DataFrame.
Add columns to this DataFrame.
Add an external context to the computation graph.
Add a column at index 0 that counts the rows.
Serialize the logical plan of this LazyFrame to a file or string in JSON format.
Attributes:
Get column names.
Get dtypes of columns in LazyFrame.
Get a dict[column name, DataType].
Get the width of the LazyFrame.
- approx_n_unique() → Self
 Approximate count of unique values.
This is done using the HyperLogLog++ algorithm for cardinality estimation.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.approx_n_unique().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 4   ┆ 2   │
└─────┴─────┘
- approx_unique() → Self
 Approximate count of unique values.
Deprecated since version 0.18.12: This method has been renamed to LazyFrame.approx_n_unique().
- bottom_k(
 - k: int,
 - *,
 - by: IntoExpr | Iterable[IntoExpr],
 - descending: bool | Sequence[bool] = False,
 - nulls_last: bool = False,
 - maintain_order: bool = False,
 Return the k smallest elements.
If descending=True, the largest elements will be given.
- Parameters:
 - k
 Number of rows to return.
- by
 Column(s) included in sort order. Accepts expression input. Strings are parsed as column names.
- descending
 If True, return the k largest elements instead of the k smallest. When sorting by multiple columns, the direction can be specified per column by passing a sequence of booleans.
- nulls_last
 Place null values last.
- maintain_order
 Whether the order should be maintained if elements are equal. Note that if true, streaming is not possible and performance might be worse since this requires a stable search.
See also
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [2, 1, 1, 3, 2, 1],
...     }
... )
Get the rows which contain the 4 smallest values in column b.
>>> lf.bottom_k(4, by="b").collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ b   ┆ 1   │
│ a   ┆ 1   │
│ c   ┆ 1   │
│ a   ┆ 2   │
└─────┴─────┘
Get the rows which contain the 4 smallest values when sorting on column a and b.
>>> lf.bottom_k(4, by=["a", "b"]).collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 1   │
│ a   ┆ 2   │
│ b   ┆ 1   │
│ b   ┆ 2   │
└─────┴─────┘
- cache() → Self
 Cache the result once the execution of the physical plan hits this node.
- cast(
 - dtypes: Mapping[ColumnNameOrSelector, PolarsDataType] | PolarsDataType,
 - *,
 - strict: bool = True,
 Cast LazyFrame column(s) to the specified dtype(s).
- Parameters:
 - dtypes
 Mapping of column names (or selector) to dtypes, or a single dtype to which all columns will be cast.
- strict
 Throw an error if a cast could not be done (for instance, due to an overflow).
Examples
>>> from datetime import date
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": [date(2020, 1, 2), date(2021, 3, 4), date(2022, 5, 6)],
...     }
... )
Cast specific frame columns to the specified dtypes:
>>> lf.cast({"foo": pl.Float32, "bar": pl.UInt8}).collect()
shape: (3, 3)
┌─────┬─────┬────────────┐
│ foo ┆ bar ┆ ham        │
│ --- ┆ --- ┆ ---        │
│ f32 ┆ u8  ┆ date       │
╞═════╪═════╪════════════╡
│ 1.0 ┆ 6   ┆ 2020-01-02 │
│ 2.0 ┆ 7   ┆ 2021-03-04 │
│ 3.0 ┆ 8   ┆ 2022-05-06 │
└─────┴─────┴────────────┘
Cast all frame columns to the specified dtype:
>>> lf.cast(pl.Utf8).collect().to_dict(as_series=False)
{'foo': ['1', '2', '3'],
 'bar': ['6.0', '7.0', '8.0'],
 'ham': ['2020-01-02', '2021-03-04', '2022-05-06']}
Use selectors to define the columns being cast:
>>> import polars.selectors as cs
>>> lf.cast({cs.numeric(): pl.UInt32, cs.temporal(): pl.Utf8}).collect()
shape: (3, 3)
┌─────┬─────┬────────────┐
│ foo ┆ bar ┆ ham        │
│ --- ┆ --- ┆ ---        │
│ u32 ┆ u32 ┆ str        │
╞═════╪═════╪════════════╡
│ 1   ┆ 6   ┆ 2020-01-02 │
│ 2   ┆ 7   ┆ 2021-03-04 │
│ 3   ┆ 8   ┆ 2022-05-06 │
└─────┴─────┴────────────┘
- clear(n: int = 0) → LazyFrame
 Create an empty copy of the current LazyFrame, with zero to ‘n’ rows.
Returns a copy with an identical schema but no data.
- Parameters:
 - n
 Number of (empty) rows to return in the cleared frame.
See also
clone: Cheap deepcopy/clone.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [None, 2, 3, 4],
...         "b": [0.5, None, 2.5, 13],
...         "c": [True, True, False, None],
...     }
... )
>>> lf.clear().fetch()
shape: (0, 3)
┌─────┬─────┬──────┐
│ a   ┆ b   ┆ c    │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ f64 ┆ bool │
╞═════╪═════╪══════╡
└─────┴─────┴──────┘
>>> lf.clear(2).fetch()
shape: (2, 3)
┌──────┬──────┬──────┐
│ a    ┆ b    ┆ c    │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ f64  ┆ bool │
╞══════╪══════╪══════╡
│ null ┆ null ┆ null │
│ null ┆ null ┆ null │
└──────┴──────┴──────┘
- clone() → Self
 Create a copy of this LazyFrame.
This is a cheap operation that does not copy data.
See also
clear: Create an empty copy of the current LazyFrame, with identical schema but no data.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": [None, 2, 3, 4], ... "b": [0.5, None, 2.5, 13], ... "c": [True, True, False, None], ... } ... ) >>> lf.clone() <LazyFrame [3 cols, {"a": Int64 … "c": Boolean}] at ...>
- collect(
 - *,
 - type_coercion: bool = True,
 - predicate_pushdown: bool = True,
 - projection_pushdown: bool = True,
 - simplify_expression: bool = True,
 - slice_pushdown: bool = True,
 - comm_subplan_elim: bool = True,
 - comm_subexpr_elim: bool = True,
 - no_optimization: bool = False,
 - streaming: bool = False,
 - _eager: bool = False,
 Materialize this LazyFrame into a DataFrame.
By default, all query optimizations are enabled. Individual optimizations may be disabled by setting the corresponding parameter to False.
- Parameters:
 - type_coercion
 Do type coercion optimization.
- predicate_pushdown
 Do predicate pushdown optimization.
- projection_pushdown
 Do projection pushdown optimization.
- simplify_expression
 Run simplify expressions optimization.
- slice_pushdown
 Slice pushdown optimization.
- comm_subplan_elim
 Will try to cache branching subplans that occur on self-joins or unions.
- comm_subexpr_elim
 Common subexpressions will be cached and reused.
- no_optimization
 Turn off (certain) optimizations.
- streaming
 Process the query in batches to handle larger-than-memory data. If set to False (default), the entire query is processed in a single batch.
Warning
This functionality is currently in an alpha state.
Note
Use explain() to see if Polars can process the query in streaming mode.
- Returns:
 - DataFrame
 
See also
fetch: Run the query on the first n rows only for debugging purposes.
explain: Print the query plan that is evaluated with collect.
profile: Collect the LazyFrame and time each node in the computation graph.
polars.collect_all: Collect multiple LazyFrames at the same time.
polars.Config.set_streaming_chunk_size: Set the size of streaming batches.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a").agg(pl.all().sum()).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 4   ┆ 10  │
│ b   ┆ 11  ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘
Collect in streaming mode:
>>> lf.group_by("a").agg(pl.all().sum()).collect(
...     streaming=True
... )
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 4   ┆ 10  │
│ b   ┆ 11  ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘
- collect_async(
 - *,
 - gevent: Literal[True],
 - type_coercion: bool = True,
 - predicate_pushdown: bool = True,
 - projection_pushdown: bool = True,
 - simplify_expression: bool = True,
 - no_optimization: bool = True,
 - slice_pushdown: bool = True,
 - comm_subplan_elim: bool = True,
 - comm_subexpr_elim: bool = True,
 - streaming: bool = True,
 - collect_async(
 - *,
 - gevent: Literal[False] = False,
 - type_coercion: bool = True,
 - predicate_pushdown: bool = True,
 - projection_pushdown: bool = True,
 - simplify_expression: bool = True,
 - no_optimization: bool = True,
 - slice_pushdown: bool = True,
 - comm_subplan_elim: bool = True,
 - comm_subexpr_elim: bool = True,
 - streaming: bool = True,
 Collect DataFrame asynchronously in thread pool.
Collects into a DataFrame (like collect()), but instead of returning the DataFrame directly, the collection is scheduled inside a thread pool, while this method returns almost instantly.
May be useful if you use gevent or asyncio and want to release control to other greenlets/tasks while LazyFrames are being collected.
- Parameters:
 - gevent
 Return wrapper to gevent.event.AsyncResult instead of Awaitable
- type_coercion
 Do type coercion optimization.
- predicate_pushdown
 Do predicate pushdown optimization.
- projection_pushdown
 Do projection pushdown optimization.
- simplify_expression
 Run simplify expressions optimization.
- no_optimization
 Turn off (certain) optimizations.
- slice_pushdown
 Slice pushdown optimization.
- comm_subplan_elim
 Will try to cache branching subplans that occur on self-joins or unions.
- comm_subexpr_elim
 Common subexpressions will be cached and reused.
- streaming
 Run parts of the query in a streaming fashion (this is in an alpha state)
- Returns:
 - If gevent=False (default), returns an awaitable.
 - If gevent=True, returns a wrapper that has a .get(block=True, timeout=None) method.
Warning
This functionality is experimental and may change without it being considered a breaking change.
See also
polars.collect_all: Collect multiple LazyFrames at the same time.
polars.collect_all_async: Collect multiple LazyFrames at the same time lazily.
Notes
In case of error, set_exception is used on asyncio.Future/gevent.event.AsyncResult and the exception will be reraised by them.
Examples
>>> import asyncio
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> async def main():
...     return await (
...         lf.group_by("a", maintain_order=True)
...         .agg(pl.all().sum())
...         .collect_async()
...     )
...
>>> asyncio.run(main())
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 4   ┆ 10  │
│ b   ┆ 11  ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘
- property columns: list[str]
 Get column names.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... ).select("foo", "bar")
>>> lf.columns
['foo', 'bar']
- classmethod deserialize(source: str | Path | IOBase) → Self
 Read a logical plan from a JSON file to construct a LazyFrame.
- Parameters:
 - source
 Path to a file or a file-like object (by file-like object, we refer to objects that have a read() method, such as a file handler (e.g. via builtin open function) or BytesIO).
See also
Examples
>>> import io
>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum()
>>> json = lf.serialize()
>>> pl.LazyFrame.deserialize(io.StringIO(json)).collect()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 6   │
└─────┘
- drop(
 - columns: ColumnNameOrSelector | Collection[ColumnNameOrSelector],
 - *more_columns: ColumnNameOrSelector,
 Remove columns from the DataFrame.
- Parameters:
 - columns
 Name of the column(s) that should be removed from the DataFrame.
- *more_columns
 Additional columns to drop, specified as positional arguments.
Examples
Drop a single column by passing the name of that column.
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.drop("ham").collect()
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 6.0 │
│ 2   ┆ 7.0 │
│ 3   ┆ 8.0 │
└─────┴─────┘
Drop multiple columns by passing a selector.
>>> import polars.selectors as cs
>>> lf.drop(cs.numeric()).collect()
shape: (3, 1)
┌─────┐
│ ham │
│ --- │
│ str │
╞═════╡
│ a   │
│ b   │
│ c   │
└─────┘
Use positional arguments to drop multiple columns.
>>> lf.drop("foo", "ham").collect()
shape: (3, 1)
┌─────┐
│ bar │
│ --- │
│ f64 │
╞═════╡
│ 6.0 │
│ 7.0 │
│ 8.0 │
└─────┘
- drop_nulls(
 - subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None,
 Drop all rows that contain null values.
The original order of the remaining rows is preserved.
- Parameters:
 - subset
 Column name(s) for which null values are considered. If set to None (default), use all columns.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, None, 8],
...         "ham": ["a", "b", None],
...     }
... )
The default behavior of this method is to drop rows where any single value of the row is null.
>>> lf.drop_nulls().collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘
This behaviour can be constrained to consider only a subset of columns, as defined by name or with a selector. For example, dropping rows if there is a null in any of the integer columns:
>>> import polars.selectors as cs
>>> lf.drop_nulls(subset=cs.integer()).collect()
shape: (2, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ ham  │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ i64 ┆ str  │
╞═════╪═════╪══════╡
│ 1   ┆ 6   ┆ a    │
│ 3   ┆ 8   ┆ null │
└─────┴─────┴──────┘
This method drops a row if any single value of the row is null.
Below are some example snippets that show how you could drop null values based on other conditions:
>>> lf = pl.LazyFrame(
...     {
...         "a": [None, None, None, None],
...         "b": [1, 2, None, 1],
...         "c": [1, None, None, 1],
...     }
... )
>>> lf.collect()
shape: (4, 3)
┌──────┬──────┬──────┐
│ a    ┆ b    ┆ c    │
│ ---  ┆ ---  ┆ ---  │
│ f32  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╡
│ null ┆ 1    ┆ 1    │
│ null ┆ 2    ┆ null │
│ null ┆ null ┆ null │
│ null ┆ 1    ┆ 1    │
└──────┴──────┴──────┘
Drop a row only if all values are null:
>>> lf.filter(~pl.all_horizontal(pl.all().is_null())).collect()
shape: (3, 3)
┌──────┬─────┬──────┐
│ a    ┆ b   ┆ c    │
│ ---  ┆ --- ┆ ---  │
│ f32  ┆ i64 ┆ i64  │
╞══════╪═════╪══════╡
│ null ┆ 1   ┆ 1    │
│ null ┆ 2   ┆ null │
│ null ┆ 1   ┆ 1    │
└──────┴─────┴──────┘
- property dtypes: list[PolarsDataType]
 Get dtypes of columns in LazyFrame.
See also
schema: Returns a {colname:dtype} mapping.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.dtypes
[Int64, Float64, Utf8]
- explain(
 - *,
 - optimized: bool = True,
 - type_coercion: bool = True,
 - predicate_pushdown: bool = True,
 - projection_pushdown: bool = True,
 - simplify_expression: bool = True,
 - slice_pushdown: bool = True,
 - comm_subplan_elim: bool = True,
 - comm_subexpr_elim: bool = True,
 - streaming: bool = False,
 Create a string representation of the query plan.
Different optimizations can be turned on or off.
- Parameters:
 - optimized
 Return an optimized query plan. Defaults to True. If this is set to True the subsequent optimization flags control which optimizations run.
- type_coercion
 Do type coercion optimization.
- predicate_pushdown
 Do predicate pushdown optimization.
- projection_pushdown
 Do projection pushdown optimization.
- simplify_expression
 Run simplify expressions optimization.
- slice_pushdown
 Slice pushdown optimization.
- comm_subplan_elim
 Will try to cache branching subplans that occur on self-joins or unions.
- comm_subexpr_elim
 Common subexpressions will be cached and reused.
- streaming
 Run parts of the query in a streaming fashion (this is in an alpha state)
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a", maintain_order=True).agg(pl.all().sum()).sort(
...     "a"
... ).explain()
- explode(columns, *more_columns) → Self
 Explode the DataFrame to long format by exploding the given columns.
- Parameters:
 - columns
 Column names, expressions, or a selector defining them. The underlying columns being exploded must be of List or Utf8 datatype.
- *more_columns
 Additional names of columns to explode, specified as positional arguments.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "letters": ["a", "a", "b", "c"],
...         "numbers": [[1], [2, 3], [4, 5], [6, 7, 8]],
...     }
... )
>>> lf.explode("numbers").collect()
shape: (8, 2)
┌─────────┬─────────┐
│ letters ┆ numbers │
│ ---     ┆ ---     │
│ str     ┆ i64     │
╞═════════╪═════════╡
│ a       ┆ 1       │
│ a       ┆ 2       │
│ a       ┆ 3       │
│ b       ┆ 4       │
│ b       ┆ 5       │
│ c       ┆ 6       │
│ c       ┆ 7       │
│ c       ┆ 8       │
└─────────┴─────────┘
- fetch(
 - n_rows: int = 500,
 - *,
 - type_coercion: bool = True,
 - predicate_pushdown: bool = True,
 - projection_pushdown: bool = True,
 - simplify_expression: bool = True,
 - no_optimization: bool = False,
 - slice_pushdown: bool = True,
 - comm_subplan_elim: bool = True,
 - comm_subexpr_elim: bool = True,
 - streaming: bool = False,
 Collect a small number of rows for debugging purposes.
- Parameters:
 - n_rows
 Collect n_rows from the data sources.
- type_coercion
 Run type coercion optimization.
- predicate_pushdown
 Run predicate pushdown optimization.
- projection_pushdown
 Run projection pushdown optimization.
- simplify_expression
 Run simplify expressions optimization.
- no_optimization
 Turn off optimizations.
- slice_pushdown
 Slice pushdown optimization
- comm_subplan_elim
 Will try to cache branching subplans that occur on self-joins or unions.
- comm_subexpr_elim
 Common subexpressions will be cached and reused.
- streaming
 Run parts of the query in a streaming fashion (this is in an alpha state)
- Returns:
 - DataFrame
 
Warning
This is strictly a utility function that can help to debug queries using a smaller number of rows, and should not be used in production code.
Notes
This is similar to a collect() operation, but it overwrites the number of rows read by every scan operation. Be aware that fetch does not guarantee the final number of rows in the DataFrame. Filters, join operations and fewer rows being available in the scanned data will all influence the final number of rows (joins are especially susceptible to this, and may return no data at all if n_rows is too small as the join keys may not be present).
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a", maintain_order=True).agg(pl.all().sum()).fetch(2)
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 1   ┆ 6   │
│ b   ┆ 2   ┆ 5   │
└─────┴─────┴─────┘
- fill_nan(value: int | float | Expr | None) → Self
 Fill floating point NaN values.
- Parameters:
 - value
 Value to fill the NaN values with.
Warning
Note that floating point NaN (Not a Number) values are not missing values! To replace missing values, use fill_null() instead.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1.5, 2, float("nan"), 4],
...         "b": [0.5, 4, float("nan"), 13],
...     }
... )
>>> lf.fill_nan(99).collect()
shape: (4, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ f64  ┆ f64  │
╞══════╪══════╡
│ 1.5  ┆ 0.5  │
│ 2.0  ┆ 4.0  │
│ 99.0 ┆ 99.0 │
│ 4.0  ┆ 13.0 │
└──────┴──────┘
- fill_null(
 - value: Any | None = None,
 - strategy: FillNullStrategy | None = None,
 - limit: int | None = None,
 - *,
 - matches_supertype: bool = True,
 Fill null values using the specified value or strategy.
- Parameters:
 - value
 Value used to fill null values.
- strategy : {None, ‘forward’, ‘backward’, ‘min’, ‘max’, ‘mean’, ‘zero’, ‘one’}
 Strategy used to fill null values.
- limit
 Number of consecutive null values to fill when using the ‘forward’ or ‘backward’ strategy.
- matches_supertype
 Fill all matching supertypes of the fill value literal.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, None, 4],
...         "b": [0.5, 4, None, 13],
...     }
... )
>>> lf.fill_null(99).collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 99  ┆ 99.0 │
│ 4   ┆ 13.0 │
└─────┴──────┘
>>> lf.fill_null(strategy="forward").collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 2   ┆ 4.0  │
│ 4   ┆ 13.0 │
└─────┴──────┘
>>> lf.fill_null(strategy="max").collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 4   ┆ 13.0 │
│ 4   ┆ 13.0 │
└─────┴──────┘
>>> lf.fill_null(strategy="zero").collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 0   ┆ 0.0  │
│ 4   ┆ 13.0 │
└─────┴──────┘
- filter(
 - *predicates: IntoExprColumn | Iterable[IntoExprColumn] | bool | list[bool] | np.ndarray[Any, Any],
 - **constraints: Any,
 Filter the rows in the LazyFrame based on a predicate expression.
The original order of the remaining rows is preserved.
- Parameters:
 - predicates
 Expression that evaluates to a boolean Series.
- constraints
 Column filters. Use name=value to filter column name by the supplied value.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
Filter on one condition:
>>> lf.filter(pl.col("foo") > 1).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 2   ┆ 7   ┆ b   │
│ 3   ┆ 8   ┆ c   │
└─────┴─────┴─────┘
Filter on multiple conditions:
>>> lf.filter((pl.col("foo") < 3) & (pl.col("ham") == "a")).collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘
Provide multiple filters using *args syntax:
>>> lf.filter(
...     pl.col("foo") == 1,
...     pl.col("ham") == "a",
... ).collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘
Provide multiple filters using **kwargs syntax:
>>> lf.filter(foo=1, ham="a").collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘
Filter on an OR condition:
>>> lf.filter((pl.col("foo") == 1) | (pl.col("ham") == "c")).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
│ 3   ┆ 8   ┆ c   │
└─────┴─────┴─────┘
- first() → Self
 Get the first row of the DataFrame.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> lf.first().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
- classmethod from_json(json: str) → Self
 Read a logical plan from a JSON string to construct a LazyFrame.
Deprecated since version 0.18.12: This method is deprecated. Convert the JSON string to StringIO and then use LazyFrame.deserialize.
- Parameters:
 - json
 String in JSON format.
See also
- gather_every(n: int) → Self
 Take every nth row in the LazyFrame and return as a new LazyFrame.
- Parameters:
 - n
 Gather every n-th row.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [5, 6, 7, 8],
...     }
... )
>>> lf.gather_every(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 5   │
│ 3   ┆ 7   │
└─────┴─────┘
- group_by(
 - by: IntoExpr | Iterable[IntoExpr],
 - *more_by: IntoExpr,
 - maintain_order: bool = False,
 Start a group by operation.
- Parameters:
 - by
 Column(s) to group by. Accepts expression input. Strings are parsed as column names.
- *more_by
 Additional columns to group by, specified as positional arguments.
- maintain_order
 Ensure that the order of the groups is consistent with the input data. This is slower than a default group by. Setting this to True blocks the possibility to run on the streaming engine.
Examples
Group by one column and call agg to compute the grouped sum of another column.
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "c"],
...         "b": [1, 2, 1, 3, 3],
...         "c": [5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a").agg(pl.col("b").sum()).collect()
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 2   │
│ b   ┆ 5   │
│ c   ┆ 3   │
└─────┴─────┘
Set
maintain_order=Trueto ensure the order of the groups is consistent with the input.>>> lf.group_by("a", maintain_order=True).agg(pl.col("c")).collect() shape: (3, 2) ┌─────┬───────────┐ │ a ┆ c │ │ --- ┆ --- │ │ str ┆ list[i64] │ ╞═════╪═══════════╡ │ a ┆ [5, 3] │ │ b ┆ [4, 2] │ │ c ┆ [1] │ └─────┴───────────┘
Group by multiple columns by passing a list of column names.
>>> lf.group_by(["a", "b"]).agg(pl.max("c")).collect() shape: (4, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 1 ┆ 5 │ │ b ┆ 2 ┆ 4 │ │ b ┆ 3 ┆ 2 │ │ c ┆ 3 ┆ 1 │ └─────┴─────┴─────┘
Or use positional arguments to group by multiple columns in the same way. Expressions are also accepted.
>>> lf.group_by("a", pl.col("b") // 2).agg( ... pl.col("c").mean() ... ).collect() shape: (3, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ f64 │ ╞═════╪═════╪═════╡ │ a ┆ 0 ┆ 4.0 │ │ b ┆ 1 ┆ 3.0 │ │ c ┆ 1 ┆ 1.0 │ └─────┴─────┴─────┘
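As a rough mental model of what maintain_order=True guarantees, the plain-Python sketch below sums one column per key and keeps the groups in order of first appearance. It only illustrates the ordering semantics and is not how Polars implements group_by; the helper name grouped_sum is hypothetical.

```python
def grouped_sum(keys, values):
    """Sum `values` per key, keeping groups in order of first appearance."""
    totals = {}  # dicts preserve insertion order (Python 3.7+)
    for key, value in zip(keys, values):
        totals[key] = totals.get(key, 0) + value
    return list(totals.items())


# Columns "a" and "b" from the group_by examples.
result = grouped_sum(["a", "b", "a", "b", "c"], [1, 2, 1, 3, 3])
print(result)
```

Without maintain_order=True, the same groups and sums are produced, but the order in which the groups appear should not be relied upon.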
- group_by_dynamic(
 - index_column: IntoExpr,
 - *,
 - every: str | timedelta,
 - period: str | timedelta | None = None,
 - offset: str | timedelta | None = None,
 - truncate: bool | None = None,
 - include_boundaries: bool = False,
 - closed: ClosedInterval = 'left',
 - label: Label = 'left',
 - by: IntoExpr | Iterable[IntoExpr] | None = None,
 - start_by: StartBy = 'window',
 - check_sorted: bool = True,
 Group based on a time value (or index value of type Int32, Int64).
Time windows are calculated and rows are assigned to windows. Unlike a normal group by, a row can be a member of multiple groups. By default, the windows look like:
[start, start + period)
[start + every, start + every + period)
[start + 2*every, start + 2*every + period)
…
where start is determined by start_by, offset, and every (see parameter descriptions below).
Warning
The index column must be sorted in ascending order. If by is passed, then the index column must be sorted in ascending order within each group.
- Parameters:
 - index_column
 Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if by is specified, then it must be sorted in ascending order within each group).
In case of a dynamic group by on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.
- every
 interval of the window
- period
 length of the window, if None it will equal ‘every’
- offset
 offset of the window, only takes effect if start_by is 'window'. Defaults to negative every.
- truncate
 truncate the time value to the window lower bound
Deprecated since version 0.19.4: Use label instead.
- include_boundaries
 Add the lower and upper bound of the window to the “_lower_boundary” and “_upper_boundary” columns. This will impact performance because it’s harder to parallelize.
- closed{‘left’, ‘right’, ‘both’, ‘none’}
 Define which sides of the temporal interval are closed (inclusive).
- label{‘left’, ‘right’, ‘datapoint’}
 Define which label to use for the window:
‘left’: lower boundary of the window
‘right’: upper boundary of the window
‘datapoint’: the first value of the index column in the given window. If you don’t need the label to be at one of the boundaries, choose this option for maximum performance
- by
 Also group by this column/these columns
- start_by{‘window’, ‘datapoint’, ‘monday’, ‘tuesday’, ‘wednesday’, ‘thursday’, ‘friday’, ‘saturday’, ‘sunday’}
 The strategy to determine the start of the first window by.
‘window’: Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.
‘datapoint’: Start from the first encountered data point.
a day of the week (only takes effect if every contains 'w'):
‘monday’: Start the window on the Monday before the first data point.
‘tuesday’: Start the window on the Tuesday before the first data point.
…
‘sunday’: Start the window on the Sunday before the first data point.
- check_sorted
 When the by argument is given, polars can not check sortedness by the metadata and has to do a full scan on the index column to verify data is sorted. This is expensive. If you are sure the data within the by groups is sorted, you can set this to False. Doing so incorrectly will lead to incorrect output.
- Returns:
 - LazyGroupBy
 Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if by columns are passed, it will only be sorted within each by group).
Notes
If you’re coming from pandas, then
# polars
df.group_by_dynamic("ts", every="1d").agg(pl.col("value").sum())
is equivalent to
# pandas
df.set_index("ts").resample("D")["value"].sum().reset_index()
though note that, unlike pandas, polars doesn’t add extra rows for empty windows. If you need index_column to be evenly spaced, then please combine with DataFrame.upsample().
The every, period and offset arguments are created with the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds
By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.
In case of a group_by_dynamic on an integer column, the windows are defined by:
“1i” # length 1
“10i” # length 10
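To make the string language above concrete, here is a small plain-Python tokenizer that splits a duration string into (count, unit) pairs. It is only an illustration of the notation: the function name split_duration is hypothetical, it is not Polars' parser, and it does not attempt to turn calendar units into fixed lengths.

```python
import re

# Two-letter units are listed first so that "1mo" and "1ms" are not read as "1m".
_TOKEN = re.compile(r"(\d+)(ns|us|ms|mo|s|m|h|d|w|q|y|i)")


def split_duration(text):
    """Split e.g. "3d12h4m25s" into [(3, "d"), (12, "h"), (4, "m"), (25, "s")]."""
    parts = []
    pos = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse {text!r} at position {pos}")
        parts.append((int(match.group(1)), match.group(2)))
        pos = match.end()
    return parts


combined = split_duration("3d12h4m25s")
print(combined)
print(split_duration("1mo"), split_duration("10i"))
```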
Examples
>>> from datetime import datetime
>>> lf = pl.LazyFrame(
...     {
...         "time": pl.datetime_range(
...             start=datetime(2021, 12, 16),
...             end=datetime(2021, 12, 16, 3),
...             interval="30m",
...             eager=True,
...         ),
...         "n": range(7),
...     }
... )
>>> lf.collect()
shape: (7, 2)
┌─────────────────────┬─────┐
│ time                ┆ n   │
│ ---                 ┆ --- │
│ datetime[μs]        ┆ i64 │
╞═════════════════════╪═════╡
│ 2021-12-16 00:00:00 ┆ 0   │
│ 2021-12-16 00:30:00 ┆ 1   │
│ 2021-12-16 01:00:00 ┆ 2   │
│ 2021-12-16 01:30:00 ┆ 3   │
│ 2021-12-16 02:00:00 ┆ 4   │
│ 2021-12-16 02:30:00 ┆ 5   │
│ 2021-12-16 03:00:00 ┆ 6   │
└─────────────────────┴─────┘
Group by windows of 1 hour starting at 2021-12-16 00:00:00.
>>> lf.group_by_dynamic("time", every="1h", closed="right").agg(
...     pl.col("n")
... ).collect()
shape: (4, 2)
┌─────────────────────┬───────────┐
│ time                ┆ n         │
│ ---                 ┆ ---       │
│ datetime[μs]        ┆ list[i64] │
╞═════════════════════╪═══════════╡
│ 2021-12-15 23:00:00 ┆ [0]       │
│ 2021-12-16 00:00:00 ┆ [1, 2]    │
│ 2021-12-16 01:00:00 ┆ [3, 4]    │
│ 2021-12-16 02:00:00 ┆ [5, 6]    │
└─────────────────────┴───────────┘
The window boundaries can also be added to the aggregation result.
>>> lf.group_by_dynamic(
...     "time", every="1h", include_boundaries=True, closed="right"
... ).agg(pl.col("n").mean()).collect()
shape: (4, 4)
┌─────────────────────┬─────────────────────┬─────────────────────┬─────┐
│ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ n   │
│ ---                 ┆ ---                 ┆ ---                 ┆ --- │
│ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ f64 │
╞═════════════════════╪═════════════════════╪═════════════════════╪═════╡
│ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 0.0 │
│ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 1.5 │
│ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 3.5 │
│ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 5.5 │
└─────────────────────┴─────────────────────┴─────────────────────┴─────┘
When closed="left", the window excludes the right end of the interval: [lower_bound, upper_bound)
>>> lf.group_by_dynamic("time", every="1h", closed="left").agg(
...     pl.col("n")
... ).collect()
shape: (4, 2)
┌─────────────────────┬───────────┐
│ time                ┆ n         │
│ ---                 ┆ ---       │
│ datetime[μs]        ┆ list[i64] │
╞═════════════════════╪═══════════╡
│ 2021-12-16 00:00:00 ┆ [0, 1]    │
│ 2021-12-16 01:00:00 ┆ [2, 3]    │
│ 2021-12-16 02:00:00 ┆ [4, 5]    │
│ 2021-12-16 03:00:00 ┆ [6]       │
└─────────────────────┴───────────┘
When closed="both", the time values at the window boundaries belong to 2 groups.
>>> lf.group_by_dynamic("time", every="1h", closed="both").agg(
...     pl.col("n")
... ).collect()
shape: (5, 2)
┌─────────────────────┬───────────┐
│ time                ┆ n         │
│ ---                 ┆ ---       │
│ datetime[μs]        ┆ list[i64] │
╞═════════════════════╪═══════════╡
│ 2021-12-15 23:00:00 ┆ [0]       │
│ 2021-12-16 00:00:00 ┆ [0, 1, 2] │
│ 2021-12-16 01:00:00 ┆ [2, 3, 4] │
│ 2021-12-16 02:00:00 ┆ [4, 5, 6] │
│ 2021-12-16 03:00:00 ┆ [6]       │
└─────────────────────┴───────────┘
Dynamic group bys can also be combined with grouping on normal keys.
>>> lf = lf.with_columns(groups=pl.Series(["a", "a", "a", "b", "b", "a", "a"]))
>>> lf.collect()
shape: (7, 3)
┌─────────────────────┬─────┬────────┐
│ time                ┆ n   ┆ groups │
│ ---                 ┆ --- ┆ ---    │
│ datetime[μs]        ┆ i64 ┆ str    │
╞═════════════════════╪═════╪════════╡
│ 2021-12-16 00:00:00 ┆ 0   ┆ a      │
│ 2021-12-16 00:30:00 ┆ 1   ┆ a      │
│ 2021-12-16 01:00:00 ┆ 2   ┆ a      │
│ 2021-12-16 01:30:00 ┆ 3   ┆ b      │
│ 2021-12-16 02:00:00 ┆ 4   ┆ b      │
│ 2021-12-16 02:30:00 ┆ 5   ┆ a      │
│ 2021-12-16 03:00:00 ┆ 6   ┆ a      │
└─────────────────────┴─────┴────────┘
>>> lf.group_by_dynamic(
...     "time",
...     every="1h",
...     closed="both",
...     by="groups",
...     include_boundaries=True,
... ).agg(pl.col("n")).collect()
shape: (7, 5)
┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────────┐
│ groups ┆ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ n         │
│ ---    ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---       │
│ str    ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ list[i64] │
╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪═══════════╡
│ a      ┆ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ [0]       │
│ a      ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ [0, 1, 2] │
│ a      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ [2]       │
│ a      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ [5, 6]    │
│ a      ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ [6]       │
│ b      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ [3, 4]    │
│ b      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ [4]       │
└────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────────┘
Dynamic group by on an index column.
>>> lf = pl.LazyFrame(
...     {
...         "idx": pl.int_range(0, 6, eager=True),
...         "A": ["A", "A", "B", "B", "B", "C"],
...     }
... )
>>> lf.group_by_dynamic(
...     "idx",
...     every="2i",
...     period="3i",
...     include_boundaries=True,
...     closed="right",
... ).agg(pl.col("A").alias("A_agg_list")).collect()
shape: (4, 4)
┌─────────────────┬─────────────────┬─────┬─────────────────┐
│ _lower_boundary ┆ _upper_boundary ┆ idx ┆ A_agg_list      │
│ ---             ┆ ---             ┆ --- ┆ ---             │
│ i64             ┆ i64             ┆ i64 ┆ list[str]       │
╞═════════════════╪═════════════════╪═════╪═════════════════╡
│ -2              ┆ 1               ┆ -2  ┆ ["A", "A"]      │
│ 0               ┆ 3               ┆ 0   ┆ ["A", "B", "B"] │
│ 2               ┆ 5               ┆ 2   ┆ ["B", "B", "C"] │
│ 4               ┆ 7               ┆ 4   ┆ ["C"]           │
└─────────────────┴─────────────────┴─────┴─────────────────┘
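The window definition used by group_by_dynamic (windows from start + k*every to start + k*every + period, with closed deciding which ends are inclusive) can be sketched in plain Python. This is a simplified model under stated assumptions, not Polars' implementation: the helper names are hypothetical, start is passed in explicitly rather than derived from start_by and offset, and empty windows are dropped, as the Notes describe.

```python
from datetime import datetime, timedelta


def in_window(t, lower, upper, closed):
    """Check membership of t in the interval, honouring the `closed` option."""
    left_ok = t >= lower if closed in ("left", "both") else t > lower
    right_ok = t <= upper if closed in ("right", "both") else t < upper
    return left_ok and right_ok


def dynamic_windows(times, values, start, every, period=None, closed="left"):
    """Return (lower_bound, members) for every non-empty window."""
    period = every if period is None else period
    result = []
    lower = start
    while lower <= max(times):
        members = [
            v for t, v in zip(times, values)
            if in_window(t, lower, lower + period, closed)
        ]
        if members:  # no rows are emitted for empty windows
            result.append((lower, members))
        lower += every
    return result


# Same data as the datetime examples: seven rows, 30 minutes apart.
times = [datetime(2021, 12, 16) + timedelta(minutes=30 * i) for i in range(7)]
values = list(range(7))
hour = timedelta(hours=1)

left = dynamic_windows(times, values, datetime(2021, 12, 16), hour, closed="left")
both = dynamic_windows(times, values, datetime(2021, 12, 15, 23), hour, closed="both")
print([members for _, members in left])
print([members for _, members in both])
```

The start values are taken from the first window shown in the corresponding example outputs. With closed="both", each boundary timestamp lands in two windows, which is why a row can belong to more than one group.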
- group_by_rolling(
 - index_column: IntoExpr,
 - *,
 - period: str | timedelta,
 - offset: str | timedelta | None = None,
 - closed: ClosedInterval = 'right',
 - by: IntoExpr | Iterable[IntoExpr] | None = None,
 - check_sorted: bool = True,
 Create rolling groups based on a time, Int32, or Int64 column.
Deprecated since version 0.19.9: This method has been renamed to LazyFrame.rolling().
- Parameters:
 - index_column
 Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if by is specified, then it must be sorted in ascending order within each group).
In case of a rolling group by on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.
- period
 length of the window - must be non-negative
- offset
 offset of the window. Default is -period
- closed{‘right’, ‘left’, ‘both’, ‘none’}
 Define which sides of the temporal interval are closed (inclusive).
- by
 Also group by this column/these columns
- check_sorted
 When the by argument is given, polars can not check sortedness by the metadata and has to do a full scan on the index column to verify data is sorted. This is expensive. If you are sure the data within the by groups is sorted, you can set this to False. Doing so incorrectly will lead to incorrect output.
- Returns:
 - LazyGroupBy
 Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if by columns are passed, it will only be sorted within each by group).
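A rolling group by builds one window per row rather than per fixed interval. The plain-Python sketch below models this on an integer index, assuming that the window for a row with index value t runs from t + offset to t + offset + period, so the defaults (offset = -period, closed='right') give (t - period, t]. The helper name rolling_groups is hypothetical and this is not Polars' implementation.

```python
def rolling_groups(index, values, period, offset=None, closed="right"):
    """Collect, for every row, the values whose index falls in that row's window."""
    offset = -period if offset is None else offset  # documented default
    groups = []
    for t in index:
        lower, upper = t + offset, t + offset + period
        members = []
        for i, v in zip(index, values):
            left_ok = i >= lower if closed in ("left", "both") else i > lower
            right_ok = i <= upper if closed in ("right", "both") else i < upper
            if left_ok and right_ok:
                members.append(v)
        groups.append(members)
    return groups


index = [1, 2, 3, 5, 8]  # must be sorted in ascending order
values = [10, 20, 30, 50, 80]
sums = [sum(group) for group in rolling_groups(index, values, period=3)]
print(sums)
```

Each output row corresponds to one input row, which is the main difference from group_by_dynamic, where the number of output rows depends on the number of non-empty windows.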
- groupby(
 - by: IntoExpr | Iterable[IntoExpr],
 - *more_by: IntoExpr,
 - maintain_order: bool = False,
 Start a group by operation.
Deprecated since version 0.19.0: This method has been renamed to LazyFrame.group_by().
- Parameters:
 - by
 Column(s) to group by. Accepts expression input. Strings are parsed as column names.
- *more_by
 Additional columns to group by, specified as positional arguments.
- maintain_order
 Ensure that the order of the groups is consistent with the input data. This is slower than a default group by. Setting this to True blocks the possibility to run on the streaming engine.
- groupby_dynamic(
 - index_column: IntoExpr,
 - *,
 - every: str | timedelta,
 - period: str | timedelta | None = None,
 - offset: str | timedelta | None = None,
 - truncate: bool = True,
 - include_boundaries: bool = False,
 - closed: ClosedInterval = 'left',
 - by: IntoExpr | Iterable[IntoExpr] | None = None,
 - start_by: StartBy = 'window',
 - check_sorted: bool = True,
 Group based on a time value (or index value of type Int32, Int64).
Deprecated since version 0.19.0: This method has been renamed to LazyFrame.group_by_dynamic().
- Parameters:
 - index_column
 Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if by is specified, then it must be sorted in ascending order within each group).
In case of a dynamic group by on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.
- every
 interval of the window
- period
 length of the window, if None it will equal ‘every’
- offset
 offset of the window, only takes effect if start_by is 'window'. Defaults to negative every.
- truncate
 truncate the time value to the window lower bound
- include_boundaries
 Add the lower and upper bound of the window to the “_lower_boundary” and “_upper_boundary” columns. This will impact performance because it’s harder to parallelize.
- closed{‘right’, ‘left’, ‘both’, ‘none’}
 Define which sides of the temporal interval are closed (inclusive).
- by
 Also group by this column/these columns
- start_by{‘window’, ‘datapoint’, ‘monday’, ‘tuesday’, ‘wednesday’, ‘thursday’, ‘friday’, ‘saturday’, ‘sunday’}
 The strategy to determine the start of the first window by.
‘window’: Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.
‘datapoint’: Start from the first encountered data point.
a day of the week (only takes effect if every contains 'w'):
‘monday’: Start the window on the Monday before the first data point.
‘tuesday’: Start the window on the Tuesday before the first data point.
…
‘sunday’: Start the window on the Sunday before the first data point.
- check_sorted
 When the by argument is given, polars can not check sortedness by the metadata and has to do a full scan on the index column to verify data is sorted. This is expensive. If you are sure the data within the by groups is sorted, you can set this to False. Doing so incorrectly will lead to incorrect output.
- Returns:
 - LazyGroupBy
 Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if by columns are passed, it will only be sorted within each by group).
- groupby_rolling(
 - index_column: IntoExpr,
 - *,
 - period: str | timedelta,
 - offset: str | timedelta | None = None,
 - closed: ClosedInterval = 'right',
 - by: IntoExpr | Iterable[IntoExpr] | None = None,
 - check_sorted: bool = True,
 Create rolling groups based on a time, Int32, or Int64 column.
Deprecated since version 0.19.0: This method has been renamed to LazyFrame.rolling().
- Parameters:
 - index_column
 Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if by is specified, then it must be sorted in ascending order within each group).
In case of a rolling group by on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.
- period
 length of the window - must be non-negative
- offset
 offset of the window. Default is -period
- closed{‘right’, ‘left’, ‘both’, ‘none’}
 Define which sides of the temporal interval are closed (inclusive).
- by
 Also group by this column/these columns
- check_sorted
 When the by argument is given, polars can not check sortedness by the metadata and has to do a full scan on the index column to verify data is sorted. This is expensive. If you are sure the data within the by groups is sorted, you can set this to False. Doing so incorrectly will lead to incorrect output.
- Returns:
 - LazyGroupBy
 Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if by columns are passed, it will only be sorted within each by group).
- head(n: int = 5) Self[source]
 Get the first n rows.
- Parameters:
 - n
 Number of rows to return.
Notes
Consider using the fetch() operation if you only want to test your query. The fetch() operation will load the first n rows at the scan level, whereas the head()/limit() are applied at the end.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [7, 8, 9, 10, 11, 12],
...     }
... )
>>> lf.head().collect()
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
│ 3   ┆ 9   │
│ 4   ┆ 10  │
│ 5   ┆ 11  │
└─────┴─────┘
>>> lf.head(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
└─────┴─────┘
- inspect(fmt: str = '{}') Self[source]
 Inspect a node in the computation graph.
Print the value that this node in the computation graph evaluates to, then pass the value on unchanged.
Examples
>>> lf = pl.LazyFrame({"foo": [1, 1, -2, 3]})
>>> (
...     lf.with_columns(pl.col("foo").cum_sum().alias("bar"))
...     .inspect()  # print the node before the filter
...     .filter(pl.col("bar") == pl.col("foo"))
... )
<LazyFrame [2 cols, {"foo": Int64, "bar": Int64}] at ...>
- interpolate() Self[source]
 Interpolate intermediate values. The interpolation method is linear.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, None, 9, 10],
...         "bar": [6, 7, 9, None],
...         "baz": [1, None, None, 9],
...     }
... )
>>> lf.interpolate().collect()
shape: (4, 3)
┌──────┬──────┬──────────┐
│ foo  ┆ bar  ┆ baz      │
│ ---  ┆ ---  ┆ ---      │
│ f64  ┆ f64  ┆ f64      │
╞══════╪══════╪══════════╡
│ 1.0  ┆ 6.0  ┆ 1.0      │
│ 5.0  ┆ 7.0  ┆ 3.666667 │
│ 9.0  ┆ 9.0  ┆ 6.333333 │
│ 10.0 ┆ null ┆ 9.0      │
└──────┴──────┴──────────┘
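The values in the output above follow from ordinary linear interpolation between the nearest known neighbours, treating rows as equally spaced. A plain-Python sketch of that arithmetic, with the hypothetical helper interpolate_linear standing in for what happens per column (it is not Polars' implementation):

```python
def interpolate_linear(values):
    """Fill None gaps that have a known value on both sides; leave other gaps as None."""
    out = [None if v is None else float(v) for v in values]
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out


baz = interpolate_linear([1, None, None, 9])
bar = interpolate_linear([6, 7, 9, None])
print([round(v, 6) for v in baz])
print(bar)
```

For baz, the gap between 1 and 9 spans three steps, so each step adds 8/3. The trailing null in bar has no right-hand neighbour to interpolate towards and therefore stays null, as in the example output.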
- join(
 - other: LazyFrame,
 - on: str | Expr | Sequence[str | Expr] | None = None,
 - how: JoinStrategy = 'inner',
 - *,
 - left_on: str | Expr | Sequence[str | Expr] | None = None,
 - right_on: str | Expr | Sequence[str | Expr] | None = None,
 - suffix: str = '_right',
 - validate: JoinValidation = 'm:m',
 - allow_parallel: bool = True,
 - force_parallel: bool = False,
 Add a join operation to the Logical Plan.
- Parameters:
 - other
 Lazy DataFrame to join with.
- on
 Join column of both DataFrames. If set, left_on and right_on should be None.
- how{‘inner’, ‘left’, ‘outer’, ‘semi’, ‘anti’, ‘cross’}
 Join strategy.
Note
A left join preserves the row order of the left DataFrame.
- left_on
 Join column of the left DataFrame.
- right_on
 Join column of the right DataFrame.
- suffix
 Suffix to append to columns with a duplicate name.
- validate: {‘m:m’, ‘m:1’, ‘1:m’, ‘1:1’}
 Checks if join is of specified type.
- many_to_many
 “m:m”: default, does not result in checks
- one_to_one
 “1:1”: check if join keys are unique in both left and right datasets
- one_to_many
 “1:m”: check if join keys are unique in left dataset
- many_to_one
 “m:1”: check if join keys are unique in right dataset
Note
This is currently not supported by the streaming engine.
This is only supported when joined by single columns.
- allow_parallel
 Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.
- force_parallel
 Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> other_lf = pl.LazyFrame(
...     {
...         "apple": ["x", "y", "z"],
...         "ham": ["a", "b", "d"],
...     }
... )
>>> lf.join(other_lf, on="ham").collect()
shape: (2, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     │
│ 2   ┆ 7.0 ┆ b   ┆ y     │
└─────┴─────┴─────┴───────┘
>>> lf.join(other_lf, on="ham", how="outer").collect()
shape: (4, 4)
┌──────┬──────┬─────┬───────┐
│ foo  ┆ bar  ┆ ham ┆ apple │
│ ---  ┆ ---  ┆ --- ┆ ---   │
│ i64  ┆ f64  ┆ str ┆ str   │
╞══════╪══════╪═════╪═══════╡
│ 1    ┆ 6.0  ┆ a   ┆ x     │
│ 2    ┆ 7.0  ┆ b   ┆ y     │
│ null ┆ null ┆ d   ┆ z     │
│ 3    ┆ 8.0  ┆ c   ┆ null  │
└──────┴──────┴─────┴───────┘
>>> lf.join(other_lf, on="ham", how="left").collect()
shape: (3, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     │
│ 2   ┆ 7.0 ┆ b   ┆ y     │
│ 3   ┆ 8.0 ┆ c   ┆ null  │
└─────┴─────┴─────┴───────┘
>>> lf.join(other_lf, on="ham", how="semi").collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6.0 ┆ a   │
│ 2   ┆ 7.0 ┆ b   │
└─────┴─────┴─────┘
>>> lf.join(other_lf, on="ham", how="anti").collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8.0 ┆ c   │
└─────┴─────┴─────┘
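The examples above do not exercise validate. The uniqueness checks it describes can be modelled in plain Python as follows; the function name validate_join_keys and the error messages are hypothetical, and this sketch only mirrors the rules listed under the validate parameter rather than Polars' own checks.

```python
def validate_join_keys(left_keys, right_keys, validate="m:m"):
    """Raise ValueError if the join keys violate the requested relationship."""

    def is_unique(keys):
        return len(set(keys)) == len(keys)

    if validate not in ("m:m", "1:1", "1:m", "m:1"):
        raise ValueError(f"unknown validate option: {validate!r}")
    # "1:1" and "1:m" require unique keys on the left side.
    if validate in ("1:1", "1:m") and not is_unique(left_keys):
        raise ValueError("join keys are not unique in the left dataset")
    # "1:1" and "m:1" require unique keys on the right side.
    if validate in ("1:1", "m:1") and not is_unique(right_keys):
        raise ValueError("join keys are not unique in the right dataset")


left = ["a", "b", "c"]
right = ["a", "b", "b"]  # "b" appears twice

validate_join_keys(left, right, "m:m")  # default: nothing is checked
validate_join_keys(left, right, "1:m")  # passes: left keys are unique
try:
    validate_join_keys(left, right, "m:1")
    outcome = "ok"
except ValueError as exc:
    outcome = str(exc)
print(outcome)
```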
- join_asof(
 - other: LazyFrame,
 - *,
 - left_on: str | None | Expr = None,
 - right_on: str | None | Expr = None,
 - on: str | None | Expr = None,
 - by_left: str | Sequence[str] | None = None,
 - by_right: str | Sequence[str] | None = None,
 - by: str | Sequence[str] | None = None,
 - strategy: AsofJoinStrategy = 'backward',
 - suffix: str = '_right',
 - tolerance: str | int | float | timedelta | None = None,
 - allow_parallel: bool = True,
 - force_parallel: bool = False,
 Perform an asof join.
This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the join_asof key.
For each row in the left DataFrame:
A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.
A “nearest” search selects the last row in the right DataFrame whose value is nearest to the left’s key. String keys are not currently supported for a nearest search.
The default is “backward”.
- Parameters:
 - other
 Lazy DataFrame to join with.
- left_on
 Join column of the left DataFrame.
- right_on
 Join column of the right DataFrame.
- on
 Join column of both DataFrames. If set, left_on and right_on should be None.
- by
 Join on these columns before doing asof join.
- by_left
 Join on these columns before doing asof join.
- by_right
 Join on these columns before doing asof join.
- strategy{‘backward’, ‘forward’, ‘nearest’}
 Join strategy.
- suffix
 Suffix to append to columns with a duplicate name.
- tolerance
 Numeric tolerance. By setting this the join will only be done if the near keys are within this distance. If an asof join is done on columns of dtype “Date”, “Datetime”, “Duration” or “Time”, use either a datetime.timedelta object or the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds
By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.
- allow_parallel
 Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.
- force_parallel
 Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.
Examples
>>> from datetime import datetime
>>> gdp = pl.LazyFrame(
...     {
...         "date": [
...             datetime(2016, 1, 1),
...             datetime(2017, 1, 1),
...             datetime(2018, 1, 1),
...             datetime(2019, 1, 1),
...         ],  # note record date: Jan 1st (sorted!)
...         "gdp": [4164, 4411, 4566, 4696],
...     }
... ).set_sorted("date")
>>> population = pl.LazyFrame(
...     {
...         "date": [
...             datetime(2016, 5, 12),
...             datetime(2017, 5, 12),
...             datetime(2018, 5, 12),
...             datetime(2019, 5, 12),
...         ],  # note record date: May 12th (sorted!)
...         "population": [82.19, 82.66, 83.12, 83.52],
...     }
... ).set_sorted("date")
>>> population.join_asof(gdp, on="date", strategy="backward").collect()
shape: (4, 3)
┌─────────────────────┬────────────┬──────┐
│ date                ┆ population ┆ gdp  │
│ ---                 ┆ ---        ┆ ---  │
│ datetime[μs]        ┆ f64        ┆ i64  │
╞═════════════════════╪════════════╪══════╡
│ 2016-05-12 00:00:00 ┆ 82.19      ┆ 4164 │
│ 2017-05-12 00:00:00 ┆ 82.66      ┆ 4411 │
│ 2018-05-12 00:00:00 ┆ 83.12      ┆ 4566 │
│ 2019-05-12 00:00:00 ┆ 83.52      ┆ 4696 │
└─────────────────────┴────────────┴──────┘
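The "backward" and "forward" searches described for join_asof, together with tolerance, can be modelled with a binary search over the sorted right-hand keys. The sketch below uses the standard-library bisect module on the same dates as the example; asof_lookup is a hypothetical helper, the "nearest" strategy is left out, and this is not Polars' implementation.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta


def asof_lookup(left_keys, right_keys, right_values, strategy="backward", tolerance=None):
    """For each left key, pick the matching right value or None (right_keys must be sorted)."""
    matched = []
    for key in left_keys:
        if strategy == "backward":
            i = bisect_right(right_keys, key) - 1  # last right key <= key
        elif strategy == "forward":
            i = bisect_left(right_keys, key)  # first right key >= key
        else:
            raise ValueError(f"unsupported strategy: {strategy!r}")
        found = 0 <= i < len(right_keys)
        if found and tolerance is not None and abs(right_keys[i] - key) > tolerance:
            found = False  # nearest candidate is too far away
        matched.append(right_values[i] if found else None)
    return matched


gdp_dates = [datetime(year, 1, 1) for year in (2016, 2017, 2018, 2019)]
gdp = [4164, 4411, 4566, 4696]
population_dates = [datetime(year, 5, 12) for year in (2016, 2017, 2018, 2019)]

backward = asof_lookup(population_dates, gdp_dates, gdp, "backward")
forward = asof_lookup(population_dates, gdp_dates, gdp, "forward")
strict = asof_lookup(population_dates, gdp_dates, gdp, tolerance=timedelta(days=100))
print(backward)
print(forward)
print(strict)
```

The backward result reproduces the gdp column of the example. The forward search finds no later GDP record for the 2019 row, and with a 100-day tolerance every candidate (roughly 130 days earlier) is rejected, so those rows come back as None, which corresponds to null in the joined frame.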
- last() Self[source]
 Get the last row of the DataFrame.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> lf.last().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 5   ┆ 6   │
└─────┴─────┘
- lazy() Self[source]
 Return lazy representation, i.e. itself.
Useful for writing code that expects either a DataFrame or LazyFrame.
- Returns:
 - LazyFrame
 
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [None, 2, 3, 4],
...         "b": [0.5, None, 2.5, 13],
...         "c": [True, True, False, None],
...     }
... )
>>> lf.lazy()
<LazyFrame [3 cols, {"a": Int64 … "c": Boolean}] at ...>
- limit(n: int = 5) Self[source]
 Get the first n rows.
Alias for LazyFrame.head().
- Parameters:
 - n
 Number of rows to return.
Notes
Consider using the fetch() operation if you only want to test your query. The fetch() operation will load the first n rows at the scan level, whereas the head()/limit() are applied at the end.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [7, 8, 9, 10, 11, 12],
...     }
... )
>>> lf.limit().collect()
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
│ 3   ┆ 9   │
│ 4   ┆ 10  │
│ 5   ┆ 11  │
└─────┴─────┘
>>> lf.limit(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
└─────┴─────┘
- map(
 - function: Callable[[DataFrame], DataFrame],
 - *,
 - predicate_pushdown: bool = True,
 - projection_pushdown: bool = True,
 - slice_pushdown: bool = True,
 - no_optimizations: bool = False,
 - schema: None | SchemaDict = None,
 - validate_output_schema: bool = True,
 - streamable: bool = False,
 Apply a custom function.
Deprecated since version 0.19.0: This method has been renamed to LazyFrame.map_batches().
- Parameters:
 - function
 Lambda/function to apply.
- predicate_pushdown
 Allow predicate pushdown optimization to pass this node.
- projection_pushdown
 Allow projection pushdown optimization to pass this node.
- slice_pushdown
 Allow slice pushdown optimization to pass this node.
- no_optimizations
 Turn off all optimizations past this point.
- schema
 Output schema of the function, if set to None we assume that the schema will remain unchanged by the applied function.
- validate_output_schema
 It is paramount that polars’ schema is correct. This flag will ensure that the output schema of this function will be checked with the expected schema. Setting this to False will not do this check, but may lead to hard-to-debug bugs.
- streamable
 Whether the given function is eligible to run with the streaming engine. That means that the function must produce the same result when it is executed in batches as when it is executed on the full dataset.
- map_batches(
 - function: Callable[[DataFrame], DataFrame],
 - *,
 - predicate_pushdown: bool = True,
 - projection_pushdown: bool = True,
 - slice_pushdown: bool = True,
 - no_optimizations: bool = False,
 - schema: None | SchemaDict = None,
 - validate_output_schema: bool = True,
 - streamable: bool = False,
 Apply a custom function.
It is important that the function returns a Polars DataFrame.
- Parameters:
 - function
 Lambda/function to apply.
- predicate_pushdown
 Allow predicate pushdown optimization to pass this node.
- projection_pushdown
 Allow projection pushdown optimization to pass this node.
- slice_pushdown
 Allow slice pushdown optimization to pass this node.
- no_optimizations
 Turn off all optimizations past this point.
- schema
 Output schema of the function, if set to None we assume that the schema will remain unchanged by the applied function.
- validate_output_schema
 It is paramount that polars’ schema is correct. This flag will ensure that the output schema of this function will be checked with the expected schema. Setting this to False will not do this check, but may lead to hard-to-debug bugs.
- streamable
 Whether the given function is eligible to run with the streaming engine. That means that the function must produce the same result when it is executed in batches as when it is executed on the full dataset.
Warning
The schema of a LazyFrame must always be correct. It is up to the caller of this function to ensure that this invariant is upheld.
It is important that the optimization flags are correct. If the custom function for instance does an aggregation of a column, predicate_pushdown should not be allowed, as this prunes rows and will influence your aggregation results.
Examples
>>> (
...     pl.LazyFrame(
...         {
...             "a": pl.int_range(-100_000, 0, eager=True),
...             "b": pl.int_range(0, 100_000, eager=True),
...         }
...     )
...     .map_batches(lambda x: 2 * x, streamable=True)
...     .collect(streaming=True)
... )
shape: (100_000, 2)
┌─────────┬────────┐
│ a       ┆ b      │
│ ---     ┆ ---    │
│ i64     ┆ i64    │
╞═════════╪════════╡
│ -200000 ┆ 0      │
│ -199998 ┆ 2      │
│ -199996 ┆ 4      │
│ -199994 ┆ 6      │
│ …       ┆ …      │
│ -8      ┆ 199992 │
│ -6      ┆ 199994 │
│ -4      ┆ 199996 │
│ -2      ┆ 199998 │
└─────────┴────────┘
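Why the optimization and streamable flags matter can be seen without Polars at all. In the plain-Python sketch below, batches are lists of row dicts and the helpers add_total, double and keep_large are hypothetical stand-ins for a custom function and a filter predicate. It shows that a function which aggregates a column gives different answers when a filter is moved in front of it, or when it is fed the data in batches.

```python
def add_total(batch):
    """Custom function that aggregates: attaches the sum of "a" to every row."""
    total = sum(row["a"] for row in batch)
    return [{**row, "total": total} for row in batch]


def double(batch):
    """Custom function that works strictly row by row."""
    return [{"a": 2 * row["a"]} for row in batch]


def keep_large(batch):
    """Stand-in for a filter predicate such as pl.col("a") > 1."""
    return [row for row in batch if row["a"] > 1]


rows = [{"a": 1}, {"a": 2}, {"a": 3}]

# Query as written: custom function first, filter afterwards.
as_written = keep_large(add_total(rows))
# Predicate pushed past the custom function: rows are pruned before the sum.
pushed_down = add_total(keep_large(rows))

# Streaming: the function sees one batch at a time instead of the full dataset.
batches = [rows[:2], rows[2:]]
double_batched = [row for batch in batches for row in double(batch)]
total_batched = [row for batch in batches for row in add_total(batch)]

print(as_written)
print(pushed_down)
print(double_batched == double(rows), total_batched == add_total(rows))
```

Here the row-wise double gives the same result in batches as on the full data, so it would qualify for streamable=True, whereas add_total does not, and it also needs predicate_pushdown=False to keep its totals intact.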
- max() Self[source]
 Aggregate the columns in the LazyFrame to their maximum value.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.max().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 4   ┆ 2   │
└─────┴─────┘
- mean() Self[source]
 Aggregate the columns in the LazyFrame to their mean value.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.mean().collect()
shape: (1, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ f64 ┆ f64  │
╞═════╪══════╡
│ 2.5 ┆ 1.25 │
└─────┴──────┘
- median() Self[source]
 Aggregate the columns in the LazyFrame to their median value.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.median().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 2.5 ┆ 1.0 │
└─────┴─────┘
- melt(
 - id_vars: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None,
 - value_vars: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None,
 - variable_name: str | None = None,
 - value_name: str | None = None,
 - *,
 - streamable: bool = True,
 Unpivot a DataFrame from wide to long format.
Optionally leaves identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars) while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis leaving just two non-identifier columns, ‘variable’ and ‘value’.
- Parameters:
 - id_vars
 Column(s) or selector(s) to use as identifier variables.
- value_vars
 Column(s) or selector(s) to use as values variables; if value_vars is empty all columns that are not in id_vars will be used.
- variable_name
 Name to give to the variable column. Defaults to “variable”
- value_name
 Name to give to the value column. Defaults to “value”
- streamable
 Allow this node to run in the streaming engine. If this runs in streaming, the output of the melt operation will not have a stable ordering.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": ["x", "y", "z"], ... "b": [1, 3, 5], ... "c": [2, 4, 6], ... } ... ) >>> import polars.selectors as cs >>> lf.melt(id_vars="a", value_vars=cs.numeric()).collect() shape: (6, 3) ┌─────┬──────────┬───────┐ │ a ┆ variable ┆ value │ │ --- ┆ --- ┆ --- │ │ str ┆ str ┆ i64 │ ╞═════╪══════════╪═══════╡ │ x ┆ b ┆ 1 │ │ y ┆ b ┆ 3 │ │ z ┆ b ┆ 5 │ │ x ┆ c ┆ 2 │ │ y ┆ c ┆ 4 │ │ z ┆ c ┆ 6 │ └─────┴──────────┴───────┘
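The wide-to-long reshaping can be pictured with a small plain-Python sketch over a dict of columns. This is an illustration of the semantics only, not how polars implements melt; melt_columns is a hypothetical helper, and the row order shown is the non-streaming order from the example above.

```python
def melt_columns(data, id_vars, value_vars=None,
                 variable_name="variable", value_name="value"):
    """Unpivot a dict of equal-length columns from wide to long format."""
    # An empty value_vars means: every column that is not an identifier.
    if not value_vars:
        value_vars = [col for col in data if col not in id_vars]
    n_rows = len(next(iter(data.values())))

    out = {col: [] for col in id_vars}
    out[variable_name] = []
    out[value_name] = []
    for var in value_vars:
        for i in range(n_rows):
            # Identifier values are repeated once per unpivoted column.
            for col in id_vars:
                out[col].append(data[col][i])
            out[variable_name].append(var)
            out[value_name].append(data[var][i])
    return out


wide = {"a": ["x", "y", "z"], "b": [1, 3, 5], "c": [2, 4, 6]}
long = melt_columns(wide, id_vars=["a"])
```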
- merge_sorted(other: LazyFrame, key: str) Self[source]
 Take two sorted DataFrames and merge them by the sorted key.
The output of this operation will also be sorted. It is the caller's responsibility to ensure that both frames are sorted by that key; otherwise the output will not make sense.
The schemas of both LazyFrames must be equal.
- Parameters:
 - other
 Other DataFrame that must be merged
- key
 Key that is sorted.
Examples
>>> df0 = pl.LazyFrame( ... {"name": ["steve", "elise", "bob"], "age": [42, 44, 18]} ... ).sort("age") >>> df0.collect() shape: (3, 2) ┌───────┬─────┐ │ name ┆ age │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═══════╪═════╡ │ bob ┆ 18 │ │ steve ┆ 42 │ │ elise ┆ 44 │ └───────┴─────┘ >>> df1 = pl.LazyFrame( ... {"name": ["anna", "megan", "steve", "thomas"], "age": [21, 33, 42, 20]} ... ).sort("age") >>> df1.collect() shape: (4, 2) ┌────────┬─────┐ │ name ┆ age │ │ --- ┆ --- │ │ str ┆ i64 │ ╞════════╪═════╡ │ thomas ┆ 20 │ │ anna ┆ 21 │ │ megan ┆ 33 │ │ steve ┆ 42 │ └────────┴─────┘ >>> df0.merge_sorted(df1, key="age").collect() shape: (7, 2) ┌────────┬─────┐ │ name ┆ age │ │ --- ┆ --- │ │ str ┆ i64 │ ╞════════╪═════╡ │ bob ┆ 18 │ │ thomas ┆ 20 │ │ anna ┆ 21 │ │ megan ┆ 33 │ │ steve ┆ 42 │ │ steve ┆ 42 │ │ elise ┆ 44 │ └────────┴─────┘
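The merge behaves like the classic merge step over two sorted sequences. As a rough analogy using only the standard library (heapq.merge stands in for the operation here; this is not polars code), the same data gives the same ordering, and unsorted input gives output that is not sorted:

```python
import heapq
from operator import itemgetter

by_age = itemgetter(1)

# Both inputs are sorted by the key before merging.
left = sorted([("steve", 42), ("elise", 44), ("bob", 18)], key=by_age)
right = sorted(
    [("anna", 21), ("megan", 33), ("steve", 42), ("thomas", 20)], key=by_age
)

# Rows are interleaved by key; on ties, rows from the first input come first.
merged = list(heapq.merge(left, right, key=by_age))

# If an input is not sorted, the merge does not repair it.
unsorted_result = list(heapq.merge([3, 1], [2]))
```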
- min() Self[source]
 Aggregate the columns in the LazyFrame to their minimum value.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [1, 2, 1, 1], ... } ... ) >>> lf.min().collect() shape: (1, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 1 │ └─────┴─────┘
- null_count() Self[source]
 Aggregate the columns in the LazyFrame as the sum of their null value count.
Examples
>>> lf = pl.LazyFrame( ... { ... "foo": [1, None, 3], ... "bar": [6, 7, None], ... "ham": ["a", "b", "c"], ... } ... ) >>> lf.null_count().collect() shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ u32 ┆ u32 ┆ u32 │ ╞═════╪═════╪═════╡ │ 1 ┆ 1 ┆ 0 │ └─────┴─────┴─────┘
- pipe(
 - function: Callable[Concatenate[LazyFrame, P], T],
 - *args: P.args,
 - **kwargs: P.kwargs,
 Offers a structured way to apply a sequence of user-defined functions (UDFs).
- Parameters:
 - function
 Callable; will receive the frame as the first parameter, followed by any given args/kwargs.
- *args
 Arguments to pass to the UDF.
- **kwargs
 Keyword arguments to pass to the UDF.
Examples
>>> def cast_str_to_int(data, col_name): ... return data.with_columns(pl.col(col_name).cast(pl.Int64)) ... >>> lf = pl.LazyFrame( ... { ... "a": [1, 2, 3, 4], ... "b": ["10", "20", "30", "40"], ... } ... ) >>> lf.pipe(cast_str_to_int, col_name="b").collect() shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 10 │ │ 2 ┆ 20 │ │ 3 ┆ 30 │ │ 4 ┆ 40 │ └─────┴─────┘
>>> lf = pl.LazyFrame( ... { ... "b": [1, 2], ... "a": [3, 4], ... } ... ) >>> lf.collect() shape: (2, 2) ┌─────┬─────┐ │ b ┆ a │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 3 │ │ 2 ┆ 4 │ └─────┴─────┘ >>> lf.pipe(lambda tdf: tdf.select(sorted(tdf.columns))).collect() shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 3 ┆ 1 │ │ 4 ┆ 2 │ └─────┴─────┘
- profile(
 - *,
 - type_coercion: bool = True,
 - predicate_pushdown: bool = True,
 - projection_pushdown: bool = True,
 - simplify_expression: bool = True,
 - no_optimization: bool = False,
 - slice_pushdown: bool = True,
 - comm_subplan_elim: bool = True,
 - comm_subexpr_elim: bool = True,
 - show_plot: bool = False,
 - truncate_nodes: int = 0,
 - figsize: tuple[int, int] = (18, 8),
 - streaming: bool = False,
 Profile a LazyFrame.
This will run the query and return a tuple containing the materialized DataFrame and a DataFrame that contains profiling information of each node that is executed.
The units of the timings are microseconds.
- Parameters:
 - type_coercion
 Do type coercion optimization.
- predicate_pushdown
 Do predicate pushdown optimization.
- projection_pushdown
 Do projection pushdown optimization.
- simplify_expression
 Run simplify expressions optimization.
- no_optimization
 Turn off (certain) optimizations.
- slice_pushdown
 Slice pushdown optimization.
- comm_subplan_elim
 Will try to cache branching subplans that occur on self-joins or unions.
- comm_subexpr_elim
 Common subexpressions will be cached and reused.
- show_plot
 Show a gantt chart of the profiling result
- truncate_nodes
 Truncate the label lengths in the gantt chart to this number of characters.
- figsize
 matplotlib figsize of the profiling plot
- streaming
 Run parts of the query in a streaming fashion (this is in an alpha state)
Examples
>>> lf = pl.LazyFrame( ... { ... "a": ["a", "b", "a", "b", "b", "c"], ... "b": [1, 2, 3, 4, 5, 6], ... "c": [6, 5, 4, 3, 2, 1], ... } ... ) >>> lf.group_by("a", maintain_order=True).agg(pl.all().sum()).sort( ... "a" ... ).profile() (shape: (3, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 4 ┆ 10 │ │ b ┆ 11 ┆ 10 │ │ c ┆ 6 ┆ 1 │ └─────┴─────┴─────┘, shape: (3, 3) ┌─────────────────────────┬───────┬──────┐ │ node ┆ start ┆ end │ │ --- ┆ --- ┆ --- │ │ str ┆ u64 ┆ u64 │ ╞═════════════════════════╪═══════╪══════╡ │ optimization ┆ 0 ┆ 5 │ │ group_by_partitioned(a) ┆ 5 ┆ 470 │ │ sort(a) ┆ 475 ┆ 1964 │ └─────────────────────────┴───────┴──────┘)
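Since start and end are offsets in microseconds, the time spent in each node is their difference. A minimal sketch in plain Python, using the timing rows from the example output above copied by hand (in practice they would come from the second DataFrame of the returned tuple):

```python
# (node, start, end) rows copied from the profiling output above, in microseconds.
profile_rows = [
    ("optimization", 0, 5),
    ("group_by_partitioned(a)", 5, 470),
    ("sort(a)", 475, 1964),
]

# Time spent in each node.
durations_us = {node: end - start for node, start, end in profile_rows}

# The node that took the longest, and the total wall time in milliseconds.
slowest_node = max(durations_us, key=durations_us.get)
total_ms = (
    max(end for _, _, end in profile_rows)
    - min(start for _, start, _ in profile_rows)
) / 1000
```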
- quantile(
 - quantile: float | Expr,
 - interpolation: RollingInterpolationMethod = 'nearest',
 Aggregate the columns in the LazyFrame to their quantile value.
- Parameters:
 - quantile
 Quantile between 0.0 and 1.0.
- interpolation{‘nearest’, ‘higher’, ‘lower’, ‘midpoint’, ‘linear’}
 Interpolation method.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [1, 2, 1, 1], ... } ... ) >>> lf.quantile(0.7).collect() shape: (1, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞═════╪═════╡ │ 3.0 ┆ 1.0 │ └─────┴─────┘
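The interpolation options differ only in how a fractional position between two sorted values is resolved. The sketch below is plain Python and rests on two assumptions that are not stated on this page: the position is computed as quantile * (n - 1), and ties for 'nearest' round up. The helper name quantile_of is hypothetical. Under those assumptions it reproduces the values in the example above.

```python
import math


def quantile_of(values, q, interpolation="nearest"):
    """Quantile of a list of numbers under an assumed q * (n - 1) position rule."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("quantile must be between 0.0 and 1.0")
    data = sorted(values)
    pos = q * (len(data) - 1)          # fractional index into the sorted data
    lo, hi = math.floor(pos), math.ceil(pos)

    if interpolation == "lower":
        return float(data[lo])
    if interpolation == "higher":
        return float(data[hi])
    if interpolation == "nearest":
        # Assumption: exact halves round up.
        return float(data[math.floor(pos + 0.5)])
    if interpolation == "midpoint":
        return (data[lo] + data[hi]) / 2
    if interpolation == "linear":
        return data[lo] + (pos - lo) * (data[hi] - data[lo])
    raise ValueError(f"unknown interpolation: {interpolation!r}")


a = [1, 2, 3, 4]
nearest_a = quantile_of(a, 0.7)              # position 2.1, nearest index 2
midpoint_a = quantile_of(a, 0.7, "midpoint")
```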
- classmethod read_json(source: str | Path | IOBase) Self[source]
 Read a logical plan from a JSON file to construct a LazyFrame.
Deprecated since version 0.18.12: This class method has been renamed to deserialize.
- Parameters:
 - source
 Path to a file or a file-like object (by file-like object, we refer to objects that have a read() method, such as a file handler (e.g. via builtin open function) or BytesIO).
- rename(mapping: dict[str, str]) Self[source]
 Rename column names.
- Parameters:
 - mapping
 Key value pairs that map from old name to new name.
Notes
If existing names are swapped (e.g. ‘A’ points to ‘B’ and ‘B’ points to ‘A’), polars will block projection and predicate pushdowns at this node.
Examples
>>> lf = pl.LazyFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> lf.rename({"foo": "apple"}).collect() shape: (3, 3) ┌───────┬─────┬─────┐ │ apple ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═══════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ └───────┴─────┴─────┘
- reverse() Self[source]
 Reverse the DataFrame.
Examples
>>> lf = pl.LazyFrame( ... { ... "key": ["a", "b", "c"], ... "val": [1, 2, 3], ... } ... ) >>> lf.reverse().collect() shape: (3, 2) ┌─────┬─────┐ │ key ┆ val │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ c ┆ 3 │ │ b ┆ 2 │ │ a ┆ 1 │ └─────┴─────┘
- rolling(
 - index_column: IntoExpr,
 - *,
 - period: str | timedelta,
 - offset: str | timedelta | None = None,
 - closed: ClosedInterval = 'right',
 - by: IntoExpr | Iterable[IntoExpr] | None = None,
 - check_sorted: bool = True,
 Create rolling groups based on a time, Int32, or Int64 column.
Different from a dynamic_group_by, the windows are now determined by the individual values and are not of constant intervals. For constant intervals use LazyFrame.group_by_dynamic().
If you have a time series <t_0, t_1, ..., t_n>, then by default the windows created will be
(t_0 - period, t_0]
(t_1 - period, t_1]
…
(t_n - period, t_n]
whereas if you pass a non-default offset, then the windows will be
(t_0 + offset, t_0 + offset + period]
(t_1 + offset, t_1 + offset + period]
…
(t_n + offset, t_n + offset + period]
The period and offset arguments are created either from a timedelta, or by using the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds
By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.
In case of a rolling operation on an integer column, the windows are defined by:
“1i” # length 1
“10i” # length 10
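To make the string language concrete, here is a small plain-Python sketch that converts a combined duration string into nanoseconds. It is not the polars parser: duration_to_ns is a hypothetical helper, it treats d and w as fixed 24-hour and 7-day spans (ignoring the calendar and daylight-saving behaviour described above), and it rejects the calendar units mo, q, y as well as the index unit i.

```python
import re

# Nanoseconds per unit, for units that can be treated as fixed-length here.
_FIXED_UNIT_NS = {
    "ns": 1,
    "us": 1_000,
    "ms": 1_000_000,
    "s": 1_000_000_000,
    "m": 60 * 1_000_000_000,
    "h": 3_600 * 1_000_000_000,
    "d": 86_400 * 1_000_000_000,       # assumption: exactly 24 hours
    "w": 7 * 86_400 * 1_000_000_000,   # assumption: exactly 7 * 24 hours
}

# Two-letter units are listed before their one-letter prefixes ("ms" before "m").
_UNITS = r"ns|us|ms|mo|s|m|h|d|w|q|y|i"
_TOKEN = re.compile(rf"(\d+)({_UNITS})")
_WHOLE = re.compile(rf"(?:\d+(?:{_UNITS}))+")


def duration_to_ns(text: str) -> int:
    """Convert a string such as '3d12h4m25s' into a number of nanoseconds."""
    if not _WHOLE.fullmatch(text):
        raise ValueError(f"not a valid duration string: {text!r}")
    total = 0
    for count, unit in _TOKEN.findall(text):
        if unit not in _FIXED_UNIT_NS:
            raise ValueError(f"unit {unit!r} has no fixed length in this sketch")
        total += int(count) * _FIXED_UNIT_NS[unit]
    return total


combined = duration_to_ns("3d12h4m25s")  # 3 days, 12 hours, 4 minutes, 25 seconds
```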
- Parameters:
 - index_column
 Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if by is specified, then it must be sorted in ascending order within each group).
In case of a rolling group by on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.
- period
 length of the window - must be non-negative
- offset
 offset of the window. Default is -period
- closed{‘right’, ‘left’, ‘both’, ‘none’}
 Define which sides of the temporal interval are closed (inclusive).
- by
 Also group by this column/these columns
- check_sorted
 When the by argument is given, polars can not check sortedness by the metadata and has to do a full scan on the index column to verify data is sorted. This is expensive. If you are sure the data within the by groups is sorted, you can set this to False. Doing so incorrectly will lead to incorrect output.
- Returns:
 - LazyGroupBy
 Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if by columns are passed, it will only be sorted within each by group).
Examples
>>> dates = [ ... "2020-01-01 13:45:48", ... "2020-01-01 16:42:13", ... "2020-01-01 16:45:09", ... "2020-01-02 18:12:48", ... "2020-01-03 19:45:32", ... "2020-01-08 23:16:43", ... ] >>> df = pl.LazyFrame({"dt": dates, "a": [3, 7, 5, 9, 2, 1]}).with_columns( ... pl.col("dt").str.strptime(pl.Datetime).set_sorted() ... ) >>> out = ( ... df.rolling(index_column="dt", period="2d") ... .agg( ... pl.sum("a").alias("sum_a"), ... pl.min("a").alias("min_a"), ... pl.max("a").alias("max_a"), ... ) ... .collect() ... ) >>> out shape: (6, 4) ┌─────────────────────┬───────┬───────┬───────┐ │ dt ┆ sum_a ┆ min_a ┆ max_a │ │ --- ┆ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │ ╞═════════════════════╪═══════╪═══════╪═══════╡ │ 2020-01-01 13:45:48 ┆ 3 ┆ 3 ┆ 3 │ │ 2020-01-01 16:42:13 ┆ 10 ┆ 3 ┆ 7 │ │ 2020-01-01 16:45:09 ┆ 15 ┆ 3 ┆ 7 │ │ 2020-01-02 18:12:48 ┆ 24 ┆ 3 ┆ 9 │ │ 2020-01-03 19:45:32 ┆ 11 ┆ 2 ┆ 9 │ │ 2020-01-08 23:16:43 ┆ 1 ┆ 1 ┆ 1 │ └─────────────────────┴───────┴───────┴───────┘
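The window definition can be checked by hand. The following plain-Python sketch (not polars code; rolling_sums is a hypothetical helper) builds the default window (t - period, t] for every timestamp, with the closed argument deciding which bounds are inclusive, and reproduces the sum_a column from the example above.

```python
from datetime import datetime, timedelta


def rolling_sums(times, values, period, closed="right"):
    """Sum the values falling in the window ending at each timestamp."""
    sums = []
    for t in times:
        lower, upper = t - period, t   # default offset is -period
        total = 0
        for u, v in zip(times, values):
            above = u >= lower if closed in ("left", "both") else u > lower
            below = u <= upper if closed in ("right", "both") else u < upper
            if above and below:
                total += v
        sums.append(total)
    return sums


dates = [
    "2020-01-01 13:45:48",
    "2020-01-01 16:42:13",
    "2020-01-01 16:45:09",
    "2020-01-02 18:12:48",
    "2020-01-03 19:45:32",
    "2020-01-08 23:16:43",
]
times = [datetime.fromisoformat(d) for d in dates]
sum_a = rolling_sums(times, [3, 7, 5, 9, 2, 1], period=timedelta(days=2))
```

The quadratic loop is only for readability; it says nothing about how polars evaluates the windows.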
- property schema: SchemaDict[source]
 Get a dict[column name, DataType].
Examples
>>> lf = pl.LazyFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> lf.schema OrderedDict({'foo': Int64, 'bar': Float64, 'ham': Utf8})
- select(*exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr) Self[source]
 Select columns from this LazyFrame.
- Parameters:
 - *exprs
 Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- **named_exprs
 Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.
Examples
Pass the name of a column to select that column.
>>> lf = pl.LazyFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> lf.select("foo").collect() shape: (3, 1) ┌─────┐ │ foo │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ │ 2 │ │ 3 │ └─────┘
Multiple columns can be selected by passing a list of column names.
>>> lf.select(["foo", "bar"]).collect() shape: (3, 2) ┌─────┬─────┐ │ foo ┆ bar │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 6 │ │ 2 ┆ 7 │ │ 3 ┆ 8 │ └─────┴─────┘
Multiple columns can also be selected using positional arguments instead of a list. Expressions are also accepted.
>>> lf.select(pl.col("foo"), pl.col("bar") + 1).collect() shape: (3, 2) ┌─────┬─────┐ │ foo ┆ bar │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 7 │ │ 2 ┆ 8 │ │ 3 ┆ 9 │ └─────┴─────┘
Use keyword arguments to easily name your expression inputs.
>>> lf.select( ... threshold=pl.when(pl.col("foo") > 2).then(10).otherwise(0) ... ).collect() shape: (3, 1) ┌───────────┐ │ threshold │ │ --- │ │ i32 │ ╞═══════════╡ │ 0 │ │ 0 │ │ 10 │ └───────────┘
Expressions with multiple outputs can be automatically instantiated as Structs by enabling the setting Config.set_auto_structify(True):
>>> with pl.Config(auto_structify=True): ... lf.select( ... is_odd=(pl.col(pl.INTEGER_DTYPES) % 2).name.suffix("_is_odd"), ... ).collect() ... shape: (3, 1) ┌───────────┐ │ is_odd │ │ --- │ │ struct[2] │ ╞═══════════╡ │ {1,0} │ │ {0,1} │ │ {1,0} │ └───────────┘
- select_seq(
 - *exprs: IntoExpr | Iterable[IntoExpr],
 - **named_exprs: IntoExpr,
 Select columns from this LazyFrame.
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.
- Parameters:
 - *exprs
 Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- **named_exprs
 Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.
- serialize(file: None = None) str[source]
 - serialize(file: IOBase | str | Path) None
 Serialize the logical plan of this LazyFrame to a file or string in JSON format.
- Parameters:
 - file
 File path to which the result should be written. If set to None (default), the output is returned as a string instead.
Examples
Serialize the logical plan into a JSON string.
>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum() >>> json = lf.serialize() >>> json '{"Projection":{"expr":[{"Agg":{"Sum":{"Column":"a"}}}],"input":{"DataFrameScan":{"df":{"columns":[{"name":"a","datatype":"Int64","bit_settings":"","values":[1,2,3]}]},"schema":{"inner":{"a":"Int64"}},"output_schema":null,"projection":null,"selection":null}},"schema":{"inner":{"a":"Int64"}},"options":{"run_parallel":true,"duplicate_check":true}}}'
The logical plan can later be deserialized back into a LazyFrame.
>>> import io >>> pl.LazyFrame.deserialize(io.StringIO(json)).collect() shape: (1, 1) ┌─────┐ │ a │ │ --- │ │ i64 │ ╞═════╡ │ 6 │ └─────┘
- set_sorted( ) Self[source]
 Indicate that one or multiple columns are sorted.
- Parameters:
 - column
 Columns that are sorted
- more_columns
 Additional columns that are sorted, specified as positional arguments.
- descending
 Whether the columns are sorted in descending order.
- shift( ) Self[source]
 Shift values by the given number of indices.
- Parameters:
 - n
 Number of indices to shift forward. If a negative value is passed, values are shifted in the opposite direction instead.
- fill_value
 Fill the resulting null values with this value. Accepts expression input. Non-expression inputs are parsed as literals.
Notes
This method is similar to the LAG operation in SQL when the value for n is positive. With a negative value for n, it is similar to LEAD.
Examples
By default, values are shifted forward by one index.
>>> lf = pl.LazyFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [5, 6, 7, 8], ... } ... ) >>> lf.shift().collect() shape: (4, 2) ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════╪══════╡ │ null ┆ null │ │ 1 ┆ 5 │ │ 2 ┆ 6 │ │ 3 ┆ 7 │ └──────┴──────┘
Pass a negative value to shift in the opposite direction instead.
>>> lf.shift(-2).collect() shape: (4, 2) ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════╪══════╡ │ 3 ┆ 7 │ │ 4 ┆ 8 │ │ null ┆ null │ │ null ┆ null │ └──────┴──────┘
Specify fill_value to fill the resulting null values.
>>> lf.shift(-2, fill_value=100).collect() shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 3 ┆ 7 │ │ 4 ┆ 8 │ │ 100 ┆ 100 │ │ 100 ┆ 100 │ └─────┴─────┘
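For a single column, the shifting rule amounts to the following plain-Python sketch (an illustration of the semantics, not polars code; shift_values is a hypothetical helper and None stands in for null):

```python
def shift_values(values, n=1, fill_value=None):
    """Shift a list by n positions, filling the vacated slots."""
    size = len(values)
    if n == 0:
        return list(values)
    if abs(n) >= size:
        # Every original value is shifted out of the column.
        return [fill_value] * size
    if n > 0:
        # Forward shift (LAG): vacated slots appear at the start.
        return [fill_value] * n + list(values[:size - n])
    # Backward shift (LEAD): vacated slots appear at the end.
    return list(values[-n:]) + [fill_value] * (-n)


lagged = shift_values([1, 2, 3, 4])
led = shift_values([1, 2, 3, 4], -2, fill_value=100)
```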
- shift_and_fill(fill_value: Expr | int | str | float, *, n: int = 1) Self[source]
 Shift values by the given number of places and fill the resulting null values.
Deprecated since version 0.19.12: Use shift() instead.
- Parameters:
 - fill_value
 Fill None values with the result of this expression.
- n
 Number of places to shift (may be negative).
- show_graph(
 - *,
 - optimized: bool = True,
 - show: bool = True,
 - output_path: str | Path | None = None,
 - raw_output: bool = False,
 - figsize: tuple[float, float] = (16.0, 12.0),
 - type_coercion: bool = True,
 - predicate_pushdown: bool = True,
 - projection_pushdown: bool = True,
 - simplify_expression: bool = True,
 - slice_pushdown: bool = True,
 - comm_subplan_elim: bool = True,
 - comm_subexpr_elim: bool = True,
 - streaming: bool = False,
 Show a plot of the query plan. Note that you should have graphviz installed.
- Parameters:
 - optimized
 Optimize the query plan.
- show
 Show the figure.
- output_path
 Write the figure to disk.
- raw_output
 Return dot syntax. This cannot be combined with show and/or output_path.
- figsize
 Passed to matplotlib if show == True.
- type_coercion
 Do type coercion optimization.
- predicate_pushdown
 Do predicate pushdown optimization.
- projection_pushdown
 Do projection pushdown optimization.
- simplify_expression
 Run simplify expressions optimization.
- slice_pushdown
 Slice pushdown optimization.
- comm_subplan_elim
 Will try to cache branching subplans that occur on self-joins or unions.
- comm_subexpr_elim
 Common subexpressions will be cached and reused.
- streaming
 Run parts of the query in a streaming fashion (this is in an alpha state)
Examples
>>> lf = pl.LazyFrame( ... { ... "a": ["a", "b", "a", "b", "b", "c"], ... "b": [1, 2, 3, 4, 5, 6], ... "c": [6, 5, 4, 3, 2, 1], ... } ... ) >>> lf.group_by("a", maintain_order=True).agg(pl.all().sum()).sort( ... "a" ... ).show_graph()
- sink_csv(
 - path: str | Path,
 - *,
 - include_bom: bool = False,
 - include_header: bool = True,
 - separator: str = ',',
 - line_terminator: str = '\n',
 - quote_char: str = '"',
 - batch_size: int = 1024,
 - datetime_format: str | None = None,
 - date_format: str | None = None,
 - time_format: str | None = None,
 - float_precision: int | None = None,
 - null_value: str | None = None,
 - quote_style: CsvQuoteStyle | None = None,
 - maintain_order: bool = True,
 - type_coercion: bool = True,
 - predicate_pushdown: bool = True,
 - projection_pushdown: bool = True,
 - simplify_expression: bool = True,
 - slice_pushdown: bool = True,
 - no_optimization: bool = False,
 Evaluate the query in streaming mode and write to a CSV file.
This allows streaming results that are larger than RAM to be written to disk.
- Parameters:
 - path
 File path to which the file should be written.
- include_bom
 Whether to include UTF-8 BOM in the CSV output.
- include_header
 Whether to include header in the CSV output.
- separator
 Separate CSV fields with this symbol.
- line_terminator
 String used to end each row.
- quote_char
 Byte to use as quoting character.
- batch_size
 Number of rows that will be processed per thread.
- datetime_format
 A format string, with the specifiers defined by the chrono Rust crate. If no format specified, the default fractional-second precision is inferred from the maximum timeunit found in the frame’s Datetime cols (if any).
- date_format
 A format string, with the specifiers defined by the chrono Rust crate.
- time_format
 A format string, with the specifiers defined by the chrono Rust crate.
- float_precision
 Number of decimal places to write, applied to both Float32 and Float64 datatypes.
- null_value
 A string representing null values (defaulting to the empty string).
- quote_style{‘necessary’, ‘always’, ‘non_numeric’, ‘never’}
 Determines the quoting strategy used.
necessary (default): This puts quotes around fields only when necessary. They are necessary when fields contain a quote, delimiter or record terminator. Quotes are also necessary when writing an empty record (which is indistinguishable from a record with one empty field). This is the default.
always: This puts quotes around every field. Always.
never: This never puts quotes around fields, even if that results in invalid CSV data (e.g.: by not quoting strings containing the separator).
non_numeric: This puts quotes around all fields that are non-numeric. Namely, when writing a field that does not parse as a valid float or integer, then quotes will be used even if they aren't strictly necessary.
- maintain_order
 Maintain the order in which data is processed. Setting this to False will be slightly faster.
- type_coercion
 Do type coercion optimization.
- predicate_pushdown
 Do predicate pushdown optimization.
- projection_pushdown
 Do projection pushdown optimization.
- simplify_expression
 Run simplify expressions optimization.
- slice_pushdown
 Slice pushdown optimization.
- no_optimization
 Turn off (certain) optimizations.
- Returns:
 - DataFrame
 
Examples
>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv") >>> lf.sink_csv("out.csv")
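The four quote_style strategies described above can be compared side by side with a plain-Python sketch. This is not the polars CSV writer: format_field and format_row are hypothetical helpers, fields are assumed to already be strings, and escaping an embedded quote by doubling it is an assumption based on common CSV practice.

```python
def format_field(field, quote_style="necessary", separator=",",
                 quote_char='"', line_terminator="\n"):
    """Render one string field according to a quoting strategy."""
    def quoted():
        # Assumption: embedded quote characters are escaped by doubling them.
        return quote_char + field.replace(quote_char, quote_char * 2) + quote_char

    if quote_style == "always":
        return quoted()
    if quote_style == "never":
        # May produce invalid CSV if the field contains the separator.
        return field
    if quote_style == "non_numeric":
        try:
            float(field)
        except ValueError:
            return quoted()
        return field
    if quote_style == "necessary":
        special = (separator, quote_char, line_terminator)
        return quoted() if any(tok in field for tok in special) else field
    raise ValueError(f"unknown quote_style: {quote_style!r}")


def format_row(row, quote_style="necessary", separator=","):
    """Join the formatted fields of one record."""
    return separator.join(
        format_field(f, quote_style, separator=separator) for f in row
    )


row = ["a,b", "1.5", "x"]
by_style = {
    style: format_row(row, style)
    for style in ("necessary", "always", "non_numeric", "never")
}
```

The sketch does not cover the special case of an empty record mentioned under necessary.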
- sink_ipc(
 - path: str | Path,
 - *,
 - compression: str | None = 'zstd',
 - maintain_order: bool = True,
 - type_coercion: bool = True,
 - predicate_pushdown: bool = True,
 - projection_pushdown: bool = True,
 - simplify_expression: bool = True,
 - slice_pushdown: bool = True,
 - no_optimization: bool = False,
 Evaluate the query in streaming mode and write to an IPC file.
This allows streaming results that are larger than RAM to be written to disk.
- Parameters:
 - path
 File path to which the file should be written.
- compression{‘lz4’, ‘zstd’}
 Choose “zstd” for good compression performance. Choose “lz4” for fast compression/decompression.
- maintain_order
 Maintain the order in which data is processed. Setting this to False will be slightly faster.
- type_coercion
 Do type coercion optimization.
- predicate_pushdown
 Do predicate pushdown optimization.
- projection_pushdown
 Do projection pushdown optimization.
- simplify_expression
 Run simplify expressions optimization.
- slice_pushdown
 Slice pushdown optimization.
- no_optimization
 Turn off (certain) optimizations.
- Returns:
 - DataFrame
 
Examples
>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv") >>> lf.sink_ipc("out.arrow")
- sink_ndjson(
 - path: str | Path,
 - *,
 - maintain_order: bool = True,
 - type_coercion: bool = True,
 - predicate_pushdown: bool = True,
 - projection_pushdown: bool = True,
 - simplify_expression: bool = True,
 - no_optimization: bool = False,
 - slice_pushdown: bool = True,
 Evaluate the query in streaming mode and write to an NDJSON file.
This allows streaming results that are larger than RAM to be written to disk.
- Parameters:
 - path
 File path to which the file should be written.
- maintain_order
 Maintain the order in which data is processed. Setting this to False will be slightly faster.
- type_coercion
 Do type coercion optimization.
- predicate_pushdown
 Do predicate pushdown optimization.
- projection_pushdown
 Do projection pushdown optimization.
- simplify_expression
 Run simplify expressions optimization.
- no_optimization
 Turn off (certain) optimizations.
- slice_pushdown
 Slice pushdown optimization.
- Returns:
 - DataFrame
 
Examples
>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv") >>> lf.sink_ndjson("out.ndjson")
- sink_parquet(
 - path: str | Path,
 - *,
 - compression: str = 'zstd',
 - compression_level: int | None = None,
 - statistics: bool = False,
 - row_group_size: int | None = None,
 - data_pagesize_limit: int | None = None,
 - maintain_order: bool = True,
 - type_coercion: bool = True,
 - predicate_pushdown: bool = True,
 - projection_pushdown: bool = True,
 - simplify_expression: bool = True,
 - slice_pushdown: bool = True,
 - no_optimization: bool = False,
 Evaluate the query in streaming mode and write to a Parquet file.
This allows streaming results that are larger than RAM to be written to disk.
- Parameters:
 - path
 File path to which the file should be written.
- compression{‘lz4’, ‘uncompressed’, ‘snappy’, ‘gzip’, ‘lzo’, ‘brotli’, ‘zstd’}
 Choose “zstd” for good compression performance. Choose “lz4” for fast compression/decompression. Choose “snappy” for more backwards compatibility guarantees when you deal with older parquet readers.
- compression_level
 The level of compression to use. Higher compression means smaller files on disk.
“gzip” : min-level: 0, max-level: 10.
“brotli” : min-level: 0, max-level: 11.
“zstd” : min-level: 1, max-level: 22.
- statistics
 Write statistics to the parquet headers. This requires extra compute.
- row_group_size
 Size of the row groups in number of rows. If None (default), the chunks of the DataFrame are used. Writing in smaller chunks may reduce memory pressure and improve writing speeds.
- data_pagesize_limit
 Size limit of individual data pages. If not set, defaults to 1024 * 1024 bytes.
- maintain_order
 Maintain the order in which data is processed. Setting this to False will be slightly faster.
- type_coercion
 Do type coercion optimization.
- predicate_pushdown
 Do predicate pushdown optimization.
- projection_pushdown
 Do projection pushdown optimization.
- simplify_expression
 Run simplify expressions optimization.
- slice_pushdown
 Slice pushdown optimization.
- no_optimization
 Turn off (certain) optimizations.
- Returns:
 - DataFrame
 
Examples
>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv") >>> lf.sink_parquet("out.parquet")
- slice(offset: int, length: int | None = None) Self[source]
 Get a slice of this DataFrame.
- Parameters:
 - offset
 Start index. Negative indexing is supported.
- length
 Length of the slice. If set to None, all rows starting at the offset will be selected.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": ["x", "y", "z"], ... "b": [1, 3, 5], ... "c": [2, 4, 6], ... } ... ) >>> lf.slice(1, 2).collect() shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ y ┆ 3 ┆ 4 │ │ z ┆ 5 ┆ 6 │ └─────┴─────┴─────┘
- sort(
 - by: IntoExpr | Iterable[IntoExpr],
 - *more_by: IntoExpr,
 - descending: bool | Sequence[bool] = False,
 - nulls_last: bool = False,
 - maintain_order: bool = False,
 Sort the DataFrame by the given columns.
- Parameters:
 - by
 Column(s) to sort by. Accepts expression input. Strings are parsed as column names.
- *more_by
 Additional columns to sort by, specified as positional arguments.
- descending
 Sort in descending order. When sorting by multiple columns, can be specified per column by passing a sequence of booleans.
- nulls_last
 Place null values last.
- maintain_order
 Whether the order should be maintained if elements are equal. Note that if True, streaming is not possible and performance might be worse since this requires a stable sort.
Examples
Pass a single column name to sort by that column.
>>> lf = pl.LazyFrame( ... { ... "a": [1, 2, None], ... "b": [6.0, 5.0, 4.0], ... "c": ["a", "c", "b"], ... } ... ) >>> lf.sort("a").collect() shape: (3, 3) ┌──────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞══════╪═════╪═════╡ │ null ┆ 4.0 ┆ b │ │ 1 ┆ 6.0 ┆ a │ │ 2 ┆ 5.0 ┆ c │ └──────┴─────┴─────┘
Sorting by expressions is also supported.
>>> lf.sort(pl.col("a") + pl.col("b") * 2, nulls_last=True).collect() shape: (3, 3) ┌──────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞══════╪═════╪═════╡ │ 2 ┆ 5.0 ┆ c │ │ 1 ┆ 6.0 ┆ a │ │ null ┆ 4.0 ┆ b │ └──────┴─────┴─────┘
Sort by multiple columns by passing a list of columns.
>>> lf.sort(["c", "a"], descending=True).collect() shape: (3, 3) ┌──────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞══════╪═════╪═════╡ │ 2 ┆ 5.0 ┆ c │ │ null ┆ 4.0 ┆ b │ │ 1 ┆ 6.0 ┆ a │ └──────┴─────┴─────┘
Or use positional arguments to sort by multiple columns in the same way.
>>> lf.sort("c", "a", descending=[False, True]).collect() shape: (3, 3) ┌──────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞══════╪═════╪═════╡ │ 1 ┆ 6.0 ┆ a │ │ null ┆ 4.0 ┆ b │ │ 2 ┆ 5.0 ┆ c │ └──────┴─────┴─────┘
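The interaction of multiple sort columns, per-column descending flags, and nulls_last can be sketched in plain Python with repeated stable sorts, starting from the least significant column. This illustrates the ordering only and is not how polars sorts; sort_rows is a hypothetical helper and None stands in for null.

```python
def sort_rows(rows, by, descending, nulls_last=False):
    """Sort a list of dict rows by several columns with per-column direction."""
    out = list(rows)
    # Sort by the last column first; stable sorts keep earlier orderings on ties.
    for col, desc in reversed(list(zip(by, descending))):
        nulls = [r for r in out if r[col] is None]
        rest = [r for r in out if r[col] is not None]
        rest.sort(key=lambda r: r[col], reverse=desc)
        out = rest + nulls if nulls_last else nulls + rest
    return out


rows = [
    {"a": 1, "b": 6.0, "c": "a"},
    {"a": 2, "b": 5.0, "c": "c"},
    {"a": None, "b": 4.0, "c": "b"},
]
by_a = sort_rows(rows, by=["a"], descending=[False])
by_c_then_a = sort_rows(rows, by=["c", "a"], descending=[False, True])
```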
- std(ddof: int = 1) Self[source]
 Aggregate the columns in the LazyFrame to their standard deviation value.
- Parameters:
 - ddof
 “Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [1, 2, 1, 1], ... } ... ) >>> lf.std().collect() shape: (1, 2) ┌──────────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════════╪═════╡ │ 1.290994 ┆ 0.5 │ └──────────┴─────┘ >>> lf.std(ddof=0).collect() shape: (1, 2) ┌──────────┬──────────┐ │ a ┆ b │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════════╪══════════╡ │ 1.118034 ┆ 0.433013 │ └──────────┴──────────┘
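The effect of ddof follows directly from the divisor N - ddof. A short plain-Python sketch of the formula (std_with_ddof is a hypothetical helper, not a polars function) reproduces the values shown above:

```python
import math


def std_with_ddof(values, ddof=1):
    """Standard deviation with divisor N - ddof."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - ddof))


a = [1, 2, 3, 4]
b = [1, 2, 1, 1]
sample_std_a = std_with_ddof(a)               # divisor N - 1
population_std_a = std_with_ddof(a, ddof=0)   # divisor N
```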
- sum() Self[source]
 Aggregate the columns in the LazyFrame to their sum value.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [1, 2, 1, 1], ... } ... ) >>> lf.sum().collect() shape: (1, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 10 ┆ 5 │ └─────┴─────┘
- tail(n: int = 5) Self[source]
 Get the last n rows.
- Parameters:
 - n
 Number of rows to return.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": [1, 2, 3, 4, 5, 6], ... "b": [7, 8, 9, 10, 11, 12], ... } ... ) >>> lf.tail().collect() shape: (5, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 2 ┆ 8 │ │ 3 ┆ 9 │ │ 4 ┆ 10 │ │ 5 ┆ 11 │ │ 6 ┆ 12 │ └─────┴─────┘ >>> lf.tail(2).collect() shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 5 ┆ 11 │ │ 6 ┆ 12 │ └─────┴─────┘
- take_every(n: int) Self[source]
 Take every nth row in the LazyFrame and return as a new LazyFrame.
Deprecated since version 0.19.0: This method has been renamed to gather_every().
- Parameters:
 - n
 Gather every n-th row.
- top_k(
 - k: int,
 - *,
 - by: IntoExpr | Iterable[IntoExpr],
 - descending: bool | Sequence[bool] = False,
 - nulls_last: bool = False,
 - maintain_order: bool = False,
 Return the k largest elements.
If descending=True the smallest elements will be given.
- Parameters:
 - k
 Number of rows to return.
- by
 Column(s) included in sort order. Accepts expression input. Strings are parsed as column names.
- descending
 Return the k smallest elements instead. When selecting by multiple columns, this can be specified per column by passing a sequence of booleans.
- nulls_last
 Place null values last.
- maintain_order
 Whether the order should be maintained if elements are equal. Note that if true, streaming is not possible and performance might be worse since this requires a stable search.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [2, 1, 1, 3, 2, 1],
...     }
... )
Get the rows which contain the 4 largest values in column b.
>>> lf.top_k(4, by="b").collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ b   ┆ 3   │
│ a   ┆ 2   │
│ b   ┆ 2   │
│ b   ┆ 1   │
└─────┴─────┘
Get the rows which contain the 4 largest values when sorting on column b and a.
>>> lf.top_k(4, by=["b", "a"]).collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ b   ┆ 3   │
│ b   ┆ 2   │
│ a   ┆ 2   │
│ c   ┆ 1   │
└─────┴─────┘
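The selection logic in the two examples above can be mirrored in plain Python with `heapq.nlargest` from the standard library. This is only a sketch of the idea under the assumption that rows are modelled as `(a, b)` tuples; it says nothing about how Polars implements `top_k`. The order of tied rows here is a property of `heapq`, whereas Polars only preserves the order of equal elements when `maintain_order=True`.

```python
import heapq

# Rows modelled as (a, b) tuples, matching the example frame above.
rows = [("a", 2), ("b", 1), ("a", 1), ("b", 3), ("b", 2), ("c", 1)]

# 4 largest rows by column "b" (tuple index 1).
top4_by_b = heapq.nlargest(4, rows, key=lambda r: r[1])

# 4 largest rows when sorting on "b" first and "a" second:
# compare (b, a) tuples so that "a" breaks ties in "b".
top4_by_b_a = heapq.nlargest(4, rows, key=lambda r: (r[1], r[0]))

print(top4_by_b)    # [('b', 3), ('a', 2), ('b', 2), ('b', 1)]
print(top4_by_b_a)  # [('b', 3), ('b', 2), ('a', 2), ('c', 1)]
```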
- unique(
 - subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None,
 - *,
 - keep: UniqueKeepStrategy = 'any',
 - maintain_order: bool = False,
 Drop duplicate rows from this LazyFrame.
- Parameters:
 - subset
 Column name(s) or selector(s), to consider when identifying duplicate rows. If set to None (default), use all columns.
- keep{‘first’, ‘last’, ‘any’, ‘none’}
 Which of the duplicate rows to keep.
 - ‘any’: Does not give any guarantee of which row is kept. This allows more optimizations.
 - ‘none’: Don’t keep duplicate rows.
 - ‘first’: Keep first unique row.
 - ‘last’: Keep last unique row.
- maintain_order
 Keep the same order as the original DataFrame. This is more expensive to compute. Setting this to True prevents the query from running on the streaming engine.
- Returns:
 - LazyFrame
 LazyFrame with unique rows.
Warning
This method will fail if there is a column of type List in the DataFrame or subset.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3, 1],
...         "bar": ["a", "a", "a", "a"],
...         "ham": ["b", "b", "b", "b"],
...     }
... )
>>> lf.unique(maintain_order=True).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ a   ┆ b   │
│ 2   ┆ a   ┆ b   │
│ 3   ┆ a   ┆ b   │
└─────┴─────┴─────┘
>>> lf.unique(subset=["bar", "ham"], maintain_order=True).collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ a   ┆ b   │
└─────┴─────┴─────┘
>>> lf.unique(keep="last", maintain_order=True).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 2   ┆ a   ┆ b   │
│ 3   ┆ a   ┆ b   │
│ 1   ┆ a   ┆ b   │
└─────┴─────┴─────┘
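The `keep` strategies can be pictured with a small plain-Python sketch over a list of row tuples. The helper `unique_rows` and its `key` argument are hypothetical names for this illustration, not part of Polars; in this sketch `'any'` is simply treated like `'first'`, while Polars makes no guarantee about which row `'any'` keeps.

```python
from collections import Counter


def unique_rows(rows, key=lambda r: r, keep="first"):
    """Mimic the documented `keep` strategies (hypothetical helper, not a Polars API)."""
    if keep == "none":
        # Drop every row whose key occurs more than once.
        counts = Counter(key(r) for r in rows)
        return [r for r in rows if counts[key(r)] == 1]
    if keep == "last":
        # Keep the last occurrence: scan in reverse, then restore the order.
        return unique_rows(rows[::-1], key, "first")[::-1]
    # "first" (and, in this sketch only, "any"): keep the first occurrence.
    seen = set()
    out = []
    for r in rows:
        k = key(r)
        if k not in seen:
            seen.add(k)
            out.append(r)
    return out


# (foo, bar, ham) rows from the example above.
rows = [(1, "a", "b"), (2, "a", "b"), (3, "a", "b"), (1, "a", "b")]

first = unique_rows(rows)
last = unique_rows(rows, keep="last")
none = unique_rows(rows, keep="none")
# Equivalent of subset=["bar", "ham"]: only compare the last two fields.
by_subset = unique_rows(rows, key=lambda r: r[1:])

print(first)      # [(1, 'a', 'b'), (2, 'a', 'b'), (3, 'a', 'b')]
print(last)       # [(2, 'a', 'b'), (3, 'a', 'b'), (1, 'a', 'b')]
print(none)       # [(2, 'a', 'b'), (3, 'a', 'b')]
print(by_subset)  # [(1, 'a', 'b')]
```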
- unnest(
 - columns: ColumnNameOrSelector | Collection[ColumnNameOrSelector],
 - *more_columns: ColumnNameOrSelector,
 Decompose struct columns into separate columns for each of their fields.
The new columns will be inserted into the DataFrame at the location of the struct column.
- Parameters:
 - columns
 Name of the struct column(s) that should be unnested.
- *more_columns
 Additional columns to unnest, specified as positional arguments.
Examples
>>> df = pl.LazyFrame(
...     {
...         "before": ["foo", "bar"],
...         "t_a": [1, 2],
...         "t_b": ["a", "b"],
...         "t_c": [True, None],
...         "t_d": [[1, 2], [3]],
...         "after": ["baz", "womp"],
...     }
... ).select("before", pl.struct(pl.col("^t_.$")).alias("t_struct"), "after")
>>> df.collect()
shape: (2, 3)
┌────────┬─────────────────────┬───────┐
│ before ┆ t_struct            ┆ after │
│ ---    ┆ ---                 ┆ ---   │
│ str    ┆ struct[4]           ┆ str   │
╞════════╪═════════════════════╪═══════╡
│ foo    ┆ {1,"a",true,[1, 2]} ┆ baz   │
│ bar    ┆ {2,"b",null,[3]}    ┆ womp  │
└────────┴─────────────────────┴───────┘
>>> df.unnest("t_struct").collect()
shape: (2, 6)
┌────────┬─────┬─────┬──────┬───────────┬───────┐
│ before ┆ t_a ┆ t_b ┆ t_c  ┆ t_d       ┆ after │
│ ---    ┆ --- ┆ --- ┆ ---  ┆ ---       ┆ ---   │
│ str    ┆ i64 ┆ str ┆ bool ┆ list[i64] ┆ str   │
╞════════╪═════╪═════╪══════╪═══════════╪═══════╡
│ foo    ┆ 1   ┆ a   ┆ true ┆ [1, 2]    ┆ baz   │
│ bar    ┆ 2   ┆ b   ┆ null ┆ [3]       ┆ womp  │
└────────┴─────┴─────┴──────┴───────────┴───────┘
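The placement rule (fields land where the struct column was) can be illustrated with a plain-Python sketch in which a row is a dict and a struct is a nested dict. The helper `unnest_row` is a hypothetical name for this illustration and is not a Polars function.

```python
def unnest_row(row, column):
    """Replace a dict-valued entry with its fields, at the position of the
    original key (hypothetical helper, not a Polars API)."""
    out = {}
    for name, value in row.items():
        if name == column:
            # The struct's fields are inserted where the struct column was.
            out.update(value)
        else:
            out[name] = value
    return out


row = {"before": "foo", "t_struct": {"t_a": 1, "t_b": "a"}, "after": "baz"}
flat = unnest_row(row, "t_struct")

print(list(flat))  # ['before', 't_a', 't_b', 'after']
```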
- update(
 - other: LazyFrame,
 - on: str | Sequence[str] | None = None,
 - left_on: str | Sequence[str] | None = None,
 - right_on: str | Sequence[str] | None = None,
 - how: Literal['left', 'inner', 'outer'] = 'left',
 - include_nulls: bool | None = False,
 Update the values in this LazyFrame with the non-null values in other.
- Parameters:
 - other
 LazyFrame that will be used to update the values.
- on
 Column names that will be joined on; if None, the implicit row index is used as a join key instead.
- left_on
 Join column(s) of the left DataFrame.
- right_on
 Join column(s) of the right DataFrame.
- how{‘left’, ‘inner’, ‘outer’}
 ‘left’ will keep all rows from the left table; rows may be duplicated if multiple rows in the right frame match the left row’s key.
‘inner’ keeps only those rows where the key exists in both frames.
‘outer’ will update existing rows where the key matches while also adding any new rows contained in the given frame.
- include_nulls
 If True, null values from the right DataFrame will be used to update the left DataFrame.
Notes
This is syntactic sugar for a left/inner join, with an optional coalesce when include_nulls = False.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "A": [1, 2, 3, 4],
...         "B": [400, 500, 600, 700],
...     }
... )
>>> lf.collect()
shape: (4, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 400 │
│ 2   ┆ 500 │
│ 3   ┆ 600 │
│ 4   ┆ 700 │
└─────┴─────┘
>>> new_lf = pl.LazyFrame(
...     {
...         "B": [-66, None, -99],
...         "C": [5, 3, 1],
...     }
... )
Update lf values with the non-null values in new_lf, by row index:
>>> lf.update(new_lf).collect()
shape: (4, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ -66 │
│ 2   ┆ 500 │
│ 3   ┆ -99 │
│ 4   ┆ 700 │
└─────┴─────┘
Update lf values with the non-null values in new_lf, by row index, but only keeping those rows that are common to both frames:
>>> lf.update(new_lf, how="inner").collect()
shape: (3, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ -66 │
│ 2   ┆ 500 │
│ 3   ┆ -99 │
└─────┴─────┘
Update lf values with the non-null values in new_lf, using an outer join strategy that defines explicit join columns in each frame:
>>> lf.update(new_lf, left_on=["A"], right_on=["C"], how="outer").collect()
shape: (5, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ -99 │
│ 2   ┆ 500 │
│ 3   ┆ 600 │
│ 4   ┆ 700 │
│ 5   ┆ -66 │
└─────┴─────┘
Update lf values including null values in new_lf, using an outer join strategy that defines explicit join columns in each frame:
>>> lf.update(
...     new_lf, left_on="A", right_on="C", how="outer", include_nulls=True
... ).collect()
shape: (5, 2)
┌─────┬──────┐
│ A   ┆ B    │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 1   ┆ -99  │
│ 2   ┆ 500  │
│ 3   ┆ null │
│ 4   ┆ 700  │
│ 5   ┆ -66  │
└─────┴──────┘
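The coalescing behaviour for the simplest case (a `'left'` update matched by row index) can be sketched in plain Python for a single column. The helper `update_column` is a hypothetical name for this illustration, `None` stands in for null, and the sketch does not cover the `on`, `left_on`/`right_on`, `'inner'` or `'outer'` variants.

```python
def update_column(left, right, include_nulls=False):
    """Row-index 'left' update of one column (hypothetical helper, not a Polars API)."""
    out = []
    for i, old in enumerate(left):
        if i >= len(right):
            # No matching row in `right`: keep the left value.
            out.append(old)
        elif right[i] is None and not include_nulls:
            # Coalesce: a null on the right does not overwrite the left value.
            out.append(old)
        else:
            out.append(right[i])
    return out


b = [400, 500, 600, 700]
new_b = [-66, None, -99]

updated = update_column(b, new_b)
updated_with_nulls = update_column(b, new_b, include_nulls=True)

print(updated)             # [-66, 500, -99, 700]
print(updated_with_nulls)  # [-66, None, -99, 700]
```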
- var(ddof: int = 1) Self[source]
 Aggregate the columns in the LazyFrame to their variance value.
- Parameters:
 - ddof
 “Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.var().collect()
shape: (1, 2)
┌──────────┬──────┐
│ a        ┆ b    │
│ ---      ┆ ---  │
│ f64      ┆ f64  │
╞══════════╪══════╡
│ 1.666667 ┆ 0.25 │
└──────────┴──────┘
>>> lf.var(ddof=0).collect()
shape: (1, 2)
┌──────┬────────┐
│ a    ┆ b      │
│ ---  ┆ ---    │
│ f64  ┆ f64    │
╞══════╪════════╡
│ 1.25 ┆ 0.1875 │
└──────┴────────┘
- property width: int[source]
 Get the width of the LazyFrame.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [4, 5, 6],
...     }
... )
>>> lf.width
2
- with_columns(
 - *exprs: IntoExpr | Iterable[IntoExpr],
 - **named_exprs: IntoExpr,
 Add columns to this LazyFrame.
Added columns will replace existing columns with the same name.
- Parameters:
 - *exprs
 Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- **named_exprs
 Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.
- Returns:
 - LazyFrame
 A new LazyFrame with the columns added.
Notes
Creating a new LazyFrame using this method does not create a new copy of existing data.
Examples
Pass an expression to add it as a new column.
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [0.5, 4, 10, 13],
...         "c": [True, True, False, True],
...     }
... )
>>> lf.with_columns((pl.col("a") ** 2).alias("a^2")).collect()
shape: (4, 4)
┌─────┬──────┬───────┬──────┐
│ a   ┆ b    ┆ c     ┆ a^2  │
│ --- ┆ ---  ┆ ---   ┆ ---  │
│ i64 ┆ f64  ┆ bool  ┆ f64  │
╞═════╪══════╪═══════╪══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 1.0  │
│ 2   ┆ 4.0  ┆ true  ┆ 4.0  │
│ 3   ┆ 10.0 ┆ false ┆ 9.0  │
│ 4   ┆ 13.0 ┆ true  ┆ 16.0 │
└─────┴──────┴───────┴──────┘
Added columns will replace existing columns with the same name.
>>> lf.with_columns(pl.col("a").cast(pl.Float64)).collect()
shape: (4, 3)
┌─────┬──────┬───────┐
│ a   ┆ b    ┆ c     │
│ --- ┆ ---  ┆ ---   │
│ f64 ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╡
│ 1.0 ┆ 0.5  ┆ true  │
│ 2.0 ┆ 4.0  ┆ true  │
│ 3.0 ┆ 10.0 ┆ false │
│ 4.0 ┆ 13.0 ┆ true  │
└─────┴──────┴───────┘
Multiple columns can be added by passing a list of expressions.
>>> lf.with_columns(
...     [
...         (pl.col("a") ** 2).alias("a^2"),
...         (pl.col("b") / 2).alias("b/2"),
...         (pl.col("c").not_()).alias("not c"),
...     ]
... ).collect()
shape: (4, 6)
┌─────┬──────┬───────┬──────┬──────┬───────┐
│ a   ┆ b    ┆ c     ┆ a^2  ┆ b/2  ┆ not c │
│ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  ┆ f64  ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╪══════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 1.0  ┆ 0.25 ┆ false │
│ 2   ┆ 4.0  ┆ true  ┆ 4.0  ┆ 2.0  ┆ false │
│ 3   ┆ 10.0 ┆ false ┆ 9.0  ┆ 5.0  ┆ true  │
│ 4   ┆ 13.0 ┆ true  ┆ 16.0 ┆ 6.5  ┆ false │
└─────┴──────┴───────┴──────┴──────┴───────┘
Multiple columns can also be added using positional arguments instead of a list.
>>> lf.with_columns(
...     (pl.col("a") ** 2).alias("a^2"),
...     (pl.col("b") / 2).alias("b/2"),
...     (pl.col("c").not_()).alias("not c"),
... ).collect()
shape: (4, 6)
┌─────┬──────┬───────┬──────┬──────┬───────┐
│ a   ┆ b    ┆ c     ┆ a^2  ┆ b/2  ┆ not c │
│ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  ┆ f64  ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╪══════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 1.0  ┆ 0.25 ┆ false │
│ 2   ┆ 4.0  ┆ true  ┆ 4.0  ┆ 2.0  ┆ false │
│ 3   ┆ 10.0 ┆ false ┆ 9.0  ┆ 5.0  ┆ true  │
│ 4   ┆ 13.0 ┆ true  ┆ 16.0 ┆ 6.5  ┆ false │
└─────┴──────┴───────┴──────┴──────┴───────┘
Use keyword arguments to easily name your expression inputs.
>>> lf.with_columns(
...     ab=pl.col("a") * pl.col("b"),
...     not_c=pl.col("c").not_(),
... ).collect()
shape: (4, 5)
┌─────┬──────┬───────┬──────┬───────┐
│ a   ┆ b    ┆ c     ┆ ab   ┆ not_c │
│ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 0.5  ┆ false │
│ 2   ┆ 4.0  ┆ true  ┆ 8.0  ┆ false │
│ 3   ┆ 10.0 ┆ false ┆ 30.0 ┆ true  │
│ 4   ┆ 13.0 ┆ true  ┆ 52.0 ┆ false │
└─────┴──────┴───────┴──────┴───────┘
Expressions with multiple outputs can be automatically instantiated as Structs by enabling the setting Config.set_auto_structify(True):
>>> with pl.Config(auto_structify=True):
...     lf.drop("c").with_columns(
...         diffs=pl.col(["a", "b"]).diff().name.suffix("_diff"),
...     ).collect()
...
shape: (4, 3)
┌─────┬──────┬─────────────┐
│ a   ┆ b    ┆ diffs       │
│ --- ┆ ---  ┆ ---         │
│ i64 ┆ f64  ┆ struct[2]   │
╞═════╪══════╪═════════════╡
│ 1   ┆ 0.5  ┆ {null,null} │
│ 2   ┆ 4.0  ┆ {1,3.5}     │
│ 3   ┆ 10.0 ┆ {1,6.0}     │
│ 4   ┆ 13.0 ┆ {1,3.0}     │
└─────┴──────┴─────────────┘
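Two points from this entry, that a same-named column is replaced in place and that existing data is not copied, can be pictured with a plain-Python sketch in which a frame is a dict of column lists. The function below is a hypothetical stand-in written for this illustration; it is not Polars' implementation and ignores expressions entirely.

```python
def with_columns(frame, **named):
    """Return a new column mapping: existing names are replaced in place,
    new names are appended (hypothetical stand-in, not a Polars API)."""
    out = dict(frame)   # shallow copy: the column lists themselves are shared
    out.update(named)   # replace same-named columns, append the rest
    return out


columns = {"a": [1, 2, 3, 4], "b": [0.5, 4.0, 10.0, 13.0]}

result = with_columns(
    columns,
    a=[float(v) for v in columns["a"]],                      # replaces "a"
    ab=[x * y for x, y in zip(columns["a"], columns["b"])],  # appended
)

print(list(result))                 # ['a', 'b', 'ab']
print(result["ab"])                 # [0.5, 8.0, 30.0, 52.0]
print(result["b"] is columns["b"])  # True: untouched column is not copied
```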
- with_columns_seq(
 - *exprs: IntoExpr | Iterable[IntoExpr],
 - **named_exprs: IntoExpr,
 Add columns to this LazyFrame.
Added columns will replace existing columns with the same name.
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.
- Parameters:
 - *exprs
 Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- **named_exprs
 Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.
- Returns:
 - LazyFrame
 A new LazyFrame with the columns added.
- with_context(other: Self | list[Self]) Self[source]
 Add an external context to the computation graph.
This allows expressions to also access columns from DataFrames that are not part of this one.
- Parameters:
 - other
 Lazy DataFrame to join with.
Examples
>>> lf = pl.LazyFrame({"a": [1, 2, 3], "b": ["a", "c", None]})
>>> lf_other = pl.LazyFrame({"c": ["foo", "ham"]})
>>> lf.with_context(lf_other).select(
...     pl.col("b") + pl.col("c").first()
... ).collect()
shape: (3, 1)
┌──────┐
│ b    │
│ ---  │
│ str  │
╞══════╡
│ afoo │
│ cfoo │
│ null │
└──────┘
Fill nulls with the median from another DataFrame:
>>> train_lf = pl.LazyFrame(
...     {"feature_0": [-1.0, 0, 1], "feature_1": [-1.0, 0, 1]}
... )
>>> test_lf = pl.LazyFrame(
...     {"feature_0": [-1.0, None, 1], "feature_1": [-1.0, 0, 1]}
... )
>>> test_lf.with_context(
...     train_lf.select(pl.all().name.suffix("_train"))
... ).select(
...     pl.col("feature_0").fill_null(pl.col("feature_0_train").median())
... ).collect()
shape: (3, 1)
┌───────────┐
│ feature_0 │
│ ---       │
│ f64       │
╞═══════════╡
│ -1.0      │
│ 0.0       │
│ 1.0       │
└───────────┘
- with_row_count(name: str = 'row_nr', offset: int = 0) Self[source]
 Add a column at index 0 that counts the rows.
- Parameters:
 - name
 Name of the column to add.
- offset
 Start the row count at this offset.
Warning
This can have a negative effect on query performance. This may, for instance, block predicate pushdown optimization.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> lf.with_row_count().collect()
shape: (3, 3)
┌────────┬─────┬─────┐
│ row_nr ┆ a   ┆ b   │
│ ---    ┆ --- ┆ --- │
│ u32    ┆ i64 ┆ i64 │
╞════════╪═════╪═════╡
│ 0      ┆ 1   ┆ 2   │
│ 1      ┆ 3   ┆ 4   │
│ 2      ┆ 5   ┆ 6   │
└────────┴─────┴─────┘
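The effect of `offset` can be pictured with `enumerate` in plain Python. The helper `add_row_count` is a hypothetical name for this sketch and rows are modelled as tuples; it only illustrates the numbering, not Polars' behaviour or performance.

```python
def add_row_count(rows, offset=0):
    """Prepend a running row number, starting at `offset`
    (hypothetical helper, not a Polars API)."""
    return [(i, *row) for i, row in enumerate(rows, start=offset)]


rows = [(1, 2), (3, 4), (5, 6)]

counted = add_row_count(rows)
counted_from_10 = add_row_count(rows, offset=10)

print(counted)          # [(0, 1, 2), (1, 3, 4), (2, 5, 6)]
print(counted_from_10)  # [(10, 1, 2), (11, 3, 4), (12, 5, 6)]
```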
- write_json(file: None = None) str[source]
 - write_json(file: IOBase | str | Path) None
 Serialize the logical plan of this LazyFrame to a file or string in JSON format.
Deprecated since version 0.18.12: This method has been renamed to LazyFrame.serialize().
- Parameters:
 - file
 File path to which the result should be written. If set to None (default), the output is returned as a string instead.