LazyFrame

This page gives an overview of all public LazyFrame methods.

class polars.LazyFrame( data: FrameInitTypes | None = None, schema: SchemaDefinition | None = None, *, schema_overrides: SchemaDict | None = None, orient: Orientation | None = None, infer_schema_length: int | None = 100, nan_to_null: bool = False, )[source]

Representation of a Lazy computation graph/query against a DataFrame.

This allows for whole-query optimisation in addition to parallelism, and is the preferred (and highest-performance) mode of operation for polars.

Parameters:

data: dict, Sequence, ndarray, Series, or pandas.DataFrame

Two-dimensional data in various forms; dict input must contain Sequences, Generators, or a range. Sequence may contain Series or other Sequences.

schema: Sequence of str, (str,DataType) pairs, or a {str:DataType,} dict

The DataFrame schema may be declared in several ways:

As a dict of {name:type} pairs; if type is None, it will be auto-inferred.
As a list of column names; in this case types are automatically inferred.
As a list of (name,type) pairs; this is equivalent to the dictionary form.

If you supply a list of column names that does not match the names in the underlying data, the names given here will overwrite them. The number of names given in the schema should match the underlying data dimensions.

schema_overrides: dict, default None

Support type specification or override of one or more columns; note that any dtypes inferred from the schema param will be overridden.

The number of entries in the schema should match the underlying data dimensions, unless a sequence of dictionaries is being passed, in which case a partial schema can be declared to prevent specific fields from being loaded.

orient: {‘col’, ‘row’}, default None

Whether to interpret two-dimensional data as columns or as rows. If None, the orientation is inferred by matching the columns and data dimensions. If this does not yield conclusive results, column orientation is used.

infer_schema_length: int, default 100

Maximum number of rows to read for schema inference; only applies if the input data is a sequence or generator of rows; other input is read as-is.

nan_to_null: bool, default False

If the data comes from one or more numpy arrays, can optionally convert input data np.nan values to null instead. This is a no-op for all other input data.

Notes

Initialising LazyFrame(...) directly is equivalent to DataFrame(...).lazy().

Examples

Constructing a LazyFrame directly from a dictionary:

>>> data = {"a": [1, 2], "b": [3, 4]}
>>> lf = pl.LazyFrame(data)
>>> lf.collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘

Notice that the dtypes are automatically inferred as polars Int64:

>>> lf.dtypes
[Int64, Int64]

To specify a more detailed/specific frame schema you can supply the schema parameter with a dictionary of (name,dtype) pairs…

>>> data = {"col1": [0, 2], "col2": [3, 7]}
>>> lf2 = pl.LazyFrame(data, schema={"col1": pl.Float32, "col2": pl.Int64})
>>> lf2.collect()
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f32  ┆ i64  │
╞══════╪══════╡
│ 0.0  ┆ 3    │
│ 2.0  ┆ 7    │
└──────┴──────┘

…a sequence of (name,dtype) pairs…

>>> data = {"col1": [1, 2], "col2": [3, 4]}
>>> lf3 = pl.LazyFrame(data, schema=[("col1", pl.Float32), ("col2", pl.Int64)])
>>> lf3.collect()
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f32  ┆ i64  │
╞══════╪══════╡
│ 1.0  ┆ 3    │
│ 2.0  ┆ 4    │
└──────┴──────┘

…or a list of typed Series.

>>> data = [
...     pl.Series("col1", [1, 2], dtype=pl.Float32),
...     pl.Series("col2", [3, 4], dtype=pl.Int64),
... ]
>>> lf4 = pl.LazyFrame(data)
>>> lf4.collect()
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f32  ┆ i64  │
╞══════╪══════╡
│ 1.0  ┆ 3    │
│ 2.0  ┆ 4    │
└──────┴──────┘

Constructing a LazyFrame from a numpy ndarray, specifying column names:

>>> import numpy as np
>>> data = np.array([(1, 2), (3, 4)], dtype=np.int64)
>>> lf5 = pl.LazyFrame(data, schema=["a", "b"], orient="col")
>>> lf5.collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘

Constructing a LazyFrame from a list of lists, row orientation inferred:

>>> data = [[1, 2, 3], [4, 5, 6]]
>>> lf6 = pl.LazyFrame(data, schema=["a", "b", "c"])
>>> lf6.collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
│ 4   ┆ 5   ┆ 6   │
└─────┴─────┴─────┘

Methods:

`approx_n_unique`	Approximate count of unique values.
`bottom_k`	Return the k smallest elements.
`cache`	Cache the result once the execution of the physical plan hits this node.
`clear`	Create an empty copy of the current LazyFrame, with zero to 'n' rows.
`clone`	Very cheap deepcopy/clone.
`collect`	Collect into a DataFrame.
`deserialize`	Read a logical plan from a JSON file to construct a LazyFrame.
`drop`	Remove columns from the dataframe.
`drop_nulls`	Drop all rows that contain null values.
`explain`	Create a string representation of the query plan.
`explode`	Explode the dataframe to long format by exploding the given columns.
`fetch`	Collect a small number of rows for debugging purposes.
`fill_nan`	Fill floating point NaN values.
`fill_null`	Fill null values using the specified value or strategy.
`filter`	Filter the rows in the LazyFrame based on a predicate expression.
`first`	Get the first row of the DataFrame.
`from_json`	Read a logical plan from a JSON string to construct a LazyFrame.
`groupby`	Start a groupby operation.
`groupby_dynamic`	Group based on a time value (or index value of type Int32, Int64).
`groupby_rolling`	Create rolling groups based on a time, Int32, or Int64 column.
`head`	Get the first n rows.
`inspect`	Inspect a node in the computation graph.
`interpolate`	Interpolate intermediate values.
`join`	Add a join operation to the Logical Plan.
`join_asof`	Perform an asof join.
`last`	Get the last row of the DataFrame.
`lazy`	Return lazy representation, i.e. itself.
`limit`	Get the first n rows.
`map`	Apply a custom function.
`max`	Aggregate the columns in the LazyFrame to their maximum value.
`mean`	Aggregate the columns in the LazyFrame to their mean value.
`median`	Aggregate the columns in the LazyFrame to their median value.
`melt`	Unpivot a DataFrame from wide to long format.
`merge_sorted`	Take two sorted DataFrames and merge them by the sorted key.
`min`	Aggregate the columns in the LazyFrame to their minimum value.
`null_count`	Aggregate the columns in the LazyFrame as the sum of their null value count.
`pipe`	Offers a structured way to apply a sequence of user-defined functions (UDFs).
`profile`	Profile a LazyFrame.
`quantile`	Aggregate the columns in the LazyFrame to their quantile value.
`read_json`	Read a logical plan from a JSON file to construct a LazyFrame.
`rename`	Rename column names.
`reverse`	Reverse the DataFrame.
`select`	Select columns from this LazyFrame.
`select_seq`	Select columns from this LazyFrame.
`serialize`	Serialize the logical plan of this LazyFrame to a file or string in JSON format.
`set_sorted`	Indicate that one or multiple columns are sorted.
`shift`	Shift the values by a given period.
`shift_and_fill`	Shift the values by a given period and fill the resulting null values.
`show_graph`	Show a plot of the query plan.
`sink_ipc`	Persists a LazyFrame at the provided path.
`sink_parquet`	Persists a LazyFrame at the provided path.
`slice`	Get a slice of this DataFrame.
`sort`	Sort the dataframe by the given columns.
`std`	Aggregate the columns in the LazyFrame to their standard deviation value.
`sum`	Aggregate the columns in the LazyFrame to their sum value.
`tail`	Get the last n rows.
`take_every`	Take every nth row in the LazyFrame and return as a new LazyFrame.
`top_k`	Return the k largest elements.
`unique`	Drop duplicate rows from this dataframe.
`unnest`	Decompose struct columns into separate columns for each of their fields.
`update`	Update the values in this LazyFrame with the non-null values in other.
`var`	Aggregate the columns in the LazyFrame to their variance value.
`with_columns`	Add columns to this DataFrame.
`with_columns_seq`	Add columns to this DataFrame.
`with_context`	Add an external context to the computation graph.
`with_row_count`	Add a column at index 0 that counts the rows.

Attributes:

`columns`	Get column names.
`dtypes`	Get dtypes of columns in LazyFrame.
`schema`	Get a dict[column name, DataType].
`width`	Get the width of the LazyFrame.

approx_n_unique() → Self[source]

Approximate count of unique values.

This is done using the HyperLogLog++ algorithm for cardinality estimation.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.approx_n_unique().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 4   ┆ 2   │
└─────┴─────┘

bottom_k( k: int, *, by: IntoExpr | Iterable[IntoExpr], descending: bool | Sequence[bool] = False, nulls_last: bool = False, maintain_order: bool = False, ) → Self[source]

Return the k smallest elements.

If descending=True, the largest elements are returned instead.

Parameters:

k: Number of rows to return.
by: Column(s) included in sort order. Accepts expression input. Strings are parsed as column names.
descending: Return the k largest elements instead of the k smallest. When sorting by multiple columns, this can be specified per column by passing a sequence of booleans.
nulls_last: Place null values last.
maintain_order: Whether the order should be maintained if elements are equal. Note that if true, streaming is not possible and performance might be worse, since this requires a stable search.

See also

top_k

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [2, 1, 1, 3, 2, 1],
...     }
... )

Get the rows which contain the 4 smallest values in column b.

>>> lf.bottom_k(4, by="b").collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ b   ┆ 1   │
│ a   ┆ 1   │
│ c   ┆ 1   │
│ a   ┆ 2   │
└─────┴─────┘

Get the rows which contain the 4 smallest values when sorting on column a and b.

>>> lf.bottom_k(4, by=["a", "b"]).collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 1   │
│ a   ┆ 2   │
│ b   ┆ 1   │
│ b   ┆ 2   │
└─────┴─────┘
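
As a rough mental model (not polars' implementation), bottom_k behaves like a sort on the by columns followed by taking the first k rows. The sketch below reproduces the two results above in plain Python. The function name bottom_k_model and the tuple layout are hypothetical and chosen only for illustration.

```python
import heapq

# Same rows as the example above, stored as (a, b) tuples.
rows = [("a", 2), ("b", 1), ("a", 1), ("b", 3), ("b", 2), ("c", 1)]


def bottom_k_model(rows, k, key, descending=False):
    """Plain-Python model: the k smallest rows by `key` (k largest if descending)."""
    pick = heapq.nlargest if descending else heapq.nsmallest
    return pick(k, rows, key=key)


# Order by column b only.
by_b = bottom_k_model(rows, 4, key=lambda r: r[1])
# Order by column a first, then by column b.
by_a_then_b = bottom_k_model(rows, 4, key=lambda r: (r[0], r[1]))

print([b for _, b in by_b])  # [1, 1, 1, 2]
print(by_a_then_b)           # [('a', 1), ('a', 2), ('b', 1), ('b', 2)]
```

The model breaks ties in input order. Per the maintain_order description above, the real method only preserves the order of equal elements when maintain_order=True.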

cache() → Self[source]: Cache the result once the execution of the physical plan hits this node.

clear(n: int = 0) → LazyFrame[source]

Create an empty copy of the current LazyFrame, with zero to ‘n’ rows.

Returns a copy with an identical schema but no data.

Parameters:

n: Number of (empty) rows to return in the cleared frame.

See also

clone: Cheap deepcopy/clone.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [None, 2, 3, 4],
...         "b": [0.5, None, 2.5, 13],
...         "c": [True, True, False, None],
...     }
... )
>>> lf.clear().fetch()
shape: (0, 3)
┌─────┬─────┬──────┐
│ a   ┆ b   ┆ c    │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ f64 ┆ bool │
╞═════╪═════╪══════╡
└─────┴─────┴──────┘

>>> lf.clear(2).fetch()
shape: (2, 3)
┌──────┬──────┬──────┐
│ a    ┆ b    ┆ c    │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ f64  ┆ bool │
╞══════╪══════╪══════╡
│ null ┆ null ┆ null │
│ null ┆ null ┆ null │
└──────┴──────┴──────┘

clone() → Self[source]

Very cheap deepcopy/clone.

See also

clear: Create an empty copy of the current LazyFrame, with identical schema but no data.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [None, 2, 3, 4],
...         "b": [0.5, None, 2.5, 13],
...         "c": [True, True, False, None],
...     }
... )
>>> lf.clone()  
<LazyFrame [3 cols, {"a": Int64 … "c": Boolean}] at ...>

collect( *, type_coercion: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, no_optimization: bool = False, slice_pushdown: bool = True, comm_subplan_elim: bool = True, comm_subexpr_elim: bool = True, streaming: bool = False, ) → DataFrame[source]

Collect into a DataFrame.

Note: use fetch() if you want to run your query on the first n rows only. This can be a huge time saver in debugging queries.

Parameters:

type_coercion: Do type coercion optimization.
predicate_pushdown: Do predicate pushdown optimization.
projection_pushdown: Do projection pushdown optimization.
simplify_expression: Run simplify expressions optimization.
no_optimization: Turn off (certain) optimizations.
slice_pushdown: Slice pushdown optimization.
comm_subplan_elim: Will try to cache branching subplans that occur on self-joins or unions.
comm_subexpr_elim: Common subexpressions will be cached and reused.
streaming: Run parts of the query in a streaming fashion (this is in an alpha state)

Returns:

DataFrame

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.groupby("a", maintain_order=True).agg(pl.all().sum()).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 4   ┆ 10  │
│ b   ┆ 11  ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘

property columns: list[str][source]

Get column names.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... ).select(["foo", "bar"])
>>> lf.columns
['foo', 'bar']

classmethod deserialize(source: str | Path | IOBase) → Self[source]

Read a logical plan from a JSON file to construct a LazyFrame.

Parameters:

source: Path to a file or a file-like object (by file-like object, we refer to objects that have a read() method, such as a file handler (e.g. via builtin open function) or BytesIO).

See also

LazyFrame.serialize

Examples

>>> import io
>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum()
>>> json = lf.serialize()
>>> pl.LazyFrame.deserialize(io.StringIO(json)).collect()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 6   │
└─────┘

drop( columns: ColumnNameOrSelector | Collection[ColumnNameOrSelector], *more_columns: ColumnNameOrSelector, ) → Self[source]

Remove columns from the dataframe.

Parameters:

columns: Name of the column(s) that should be removed from the dataframe.
*more_columns: Additional columns to drop, specified as positional arguments.

Examples

Drop a single column by passing the name of that column.

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.drop("ham").collect()
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 6.0 │
│ 2   ┆ 7.0 │
│ 3   ┆ 8.0 │
└─────┴─────┘

Drop multiple columns by passing a selector.

>>> import polars.selectors as cs
>>> lf.drop(cs.numeric()).collect()
shape: (3, 1)
┌─────┐
│ ham │
│ --- │
│ str │
╞═════╡
│ a   │
│ b   │
│ c   │
└─────┘

Use positional arguments to drop multiple columns.

>>> lf.drop("foo", "ham").collect()
shape: (3, 1)
┌─────┐
│ bar │
│ --- │
│ f64 │
╞═════╡
│ 6.0 │
│ 7.0 │
│ 8.0 │
└─────┘

drop_nulls( subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None, ) → Self[source]

Drop all rows that contain null values.

Returns a new LazyFrame.

Parameters:

subset: Column name(s) for which null values are considered. If set to None (default), use all columns.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, None, 8],
...         "ham": ["a", "b", None],
...     }
... )

The default behavior of this method is to drop rows where any single value of the row is null.

>>> lf.drop_nulls().collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘

This behaviour can be constrained to consider only a subset of columns, as defined by name or with a selector. For example, dropping rows if there is a null in any of the integer columns:

>>> import polars.selectors as cs
>>> lf.drop_nulls(subset=cs.integer()).collect()
shape: (2, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ ham  │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ i64 ┆ str  │
╞═════╪═════╪══════╡
│ 1   ┆ 6   ┆ a    │
│ 3   ┆ 8   ┆ null │
└─────┴─────┴──────┘

This method drops a row if any single value of the row is null.

Below are some example snippets that show how you could drop null values based on other conditions:

>>> lf = pl.LazyFrame(
...     {
...         "a": [None, None, None, None],
...         "b": [1, 2, None, 1],
...         "c": [1, None, None, 1],
...     }
... )
>>> lf.collect()
shape: (4, 3)
┌──────┬──────┬──────┐
│ a    ┆ b    ┆ c    │
│ ---  ┆ ---  ┆ ---  │
│ f32  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╡
│ null ┆ 1    ┆ 1    │
│ null ┆ 2    ┆ null │
│ null ┆ null ┆ null │
│ null ┆ 1    ┆ 1    │
└──────┴──────┴──────┘

Drop a row only if all values are null:

>>> lf.filter(~pl.all_horizontal(pl.all().is_null())).collect()
shape: (3, 3)
┌──────┬─────┬──────┐
│ a    ┆ b   ┆ c    │
│ ---  ┆ --- ┆ ---  │
│ f32  ┆ i64 ┆ i64  │
╞══════╪═════╪══════╡
│ null ┆ 1   ┆ 1    │
│ null ┆ 2   ┆ null │
│ null ┆ 1   ┆ 1    │
└──────┴─────┴──────┘

property dtypes: list[PolarsDataType][source]

Get dtypes of columns in LazyFrame.

See also

schema: Returns a {colname:dtype} mapping.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.dtypes
[Int64, Float64, Utf8]

explain( *, optimized: bool = True, type_coercion: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, slice_pushdown: bool = True, comm_subplan_elim: bool = True, comm_subexpr_elim: bool = True, streaming: bool = False, ) → str[source]

Create a string representation of the query plan.

Different optimizations can be turned on or off.

Parameters:

optimized: Return an optimized query plan. Defaults to True. If this is set to True the subsequent optimization flags control which optimizations run.
type_coercion: Do type coercion optimization.
predicate_pushdown: Do predicate pushdown optimization.
projection_pushdown: Do projection pushdown optimization.
simplify_expression: Run simplify expressions optimization.
slice_pushdown: Slice pushdown optimization.
comm_subplan_elim: Will try to cache branching subplans that occur on self-joins or unions.
comm_subexpr_elim: Common subexpressions will be cached and reused.
streaming: Run parts of the query in a streaming fashion (this is in an alpha state)

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.groupby("a", maintain_order=True).agg(pl.all().sum()).sort(
...     "a"
... ).explain()  

explode( columns: str | Expr | Sequence[str | Expr], *more_columns: str | Expr, ) → Self[source]

Explode the dataframe to long format by exploding the given columns.

Parameters:

columns: Column names, expressions, or a selector defining them. The underlying columns being exploded must be of List or Utf8 datatype.
*more_columns: Additional names of columns to explode, specified as positional arguments.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "letters": ["a", "a", "b", "c"],
...         "numbers": [[1], [2, 3], [4, 5], [6, 7, 8]],
...     }
... )
>>> lf.explode("numbers").collect()
shape: (8, 2)
┌─────────┬─────────┐
│ letters ┆ numbers │
│ ---     ┆ ---     │
│ str     ┆ i64     │
╞═════════╪═════════╡
│ a       ┆ 1       │
│ a       ┆ 2       │
│ a       ┆ 3       │
│ b       ┆ 4       │
│ b       ┆ 5       │
│ c       ┆ 6       │
│ c       ┆ 7       │
│ c       ┆ 8       │
└─────────┴─────────┘

fetch( n_rows: int = 500, *, type_coercion: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, no_optimization: bool = False, slice_pushdown: bool = True, comm_subplan_elim: bool = True, comm_subexpr_elim: bool = True, streaming: bool = False, ) → DataFrame[source]

Collect a small number of rows for debugging purposes.

Fetch is like a collect() operation, but it overwrites the number of rows read by every scan operation. This is a utility that helps debug a query on a smaller number of rows.

Note that the fetch does not guarantee the final number of rows in the DataFrame. Filter, join operations and a lower number of rows available in the scanned file influence the final number of rows.
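
The caveat above can be illustrated with a plain-Python model, under the assumption that the query is a single scan followed by a filter. fetch_model and its arguments are hypothetical names and not part of the polars API.

```python
def fetch_model(source_rows, n_rows, predicate):
    """Model of the caveat: cap the scan at n_rows, then run the rest of the query."""
    scanned = source_rows[:n_rows]  # the scan reads at most n_rows
    return [row for row in scanned if predicate(row)]


source = list(range(10))  # stand-in for the rows of a scanned file


def is_even(x):
    return x % 2 == 0


result = fetch_model(source, n_rows=4, predicate=is_even)
print(result)  # [0, 2]: only two of the four scanned rows pass the filter
```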

Parameters:

n_rows: Collect n_rows from the data sources.
type_coercion: Run type coercion optimization.
predicate_pushdown: Run predicate pushdown optimization.
projection_pushdown: Run projection pushdown optimization.
simplify_expression: Run simplify expressions optimization.
no_optimization: Turn off optimizations.
slice_pushdown: Slice pushdown optimization
comm_subplan_elim: Will try to cache branching subplans that occur on self-joins or unions.
comm_subexpr_elim: Common subexpressions will be cached and reused.
streaming: Run parts of the query in a streaming fashion (this is in an alpha state)

Returns:

DataFrame

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.groupby("a", maintain_order=True).agg(pl.all().sum()).fetch(2)
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 1   ┆ 6   │
│ b   ┆ 2   ┆ 5   │
└─────┴─────┴─────┘

fill_nan(value: int | float | Expr | None) → Self[source]

Fill floating point NaN values.

Parameters:

value: Value to fill the NaN values with.

Warning

Note that floating point NaN (Not a Number) are not missing values! To replace missing values, use fill_null() instead.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1.5, 2, float("NaN"), 4],
...         "b": [0.5, 4, float("NaN"), 13],
...     }
... )
>>> lf.fill_nan(99).collect()
shape: (4, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ f64  ┆ f64  │
╞══════╪══════╡
│ 1.5  ┆ 0.5  │
│ 2.0  ┆ 4.0  │
│ 99.0 ┆ 99.0 │
│ 4.0  ┆ 13.0 │
└──────┴──────┘

fill_null( value: Any | None = None, strategy: FillNullStrategy | None = None, limit: int | None = None, *, matches_supertype: bool = True, ) → Self[source]

Fill null values using the specified value or strategy.

Parameters:

value: Value used to fill null values.
strategy: {None, ‘forward’, ‘backward’, ‘min’, ‘max’, ‘mean’, ‘zero’, ‘one’}. Strategy used to fill null values.
limit: Number of consecutive null values to fill when using the ‘forward’ or ‘backward’ strategy.
matches_supertype: Fill all matching supertypes of the fill value literal.
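
The interaction between the ‘forward’ strategy and limit can be modelled in plain Python. This is a sketch of the described behaviour and not polars' implementation. forward_fill is a hypothetical helper that works on a list, with None standing in for null.

```python
def forward_fill(values, limit=None):
    """Carry the last non-null value forward, filling at most `limit`
    consecutive nulls (no cap when limit is None)."""
    out, last, run = [], None, 0
    for v in values:
        if v is not None:
            last, run = v, 0  # a real value resets the run of nulls
            out.append(v)
        elif last is not None and (limit is None or run < limit):
            run += 1
            out.append(last)  # still within the limit: fill
        else:
            run += 1
            out.append(None)  # nothing to carry, or limit exhausted
    return out


print(forward_fill([1, None, None, None, 5]))           # [1, 1, 1, 1, 5]
print(forward_fill([1, None, None, None, 5], limit=2))  # [1, 1, 1, None, 5]
```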

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, None, 4],
...         "b": [0.5, 4, None, 13],
...     }
... )
>>> lf.fill_null(99).collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 99  ┆ 99.0 │
│ 4   ┆ 13.0 │
└─────┴──────┘

>>> lf.fill_null(strategy="forward").collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 2   ┆ 4.0  │
│ 4   ┆ 13.0 │
└─────┴──────┘

>>> lf.fill_null(strategy="max").collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 4   ┆ 13.0 │
│ 4   ┆ 13.0 │
└─────┴──────┘

>>> lf.fill_null(strategy="zero").collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 0   ┆ 0.0  │
│ 4   ┆ 13.0 │
└─────┴──────┘

filter(predicate: IntoExpr) → Self[source]

Filter the rows in the LazyFrame based on a predicate expression.

Parameters:

predicate: Expression that evaluates to a boolean Series.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )

Filter on one condition:

>>> lf.filter(pl.col("foo") < 3).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
│ 2   ┆ 7   ┆ b   │
└─────┴─────┴─────┘

Filter on multiple conditions:

>>> lf.filter((pl.col("foo") < 3) & (pl.col("ham") == "a")).collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘

Filter on an OR condition:

>>> lf.filter((pl.col("foo") == 1) | (pl.col("ham") == "c")).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
│ 3   ┆ 8   ┆ c   │
└─────┴─────┴─────┘

first() → Self[source]

Get the first row of the DataFrame.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> lf.first().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘

classmethod from_json(json: str) → Self[source]

Read a logical plan from a JSON string to construct a LazyFrame.

Deprecated since version 0.18.12: This method is deprecated. Convert the JSON string to StringIO and then use LazyFrame.deserialize.

Parameters:

json: String in JSON format.

See also

deserialize

groupby( by: IntoExpr | Iterable[IntoExpr], *more_by: IntoExpr, maintain_order: bool = False, ) → LazyGroupBy[source]

Start a groupby operation.

Parameters:

by: Column(s) to group by. Accepts expression input. Strings are parsed as column names.
*more_by: Additional columns to group by, specified as positional arguments.
maintain_order: Ensure that the order of the groups is consistent with the input data. This is slower than a default groupby. Setting this to True blocks the possibility to run on the streaming engine.

Examples

Group by one column and call agg to compute the grouped sum of another column.

>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "c"],
...         "b": [1, 2, 1, 3, 3],
...         "c": [5, 4, 3, 2, 1],
...     }
... )
>>> lf.groupby("a").agg(pl.col("b").sum()).collect()  
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 2   │
│ b   ┆ 5   │
│ c   ┆ 3   │
└─────┴─────┘

Set maintain_order=True to ensure the order of the groups is consistent with the input.

>>> lf.groupby("a", maintain_order=True).agg(pl.col("c")).collect()
shape: (3, 2)
┌─────┬───────────┐
│ a   ┆ c         │
│ --- ┆ ---       │
│ str ┆ list[i64] │
╞═════╪═══════════╡
│ a   ┆ [5, 3]    │
│ b   ┆ [4, 2]    │
│ c   ┆ [1]       │
└─────┴───────────┘

Group by multiple columns by passing a list of column names.

>>> lf.groupby(["a", "b"]).agg(pl.max("c")).collect()  
shape: (4, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 1   ┆ 5   │
│ b   ┆ 2   ┆ 4   │
│ b   ┆ 3   ┆ 2   │
│ c   ┆ 3   ┆ 1   │
└─────┴─────┴─────┘

Or use positional arguments to group by multiple columns in the same way. Expressions are also accepted.

>>> lf.groupby("a", pl.col("b") // 2).agg(
...     pl.col("c").mean()
... ).collect()  
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞═════╪═════╪═════╡
│ a   ┆ 0   ┆ 4.0 │
│ b   ┆ 1   ┆ 3.0 │
│ c   ┆ 1   ┆ 1.0 │
└─────┴─────┴─────┘

groupby_dynamic( index_column: IntoExpr, *, every: str | timedelta, period: str | timedelta | None = None, offset: str | timedelta | None = None, truncate: bool = True, include_boundaries: bool = False, closed: ClosedInterval = 'left', by: IntoExpr | Iterable[IntoExpr] | None = None, start_by: StartBy = 'window', check_sorted: bool = True, ) → LazyGroupBy[source]

Group based on a time value (or index value of type Int32, Int64).

Time windows are calculated and rows are assigned to windows. Unlike a normal groupby, a row can be a member of multiple groups. The time/index window can be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.

A window is defined by:

every: interval of the window
period: length of the window
offset: offset of the window

The every, period and offset arguments are created with the following string language:

1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)

Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds
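
To make the string language concrete, here is a small parser sketch that splits a duration string into (count, unit) pairs. It is a plain-Python illustration and not the parser polars uses. parse_duration is a hypothetical name, and the sketch does not handle a leading minus sign or unit suffixes.

```python
import re

# Two-letter units come first so that "mo" and "ms" are not cut short to "m".
UNITS = ["ns", "us", "ms", "mo", "s", "m", "h", "d", "w", "q", "y", "i"]
TOKEN = re.compile(r"(\d+)(" + "|".join(UNITS) + r")")


def parse_duration(text):
    """Split a duration string such as '3d12h4m25s' into (count, unit) pairs."""
    pairs, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unrecognised duration at position {pos}: {text!r}")
        pairs.append((int(match.group(1)), match.group(2)))
        pos = match.end()
    return pairs


print(parse_duration("3d12h4m25s"))  # [(3, 'd'), (12, 'h'), (4, 'm'), (25, 's')]
print(parse_duration("1mo"))         # [(1, 'mo')]
```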

Suffix with “_saturating” to indicate that dates too large for their month should saturate at the largest date (e.g. 2022-02-29 -> 2022-02-28) instead of erroring.
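
The saturating behaviour described above can be sketched with the standard library. add_months_saturating is a hypothetical helper that models the clamping rule only. It is not how polars computes offsets.

```python
import calendar
from datetime import date


def add_months_saturating(d, months):
    """Add calendar months, clamping the day to the last valid day of the target month."""
    index = d.year * 12 + (d.month - 1) + months  # months counted from year 0
    year, month0 = divmod(index, 12)              # month0 is zero-based
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


# 2022-01-29 plus one month would be the invalid date 2022-02-29; it saturates.
print(add_months_saturating(date(2022, 1, 29), 1))  # 2022-02-28
```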

By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.

In case of a groupby_dynamic on an integer column, the windows are defined by:

“1i” # length 1
“10i” # length 10
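
For an integer index, the roles of every and period can be sketched as follows. This is a simplified model and not polars' algorithm: it assumes left-closed windows whose starts lie at offset + k * every, and it defaults offset to 0 for readability. windows_containing is a hypothetical helper.

```python
def windows_containing(value, every, period=None, offset=0):
    """Starts of the left-closed windows [start, start + period) that contain `value`,
    with window starts placed at offset + k * every for integer k."""
    if period is None:
        period = every  # a missing period falls back to `every`
    # Largest window start that is <= value.
    start = offset + ((value - offset) // every) * every
    starts = []
    # Walk backwards while the window still covers the value.
    while start + period > value:
        starts.append(start)
        start -= every
    return sorted(starts)


print(windows_containing(7, every=5))             # [5]
print(windows_containing(7, every=5, period=10))  # [0, 5]
```

When period is larger than every, consecutive windows overlap, which is how a single row ends up in more than one group.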

Warning

The index column must be sorted in ascending order. If by is passed, then the index column must be sorted in ascending order within each group.

Parameters:

index_column

Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if by is specified, then it must be sorted in ascending order within each group).

In case of a dynamic groupby on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.

every

interval of the window

period

length of the window; if None, it is equal to ‘every’

offset

offset of the window; if None and period is None, it will be equal to negative every

truncate

truncate the time value to the window lower bound

include_boundaries

Add the lower and upper bound of the window to the “_lower_bound” and “_upper_bound” columns. This will impact performance because it’s harder to parallelize.

closed{‘right’, ‘left’, ‘both’, ‘none’}

Define which sides of the temporal interval are closed (inclusive).

by

Also group by this column/these columns

start_by{‘window’, ‘datapoint’, ‘monday’, ‘tuesday’, ‘wednesday’, ‘thursday’, ‘friday’, ‘saturday’, ‘sunday’}

The strategy to determine the start of the first window by.

‘window’: Truncate the start of the window with the ‘every’ argument. Note that weekly windows start on Monday.
‘datapoint’: Start from the first encountered data point.
a day of the week (only takes effect if every contains 'w'):
- ‘monday’: Start the window on the Monday before the first data point.
- ‘tuesday’: Start the window on the Tuesday before the first data point.
- …
- ‘sunday’: Start the window on the Sunday before the first data point.

check_sorted

When the by argument is given, polars cannot check sortedness from the metadata and has to do a full scan on the index column to verify that the data is sorted. This is expensive. If you are sure the data within the by groups is sorted, you can set this to False. Doing so incorrectly will lead to incorrect output.

Returns:

LazyGroupBy: Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if by columns are passed, it will only be sorted within each by group).

See also

groupby_rolling

Notes

If you’re coming from pandas, then

# polars
df.groupby_dynamic("ts", every="1d").agg(pl.col("value").sum())

is equivalent to

# pandas
df.set_index("ts").resample("D")["value"].sum().reset_index()

though note that, unlike pandas, polars doesn’t add extra rows for empty windows. If you need index_column to be evenly spaced, then please combine with DataFrame.upsample().

Examples

>>> from datetime import datetime
>>> # create an example dataframe
>>> lf = pl.LazyFrame(
...     {
...         "time": pl.date_range(
...             start=datetime(2021, 12, 16),
...             end=datetime(2021, 12, 16, 3),
...             interval="30m",
...             eager=True,
...         ),
...         "n": range(7),
...     }
... )
>>> lf.collect()
shape: (7, 2)
┌─────────────────────┬─────┐
│ time                ┆ n   │
│ ---                 ┆ --- │
│ datetime[μs]        ┆ i64 │
╞═════════════════════╪═════╡
│ 2021-12-16 00:00:00 ┆ 0   │
│ 2021-12-16 00:30:00 ┆ 1   │
│ 2021-12-16 01:00:00 ┆ 2   │
│ 2021-12-16 01:30:00 ┆ 3   │
│ 2021-12-16 02:00:00 ┆ 4   │
│ 2021-12-16 02:30:00 ┆ 5   │
│ 2021-12-16 03:00:00 ┆ 6   │
└─────────────────────┴─────┘

Group by windows of 1 hour starting at 2021-12-16 00:00:00.

>>> lf.groupby_dynamic("time", every="1h", closed="right").agg(
...     [
...         pl.col("time").min().alias("time_min"),
...         pl.col("time").max().alias("time_max"),
...     ]
... ).collect()
shape: (4, 3)
┌─────────────────────┬─────────────────────┬─────────────────────┐
│ time                ┆ time_min            ┆ time_max            │
│ ---                 ┆ ---                 ┆ ---                 │
│ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        │
╞═════════════════════╪═════════════════════╪═════════════════════╡
│ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 00:00:00 │
│ 2021-12-16 00:00:00 ┆ 2021-12-16 00:30:00 ┆ 2021-12-16 01:00:00 │
│ 2021-12-16 01:00:00 ┆ 2021-12-16 01:30:00 ┆ 2021-12-16 02:00:00 │
│ 2021-12-16 02:00:00 ┆ 2021-12-16 02:30:00 ┆ 2021-12-16 03:00:00 │
└─────────────────────┴─────────────────────┴─────────────────────┘

The window boundaries can also be added to the aggregation result

>>> lf.groupby_dynamic(
...     "time", every="1h", include_boundaries=True, closed="right"
... ).agg([pl.col("time").count().alias("time_count")]).collect()
shape: (4, 4)
┌─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
│ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
│ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
│ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
╞═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
│ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
│ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 2          │
│ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
│ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
└─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

When closed="left", the right end of the interval is excluded: [lower_bound, upper_bound).

>>> lf.groupby_dynamic("time", every="1h", closed="left").agg(
...     [
...         pl.col("time").count().alias("time_count"),
...         pl.col("time").alias("time_agg_list"),
...     ]
... ).collect()
shape: (4, 3)
┌─────────────────────┬────────────┬───────────────────────────────────┐
│ time                ┆ time_count ┆ time_agg_list                     │
│ ---                 ┆ ---        ┆ ---                               │
│ datetime[μs]        ┆ u32        ┆ list[datetime[μs]]                │
╞═════════════════════╪════════════╪═══════════════════════════════════╡
│ 2021-12-16 00:00:00 ┆ 2          ┆ [2021-12-16 00:00:00, 2021-12-16… │
│ 2021-12-16 01:00:00 ┆ 2          ┆ [2021-12-16 01:00:00, 2021-12-16… │
│ 2021-12-16 02:00:00 ┆ 2          ┆ [2021-12-16 02:00:00, 2021-12-16… │
│ 2021-12-16 03:00:00 ┆ 1          ┆ [2021-12-16 03:00:00]             │
└─────────────────────┴────────────┴───────────────────────────────────┘

When closed="both", the time values at the window boundaries belong to two groups.

>>> lf.groupby_dynamic("time", every="1h", closed="both").agg(
...     pl.col("time").count().alias("time_count")
... ).collect()
shape: (5, 2)
┌─────────────────────┬────────────┐
│ time                ┆ time_count │
│ ---                 ┆ ---        │
│ datetime[μs]        ┆ u32        │
╞═════════════════════╪════════════╡
│ 2021-12-15 23:00:00 ┆ 1          │
│ 2021-12-16 00:00:00 ┆ 3          │
│ 2021-12-16 01:00:00 ┆ 3          │
│ 2021-12-16 02:00:00 ┆ 3          │
│ 2021-12-16 03:00:00 ┆ 1          │
└─────────────────────┴────────────┘
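The counts in the closed="right", closed="left" and closed="both" examples above can be reproduced with a small pure-Python model of the windowing rule. This is only a sketch for building intuition, not how polars implements it. The helper name dynamic_window_counts is hypothetical, and the model assumes that the first window starts one every before the first timestamp (the default offset of negative every) and that the first timestamp is already aligned to the window size.

```python
from datetime import datetime, timedelta


def dynamic_window_counts(times, every, closed):
    """Count points per window of length `every`; empty windows are dropped."""
    # Assumption: the first window starts one `every` before the first
    # timestamp, which is already aligned to the hour in this data.
    lower = times[0] - every
    counts = {}
    while lower <= times[-1]:
        upper = lower + every
        if closed == "right":
            members = [t for t in times if lower < t <= upper]
        elif closed == "left":
            members = [t for t in times if lower <= t < upper]
        elif closed == "both":
            members = [t for t in times if lower <= t <= upper]
        else:  # "none"
            members = [t for t in times if lower < t < upper]
        if members:  # like the examples above, skip windows without rows
            counts[lower] = len(members)
        lower = upper
    return counts


start = datetime(2021, 12, 16)
times = [start + timedelta(minutes=30 * i) for i in range(7)]
hour = timedelta(hours=1)

right = dynamic_window_counts(times, hour, "right")
left = dynamic_window_counts(times, hour, "left")
both = dynamic_window_counts(times, hour, "both")
print(list(right.values()), list(left.values()), list(both.values()))
```

With closed="both", each timestamp that sits exactly on a boundary is counted in two adjacent windows, which is why the counts sum to more than the seven input rows.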

Dynamic groupbys can also be combined with grouping on normal keys

>>> lf = pl.LazyFrame(
...     {
...         "time": pl.date_range(
...             start=datetime(2021, 12, 16),
...             end=datetime(2021, 12, 16, 3),
...             interval="30m",
...             eager=True,
...         ),
...         "groups": ["a", "a", "a", "b", "b", "a", "a"],
...     }
... )
>>> lf.collect()
shape: (7, 2)
┌─────────────────────┬────────┐
│ time                ┆ groups │
│ ---                 ┆ ---    │
│ datetime[μs]        ┆ str    │
╞═════════════════════╪════════╡
│ 2021-12-16 00:00:00 ┆ a      │
│ 2021-12-16 00:30:00 ┆ a      │
│ 2021-12-16 01:00:00 ┆ a      │
│ 2021-12-16 01:30:00 ┆ b      │
│ 2021-12-16 02:00:00 ┆ b      │
│ 2021-12-16 02:30:00 ┆ a      │
│ 2021-12-16 03:00:00 ┆ a      │
└─────────────────────┴────────┘
>>> (
...     lf.groupby_dynamic(
...         "time",
...         every="1h",
...         closed="both",
...         by="groups",
...         include_boundaries=True,
...     )
... ).agg([pl.col("time").count().alias("time_count")]).collect()
shape: (7, 5)
┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
│ groups ┆ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
│ ---    ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
│ str    ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
│ a      ┆ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
│ a      ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 3          │
│ a      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 1          │
│ a      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
│ a      ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ 1          │
│ b      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
│ b      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 1          │
└────────┴─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

Dynamic groupby on an index column

>>> lf = pl.LazyFrame(
...     {
...         "idx": pl.int_range(0, 6, eager=True),
...         "A": ["A", "A", "B", "B", "B", "C"],
...     }
... )
>>> lf.groupby_dynamic(
...     "idx",
...     every="2i",
...     period="3i",
...     include_boundaries=True,
...     closed="right",
... ).agg(pl.col("A").alias("A_agg_list")).collect()
shape: (3, 4)
┌─────────────────┬─────────────────┬─────┬─────────────────┐
│ _lower_boundary ┆ _upper_boundary ┆ idx ┆ A_agg_list      │
│ ---             ┆ ---             ┆ --- ┆ ---             │
│ i64             ┆ i64             ┆ i64 ┆ list[str]       │
╞═════════════════╪═════════════════╪═════╪═════════════════╡
│ 0               ┆ 3               ┆ 0   ┆ ["A", "B", "B"] │
│ 2               ┆ 5               ┆ 2   ┆ ["B", "B", "C"] │
│ 4               ┆ 7               ┆ 4   ┆ ["C"]           │
└─────────────────┴─────────────────┴─────┴─────────────────┘

Create rolling groups based on a time, Int32, or Int64 column.

Unlike groupby_dynamic, the windows are determined by the individual values and are not of constant intervals. For constant intervals, use groupby_dynamic.

If you have a time series <t_0, t_1, ..., t_n>, then by default the windows created will be

(t_0 - period, t_0]

(t_1 - period, t_1]

…

(t_n - period, t_n]

The period and offset arguments are created either from a timedelta, or by using the following string language:

1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)

Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds

Suffix with “_saturating” to indicate that dates too large for their month should saturate at the largest date (e.g. 2022-02-29 -> 2022-02-28) instead of erroring.

In case of a groupby_rolling on an integer column, the windows are defined by:

“1i” # length 1
“10i” # length 10

Parameters:

index_column

In case of a rolling groupby on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.

period

length of the window - must be non-negative

offset

offset of the window. Default is -period

closed{‘right’, ‘left’, ‘both’, ‘none’}

Define which sides of the temporal interval are closed (inclusive).

by

Also group by this column/these columns

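check_sorted

When the by argument is given, polars cannot check sortedness from the metadata and has to do a full scan on the index column to verify that the data is sorted. This is expensive. If you are sure the data within the by groups is sorted, you can set this to False. Doing so incorrectly will lead to incorrect output.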

Returns:

LazyGroupBy: Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if by columns are passed, it will only be sorted within each by group).

See also

groupby_dynamic

Examples

>>> dates = [
...     "2020-01-01 13:45:48",
...     "2020-01-01 16:42:13",
...     "2020-01-01 16:45:09",
...     "2020-01-02 18:12:48",
...     "2020-01-03 19:45:32",
...     "2020-01-08 23:16:43",
... ]
>>> df = pl.LazyFrame({"dt": dates, "a": [3, 7, 5, 9, 2, 1]}).with_columns(
...     pl.col("dt").str.strptime(pl.Datetime).set_sorted()
... )
>>> out = (
...     df.groupby_rolling(index_column="dt", period="2d")
...     .agg(
...         [
...             pl.sum("a").alias("sum_a"),
...             pl.min("a").alias("min_a"),
...             pl.max("a").alias("max_a"),
...         ]
...     )
...     .collect()
... )
>>> assert out["sum_a"].to_list() == [3, 10, 15, 24, 11, 1]
>>> assert out["max_a"].to_list() == [3, 7, 7, 9, 9, 1]
>>> assert out["min_a"].to_list() == [3, 3, 3, 3, 2, 1]
>>> out
shape: (6, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ dt                  ┆ sum_a ┆ min_a ┆ max_a │
│ ---                 ┆ ---   ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ i64   ┆ i64   ┆ i64   │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 13:45:48 ┆ 3     ┆ 3     ┆ 3     │
│ 2020-01-01 16:42:13 ┆ 10    ┆ 3     ┆ 7     │
│ 2020-01-01 16:45:09 ┆ 15    ┆ 3     ┆ 7     │
│ 2020-01-02 18:12:48 ┆ 24    ┆ 3     ┆ 9     │
│ 2020-01-03 19:45:32 ┆ 11    ┆ 2     ┆ 9     │
│ 2020-01-08 23:16:43 ┆ 1     ┆ 1     ┆ 1     │
└─────────────────────┴───────┴───────┴───────┘
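The default window rule (t_i - period, t_i] can be checked by hand against the sum_a column above. The sketch below is a plain-Python model rather than the polars implementation. The helper name rolling_sums is hypothetical, and “2d” is approximated as a fixed 48-hour timedelta, which ignores any calendar or daylight-saving effects.

```python
from datetime import datetime, timedelta


def rolling_sums(times, values, period):
    """Sum `values` over the window (t - period, t] for every timestamp t."""
    sums = []
    for t in times:
        lower = t - period
        # The lower bound is exclusive and the upper bound is inclusive.
        sums.append(sum(v for u, v in zip(times, values) if lower < u <= t))
    return sums


dates = [
    "2020-01-01 13:45:48",
    "2020-01-01 16:42:13",
    "2020-01-01 16:45:09",
    "2020-01-02 18:12:48",
    "2020-01-03 19:45:32",
    "2020-01-08 23:16:43",
]
times = [datetime.strptime(d, "%Y-%m-%d %H:%M:%S") for d in dates]
sums = rolling_sums(times, [3, 7, 5, 9, 2, 1], timedelta(days=2))
print(sums)
```

Each row gets its own window that ends at that row's timestamp, so the number of output rows always equals the number of input rows.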

head(n: int = 5) → Self[source]

Get the first n rows.

Parameters:

n: Number of rows to return.

Notes

Consider using the fetch() operation if you only want to test your query. The fetch() operation will load the first n rows at the scan level, whereas the head()/limit() are applied at the end.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [7, 8, 9, 10, 11, 12],
...     }
... )
>>> lf.head().collect()
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
│ 3   ┆ 9   │
│ 4   ┆ 10  │
│ 5   ┆ 11  │
└─────┴─────┘
>>> lf.head(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
└─────┴─────┘

inspect(fmt: str = '{}') → Self[source]

Inspect a node in the computation graph.

Print the value that this node in the computation graph evaluates to, and pass the value on unchanged.

Examples

>>> lf = pl.LazyFrame({"foo": [1, 1, -2, 3]})
>>> (
...     lf.select(pl.col("foo").cumsum().alias("bar"))
...     .inspect()  # print the node before the filter
...     .filter(pl.col("bar") == pl.col("foo"))
... )  
<LazyFrame [1 col, {"bar": Int64}] at ...>

interpolate() → Self[source]

Interpolate intermediate values. The interpolation method is linear.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, None, 9, 10],
...         "bar": [6, 7, 9, None],
...         "baz": [1, None, None, 9],
...     }
... )
>>> lf.interpolate().collect()
shape: (4, 3)
┌─────┬──────┬─────┐
│ foo ┆ bar  ┆ baz │
│ --- ┆ ---  ┆ --- │
│ i64 ┆ i64  ┆ i64 │
╞═════╪══════╪═════╡
│ 1   ┆ 6    ┆ 1   │
│ 5   ┆ 7    ┆ 3   │
│ 9   ┆ 9    ┆ 6   │
│ 10  ┆ null ┆ 9   │
└─────┴──────┴─────┘
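As a rough illustration of linear interpolation, the sketch below fills gaps between known values in a plain Python list. The helper name interpolate_linear is hypothetical and this is not the polars implementation. It works in floats, whereas the output above keeps the i64 dtype, so the 3 and 6 shown for baz correspond to roughly 3.67 and 6.33 here.

```python
def interpolate_linear(values):
    """Fill None entries that lie between two known values on a straight line."""
    out = [None if v is None else float(v) for v in values]
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk over consecutive pairs of known positions and fill the gap between them.
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    # Leading and trailing None values have only one neighbour and stay None.
    return out


foo = interpolate_linear([1, None, 9, 10])
bar = interpolate_linear([6, 7, 9, None])
baz = interpolate_linear([1, None, None, 9])
print(foo, bar, baz)
```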

Add a join operation to the Logical Plan.

Parameters:

other

Lazy DataFrame to join with.

on

Join column of both DataFrames. If set, left_on and right_on should be None.

how{‘inner’, ‘left’, ‘outer’, ‘semi’, ‘anti’, ‘cross’}

Join strategy.

left_on

Join column of the left DataFrame.

right_on

Join column of the right DataFrame.

suffix

Suffix to append to columns with a duplicate name.

validate: {‘m:m’, ‘m:1’, ‘1:m’, ‘1:1’}

Checks if join is of specified type.

- “m:m” (many_to_many): default, does not result in checks
- “1:1” (one_to_one): check if join keys are unique in both left and right datasets
- “1:m” (one_to_many): check if join keys are unique in left dataset
- “m:1” (many_to_one): check if join keys are unique in right dataset

Note

This is currently not supported by the streaming engine.
This is only supported when joined by single columns.

allow_parallel

Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.

force_parallel

Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

See also

join_asof

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> other_lf = pl.LazyFrame(
...     {
...         "apple": ["x", "y", "z"],
...         "ham": ["a", "b", "d"],
...     }
... )
>>> lf.join(other_lf, on="ham").collect()
shape: (2, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     │
│ 2   ┆ 7.0 ┆ b   ┆ y     │
└─────┴─────┴─────┴───────┘
>>> lf.join(other_lf, on="ham", how="outer").collect()
shape: (4, 4)
┌──────┬──────┬─────┬───────┐
│ foo  ┆ bar  ┆ ham ┆ apple │
│ ---  ┆ ---  ┆ --- ┆ ---   │
│ i64  ┆ f64  ┆ str ┆ str   │
╞══════╪══════╪═════╪═══════╡
│ 1    ┆ 6.0  ┆ a   ┆ x     │
│ 2    ┆ 7.0  ┆ b   ┆ y     │
│ null ┆ null ┆ d   ┆ z     │
│ 3    ┆ 8.0  ┆ c   ┆ null  │
└──────┴──────┴─────┴───────┘
>>> lf.join(other_lf, on="ham", how="left").collect()
shape: (3, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     │
│ 2   ┆ 7.0 ┆ b   ┆ y     │
│ 3   ┆ 8.0 ┆ c   ┆ null  │
└─────┴─────┴─────┴───────┘
>>> lf.join(other_lf, on="ham", how="semi").collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6.0 ┆ a   │
│ 2   ┆ 7.0 ┆ b   │
└─────┴─────┴─────┘
>>> lf.join(other_lf, on="ham", how="anti").collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8.0 ┆ c   │
└─────┴─────┴─────┘
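The uniqueness checks described for the validate parameter can be pictured with the following sketch. It is a plain-Python model under stated assumptions: the helper name check_join_keys is hypothetical, and ValueError is only a placeholder for whatever error polars actually raises.

```python
from collections import Counter


def check_join_keys(left_keys, right_keys, validate="m:m"):
    """Raise ValueError if the key uniqueness implied by `validate` is violated."""

    def is_unique(keys):
        return all(count == 1 for count in Counter(keys).values())

    # "1:1" and "1:m" require unique keys on the left side.
    if validate in ("1:1", "1:m") and not is_unique(left_keys):
        raise ValueError("join keys are not unique in the left dataset")
    # "1:1" and "m:1" require unique keys on the right side.
    if validate in ("1:1", "m:1") and not is_unique(right_keys):
        raise ValueError("join keys are not unique in the right dataset")
    return True  # "m:m" performs no checks


# The "ham" columns of lf and other_lf above are unique on both sides.
print(check_join_keys(["a", "b", "c"], ["a", "b", "d"], validate="1:1"))
```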

Perform an asof join.

This is similar to a left-join except that we match on nearest key rather than equal keys.

Both DataFrames must be sorted by the join_asof key.

For each row in the left DataFrame:

A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.

A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.

A “nearest” search selects the last row in the right DataFrame whose value is nearest to the left’s key.

The default is “backward”.

Parameters:

other

Lazy DataFrame to join with.

left_on

Join column of the left DataFrame.

right_on

Join column of the right DataFrame.

on

Join column of both DataFrames. If set, left_on and right_on should be None.

by

Join on these columns before doing asof join.

by_left

Join on these columns before doing asof join.

by_right

Join on these columns before doing asof join.

strategy{‘backward’, ‘forward’, ‘nearest’}

Join strategy.

suffix

Suffix to append to columns with a duplicate name.

tolerance

Numeric tolerance. By setting this, the join will only be done if the near keys are within this distance. If an asof join is done on columns of dtype “Date”, “Datetime”, “Duration” or “Time”, use the following string language:

1ns (1 nanosecond)

1us (1 microsecond)

1ms (1 millisecond)

1s (1 second)

1m (1 minute)

1h (1 hour)

1d (1 calendar day)

1w (1 calendar week)

1mo (1 calendar month)

1q (1 calendar quarter)

1y (1 calendar year)

1i (1 index count)

Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds

Suffix with “_saturating” to indicate that dates too large for their month should saturate at the largest date (e.g. 2022-02-29 -> 2022-02-28) instead of erroring.

By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.

allow_parallel

Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.

force_parallel

Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

Examples

>>> from datetime import datetime
>>> gdp = pl.LazyFrame(
...     {
...         "date": [
...             datetime(2016, 1, 1),
...             datetime(2017, 1, 1),
...             datetime(2018, 1, 1),
...             datetime(2019, 1, 1),
...         ],  # note record date: Jan 1st (sorted!)
...         "gdp": [4164, 4411, 4566, 4696],
...     }
... ).set_sorted("date")
>>> population = pl.LazyFrame(
...     {
...         "date": [
...             datetime(2016, 5, 12),
...             datetime(2017, 5, 12),
...             datetime(2018, 5, 12),
...             datetime(2019, 5, 12),
...         ],  # note record date: May 12th (sorted!)
...         "population": [82.19, 82.66, 83.12, 83.52],
...     }
... ).set_sorted("date")
>>> population.join_asof(gdp, on="date", strategy="backward").collect()
shape: (4, 3)
┌─────────────────────┬────────────┬──────┐
│ date                ┆ population ┆ gdp  │
│ ---                 ┆ ---        ┆ ---  │
│ datetime[μs]        ┆ f64        ┆ i64  │
╞═════════════════════╪════════════╪══════╡
│ 2016-05-12 00:00:00 ┆ 82.19      ┆ 4164 │
│ 2017-05-12 00:00:00 ┆ 82.66      ┆ 4411 │
│ 2018-05-12 00:00:00 ┆ 83.12      ┆ 4566 │
│ 2019-05-12 00:00:00 ┆ 83.52      ┆ 4696 │
└─────────────────────┴────────────┴──────┘
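The three search strategies can be modelled with binary search over the sorted right-hand keys. The sketch below is illustrative only and not the polars implementation. The helper name asof_index is hypothetical, it ignores by and tolerance, and resolving an exact tie in “nearest” in favour of the later row is an assumption based on the wording above.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime


def asof_index(right_keys, key, strategy="backward"):
    """Return the index in sorted `right_keys` matched to `key`, or None."""
    back = bisect_right(right_keys, key) - 1  # last right key <= key
    fwd = bisect_left(right_keys, key)  # first right key >= key
    has_back = back >= 0
    has_fwd = fwd < len(right_keys)
    if strategy == "backward":
        return back if has_back else None
    if strategy == "forward":
        return fwd if has_fwd else None
    if strategy == "nearest":
        if not has_back:
            return fwd if has_fwd else None
        if not has_fwd:
            return back
        # Assumption: on an exact tie the later row wins.
        return back if key - right_keys[back] < right_keys[fwd] - key else fwd
    raise ValueError(f"unknown strategy: {strategy!r}")


gdp_dates = [datetime(year, 1, 1) for year in (2016, 2017, 2018, 2019)]
gdp = [4164, 4411, 4566, 4696]
population_dates = [datetime(year, 5, 12) for year in (2016, 2017, 2018, 2019)]


def matched_gdp(strategy):
    matches = (asof_index(gdp_dates, d, strategy) for d in population_dates)
    return [None if i is None else gdp[i] for i in matches]


print(matched_gdp("backward"))
print(matched_gdp("forward"))
```

With “forward”, the last population row has no GDP record on or after its date, so it is left unmatched in this model.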

last() → Self[source]

Get the last row of the DataFrame.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> lf.last().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 5   ┆ 6   │
└─────┴─────┘

lazy() → Self[source]

Return lazy representation, i.e. itself.

Useful for writing code that expects either a DataFrame or LazyFrame.

Returns:

LazyFrame

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [None, 2, 3, 4],
...         "b": [0.5, None, 2.5, 13],
...         "c": [True, True, False, None],
...     }
... )
>>> lf.lazy()  
<LazyFrame [3 cols, {"a": Int64 … "c": Boolean}] at ...>

limit(n: int = 5) → Self[source]

Get the first n rows.

Alias for LazyFrame.head().

Parameters:

n: Number of rows to return.

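Notes

Consider using the fetch() operation if you only want to test your query. The fetch() operation will load the first n rows at the scan level, whereas the head()/limit() are applied at the end.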

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [7, 8, 9, 10, 11, 12],
...     }
... )
>>> lf.limit().collect()
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
│ 3   ┆ 9   │
│ 4   ┆ 10  │
│ 5   ┆ 11  │
└─────┴─────┘
>>> lf.limit(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
└─────┴─────┘

map( function: Callable[[DataFrame], DataFrame], *, predicate_pushdown: bool = True, projection_pushdown: bool = True, slice_pushdown: bool = True, no_optimizations: bool = False, schema: None | SchemaDict = None, validate_output_schema: bool = True, streamable: bool = False, ) → Self[source]

Apply a custom function.

It is important that the function returns a Polars DataFrame.

Parameters:

function: Lambda/function to apply.
predicate_pushdown: Allow predicate pushdown optimization to pass this node.
projection_pushdown: Allow projection pushdown optimization to pass this node.
slice_pushdown: Allow slice pushdown optimization to pass this node.
no_optimizations: Turn off all optimizations past this point.
schema: Output schema of the function, if set to None we assume that the schema will remain unchanged by the applied function.
validate_output_schema: It is paramount that polars’ schema is correct. This flag ensures that the output schema of this function is checked against the expected schema. Setting this to False skips the check, but may lead to hard-to-debug bugs.
streamable: Whether the given function is eligible to run with the streaming engine. This means that the function must produce the same result whether it is executed in batches or on the full dataset.

Warning

The schema of a LazyFrame must always be correct. It is up to the caller of this function to ensure that this invariant is upheld.

It is important that the optimization flags are correct. If the custom function for instance does an aggregation of a column, predicate_pushdown should not be allowed, as this prunes rows and will influence your aggregation results.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2],
...         "b": [3, 4],
...     }
... )
>>> lf.map(lambda x: 2 * x).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 6   │
│ 4   ┆ 8   │
└─────┴─────┘
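The warning about optimization flags can be illustrated without polars. In the sketch below, plain lists of dicts stand in for frames, and the function names add_total and keep_large_a are hypothetical. It shows why a filter must not be pushed past a custom function that aggregates: the aggregate changes when rows are pruned first.

```python
rows = [{"a": 1, "b": 3}, {"a": 2, "b": 4}, {"a": 3, "b": 5}]


def add_total(rows):
    """Custom function that aggregates: attach the sum of "b" to every row."""
    total = sum(r["b"] for r in rows)
    return [{**r, "b_total": total} for r in rows]


def keep_large_a(rows):
    """The predicate that a later filter would apply."""
    return [r for r in rows if r["a"] >= 2]


as_written = keep_large_a(add_total(rows))  # custom function first, then filter
pushed_down = add_total(keep_large_a(rows))  # filter pushed past the function

print(as_written[0]["b_total"], pushed_down[0]["b_total"])
```

The two orders keep the same rows but disagree on b_total, which is the situation predicate_pushdown=False is meant to prevent.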

max() → Self[source]

Aggregate the columns in the LazyFrame to their maximum value.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.max().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 4   ┆ 2   │
└─────┴─────┘

mean() → Self[source]

Aggregate the columns in the LazyFrame to their mean value.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.mean().collect()
shape: (1, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ f64 ┆ f64  │
╞═════╪══════╡
│ 2.5 ┆ 1.25 │
└─────┴──────┘

median() → Self[source]

Aggregate the columns in the LazyFrame to their median value.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.median().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 2.5 ┆ 1.0 │
└─────┴─────┘

Unpivot a DataFrame from wide to long format.

Optionally leaves identifiers set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars) while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis leaving just two non-identifier columns, ‘variable’ and ‘value’.

Parameters:

id_vars: Column(s) or selector(s) to use as identifier variables.
value_vars: Column(s) or selector(s) to use as value variables; if value_vars is empty, all columns that are not in id_vars will be used.
variable_name: Name to give to the variable column. Defaults to “variable”.
value_name: Name to give to the value column. Defaults to “value”.
streamable: Allow this node to run in the streaming engine. If this runs in streaming, the output of the melt operation will not have a stable ordering.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": ["x", "y", "z"],
...         "b": [1, 3, 5],
...         "c": [2, 4, 6],
...     }
... )
>>> import polars.selectors as cs
>>> lf.melt(id_vars="a", value_vars=cs.numeric()).collect()
shape: (6, 3)
┌─────┬──────────┬───────┐
│ a   ┆ variable ┆ value │
│ --- ┆ ---      ┆ ---   │
│ str ┆ str      ┆ i64   │
╞═════╪══════════╪═══════╡
│ x   ┆ b        ┆ 1     │
│ y   ┆ b        ┆ 3     │
│ z   ┆ b        ┆ 5     │
│ x   ┆ c        ┆ 2     │
│ y   ┆ c        ┆ 4     │
│ z   ┆ c        ┆ 6     │
└─────┴──────────┴───────┘
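The wide-to-long reshaping can be sketched on a plain dict of columns. This is a simplified model and not the polars implementation. The function below is a hypothetical stand-in that only mirrors the parameter names documented above and produces rows in the same order as the non-streaming output shown.

```python
def melt(columns, id_vars, value_vars=None, variable_name="variable", value_name="value"):
    """Unpivot a dict of equal-length columns from wide to long format."""
    if not value_vars:
        # Default: every column that is not an identifier is a value column.
        value_vars = [name for name in columns if name not in id_vars]
    n_rows = len(next(iter(columns.values())))
    long_rows = []
    for var in value_vars:
        for i in range(n_rows):
            row = {name: columns[name][i] for name in id_vars}
            row[variable_name] = var
            row[value_name] = columns[var][i]
            long_rows.append(row)
    return long_rows


wide = {"a": ["x", "y", "z"], "b": [1, 3, 5], "c": [2, 4, 6]}
long_rows = melt(wide, id_vars=["a"])
for row in long_rows:
    print(row)
```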

merge_sorted(other: LazyFrame, key: str) → Self[source]

Take two sorted DataFrames and merge them by the sorted key.

The output of this operation will also be sorted. It is the caller’s responsibility to ensure that the frames are sorted by that key; otherwise the output will not make sense.

The schemas of both LazyFrames must be equal.

Parameters:

other: Other DataFrame that must be merged
key: Key that is sorted.

Examples

>>> df0 = pl.LazyFrame(
...     {"name": ["steve", "elise", "bob"], "age": [42, 44, 18]}
... ).sort("age")
>>> df0.collect()
shape: (3, 2)
┌───────┬─────┐
│ name  ┆ age │
│ ---   ┆ --- │
│ str   ┆ i64 │
╞═══════╪═════╡
│ bob   ┆ 18  │
│ steve ┆ 42  │
│ elise ┆ 44  │
└───────┴─────┘
>>> df1 = pl.LazyFrame(
...     {"name": ["anna", "megan", "steve", "thomas"], "age": [21, 33, 42, 20]}
... ).sort("age")
>>> df1.collect()
shape: (4, 2)
┌────────┬─────┐
│ name   ┆ age │
│ ---    ┆ --- │
│ str    ┆ i64 │
╞════════╪═════╡
│ thomas ┆ 20  │
│ anna   ┆ 21  │
│ megan  ┆ 33  │
│ steve  ┆ 42  │
└────────┴─────┘
>>> df0.merge_sorted(df1, key="age").collect()
shape: (7, 2)
┌────────┬─────┐
│ name   ┆ age │
│ ---    ┆ --- │
│ str    ┆ i64 │
╞════════╪═════╡
│ bob    ┆ 18  │
│ thomas ┆ 20  │
│ anna   ┆ 21  │
│ megan  ┆ 33  │
│ steve  ┆ 42  │
│ steve  ┆ 42  │
│ elise  ┆ 44  │
└────────┴─────┘
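The same merge step exists in the Python standard library as heapq.merge, which makes for a convenient mental model (it is not what polars uses internally, as far as this page states). The sketch below reuses the rows from the example above and also shows what happens when an input is not actually sorted.

```python
from heapq import merge
from operator import itemgetter

df0 = [("bob", 18), ("steve", 42), ("elise", 44)]
df1 = [("thomas", 20), ("anna", 21), ("megan", 33), ("steve", 42)]

# Both inputs are already sorted by age, so a single linear pass is enough.
merged = list(merge(df0, df1, key=itemgetter(1)))
print(merged)

# If an input is not sorted, the merge still runs but the result is not sorted.
broken = list(merge([("a", 3), ("b", 1)], [("c", 2)], key=itemgetter(1)))
print([age for _, age in broken])
```

The merge never re-sorts its inputs; it only interleaves them, which is why sortedness of both frames is a precondition rather than something the operation can repair.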

min() → Self[source]

Aggregate the columns in the LazyFrame to their minimum value.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.min().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 1   │
└─────┴─────┘

null_count() → Self[source]

Aggregate the columns in the LazyFrame as the sum of their null value count.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, None, 3],
...         "bar": [6, 7, None],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.null_count().collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 0   │
└─────┴─────┴─────┘

pipe( function: Callable[Concatenate[LazyFrame, P], T], *args: P.args, **kwargs: P.kwargs, ) → T[source]

Offers a structured way to apply a sequence of user-defined functions (UDFs).

Parameters:

function: Callable; will receive the frame as the first parameter, followed by any given args/kwargs.
*args: Arguments to pass to the UDF.
**kwargs: Keyword arguments to pass to the UDF.

Examples

>>> def cast_str_to_int(data, col_name):
...     return data.with_columns(pl.col(col_name).cast(pl.Int64))
...
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": ["10", "20", "30", "40"],
...     }
... )
>>> lf.pipe(cast_str_to_int, col_name="b").collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 10  │
│ 2   ┆ 20  │
│ 3   ┆ 30  │
│ 4   ┆ 40  │
└─────┴─────┘

>>> lf = pl.LazyFrame(
...     {
...         "b": [1, 2],
...         "a": [3, 4],
...     }
... )
>>> lf.collect()
shape: (2, 2)
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘
>>> lf.pipe(lambda tdf: tdf.select(sorted(tdf.columns))).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 1   │
│ 4   ┆ 2   │
└─────┴─────┘
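Conceptually, pipe only calls the given function with the frame as its first argument, which lets a sequence of UDFs read top to bottom instead of inside out. The sketch below uses a hypothetical Frame class as a stand-in for a LazyFrame; scale and keep_above are made-up UDFs.

```python
class Frame:
    """Hypothetical stand-in for a frame; only `pipe` is modelled."""

    def __init__(self, rows):
        self.rows = rows

    def pipe(self, function, *args, **kwargs):
        # The frame is passed first, followed by any given args/kwargs.
        return function(self, *args, **kwargs)


def scale(frame, factor):
    return Frame([r * factor for r in frame.rows])


def keep_above(frame, *, threshold):
    return Frame([r for r in frame.rows if r > threshold])


# Chained form: reads in the order the steps are applied.
piped = Frame([1, 2, 3]).pipe(scale, 10).pipe(keep_above, threshold=15)
# Equivalent nested form: reads inside out.
nested = keep_above(scale(Frame([1, 2, 3]), 10), threshold=15)
print(piped.rows, nested.rows)
```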

profile( *, type_coercion: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, no_optimization: bool = False, slice_pushdown: bool = True, comm_subplan_elim: bool = True, comm_subexpr_elim: bool = True, show_plot: bool = False, truncate_nodes: int = 0, figsize: tuple[int, int] = (18, 8), streaming: bool = False, ) → tuple[DataFrame, DataFrame][source]

Profile a LazyFrame.

This will run the query and return a tuple containing the materialized DataFrame and a DataFrame that contains profiling information of each node that is executed.

The units of the timings are microseconds.

Parameters:

type_coercion: Do type coercion optimization.
predicate_pushdown: Do predicate pushdown optimization.
projection_pushdown: Do projection pushdown optimization.
simplify_expression: Run simplify expressions optimization.
no_optimization: Turn off (certain) optimizations.
slice_pushdown: Slice pushdown optimization.
comm_subplan_elim: Will try to cache branching subplans that occur on self-joins or unions.
comm_subexpr_elim: Common subexpressions will be cached and reused.
show_plot: Show a Gantt chart of the profiling result.
truncate_nodes: Truncate the label lengths in the gantt chart to this number of characters.
figsize: matplotlib figsize of the profiling plot
streaming: Run parts of the query in a streaming fashion (this is in an alpha state)

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.groupby("a", maintain_order=True).agg(pl.all().sum()).sort(
...     "a"
... ).profile()  
(shape: (3, 3)
 ┌─────┬─────┬─────┐
 │ a   ┆ b   ┆ c   │
 │ --- ┆ --- ┆ --- │
 │ str ┆ i64 ┆ i64 │
 ╞═════╪═════╪═════╡
 │ a   ┆ 4   ┆ 10  │
 │ b   ┆ 11  ┆ 10  │
 │ c   ┆ 6   ┆ 1   │
 └─────┴─────┴─────┘,
 shape: (3, 3)
 ┌────────────────────────┬───────┬──────┐
 │ node                   ┆ start ┆ end  │
 │ ---                    ┆ ---   ┆ ---  │
 │ str                    ┆ u64   ┆ u64  │
 ╞════════════════════════╪═══════╪══════╡
 │ optimization           ┆ 0     ┆ 5    │
 │ groupby_partitioned(a) ┆ 5     ┆ 470  │
 │ sort(a)                ┆ 475   ┆ 1964 │
 └────────────────────────┴───────┴──────┘)

quantile( quantile: float | Expr, interpolation: RollingInterpolationMethod = 'nearest', ) → Self[source]

Aggregate the columns in the LazyFrame to their quantile value.

Parameters:

quantile: Quantile between 0.0 and 1.0.
interpolation{‘nearest’, ‘higher’, ‘lower’, ‘midpoint’, ‘linear’}: Interpolation method.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.quantile(0.7).collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 3.0 ┆ 1.0 │
└─────┴─────┘
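The five interpolation options differ in how they pick a value when the requested quantile falls between two sorted elements. The sketch below uses common textbook definitions and is not taken from the polars source. The function is a hypothetical stand-in, it assumes a non-empty input, and how polars breaks exact ties for “nearest” is not verified here.

```python
import math


def quantile(values, q, interpolation="nearest"):
    """Quantile of a non-empty list under one of five interpolation rules."""
    data = sorted(values)
    pos = q * (len(data) - 1)  # fractional position in the sorted data
    lo, hi = math.floor(pos), math.ceil(pos)
    if interpolation == "lower":
        return float(data[lo])
    if interpolation == "higher":
        return float(data[hi])
    if interpolation == "nearest":
        # Assumption: exact ties are left to Python's round().
        return float(data[round(pos)])
    if interpolation == "midpoint":
        return (data[lo] + data[hi]) / 2
    if interpolation == "linear":
        return data[lo] + (pos - lo) * (data[hi] - data[lo])
    raise ValueError(f"unknown interpolation: {interpolation!r}")


a = [1, 2, 3, 4]
b = [1, 2, 1, 1]
print(quantile(a, 0.7), quantile(b, 0.7))
```

For column a, the position 0.7 * 3 = 2.1 lies between the sorted values 3 and 4, so “nearest” and “lower” pick 3, “higher” picks 4, and “midpoint” gives 3.5.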

classmethod read_json(source: str | Path | IOBase) → Self[source]

Read a logical plan from a JSON file to construct a LazyFrame.

Deprecated since version 0.18.12: This class method has been renamed to deserialize.

Parameters:

source: Path to a file or a file-like object (by file-like object, we refer to objects that have a read() method, such as a file handler (e.g. via builtin open function) or BytesIO).

See also

deserialize

rename(mapping: dict[str, str]) → Self[source]

Rename column names.

Parameters:

mapping: Key value pairs that map from old name to new name.

Notes

If existing names are swapped (e.g. ‘A’ points to ‘B’ and ‘B’ points to ‘A’), polars will block projection and predicate pushdowns at this node.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.rename({"foo": "apple"}).collect()
shape: (3, 3)
┌───────┬─────┬─────┐
│ apple ┆ bar ┆ ham │
│ ---   ┆ --- ┆ --- │
│ i64   ┆ i64 ┆ str │
╞═══════╪═════╪═════╡
│ 1     ┆ 6   ┆ a   │
│ 2     ┆ 7   ┆ b   │
│ 3     ┆ 8   ┆ c   │
└───────┴─────┴─────┘
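
The mapping is read as "old name to new name" for every column at once, which is why a swap is well defined. The following plain-Python sketch models only that lookup; `rename_columns` is a hypothetical helper and does not reflect how polars rewrites the query plan.

```python
def rename_columns(columns, mapping):
    # Look each original name up exactly once, so a swap such as
    # {"A": "B", "B": "A"} does not cascade into a double rename.
    return [mapping.get(name, name) for name in columns]


renamed = rename_columns(["foo", "bar", "ham"], {"foo": "apple"})
swapped = rename_columns(["A", "B", "C"], {"A": "B", "B": "A"})
print(renamed)
print(swapped)
```

The second call is the swapped case described in the notes, where pushdown optimizations are blocked at the rename node.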

reverse() → Self[source]

Reverse the DataFrame.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "key": ["a", "b", "c"],
...         "val": [1, 2, 3],
...     }
... )
>>> lf.reverse().collect()
shape: (3, 2)
┌─────┬─────┐
│ key ┆ val │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ c   ┆ 3   │
│ b   ┆ 2   │
│ a   ┆ 1   │
└─────┴─────┘

property schema: SchemaDict[source]

Get a dict[column name, DataType].

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.schema
{'foo': Int64, 'bar': Float64, 'ham': Utf8}

select(*exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr) → Self[source]

Select columns from this LazyFrame.

Parameters:

*exprs: Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
**named_exprs: Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.

Examples

Pass the name of a column to select that column.

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.select("foo").collect()
shape: (3, 1)
┌─────┐
│ foo │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘

Multiple columns can be selected by passing a list of column names.

>>> lf.select(["foo", "bar"]).collect()
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 6   │
│ 2   ┆ 7   │
│ 3   ┆ 8   │
└─────┴─────┘

Multiple columns can also be selected using positional arguments instead of a list. Expressions are also accepted.

>>> lf.select(pl.col("foo"), pl.col("bar") + 1).collect()
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
│ 3   ┆ 9   │
└─────┴─────┘

Use keyword arguments to easily name your expression inputs.

>>> lf.select(
...     threshold=pl.when(pl.col("foo") > 2).then(10).otherwise(0)
... ).collect()
shape: (3, 1)
┌───────────┐
│ threshold │
│ ---       │
│ i32       │
╞═══════════╡
│ 0         │
│ 0         │
│ 10        │
└───────────┘

Expressions with multiple outputs can be automatically instantiated as Structs by enabling the experimental setting Config.set_auto_structify(True):

>>> with pl.Config(auto_structify=True):
...     lf.select(
...         is_odd=(pl.col(pl.INTEGER_DTYPES) % 2).suffix("_is_odd"),
...     ).collect()
...
shape: (3, 1)
┌───────────┐
│ is_odd    │
│ ---       │
│ struct[2] │
╞═══════════╡
│ {1,0}     │
│ {0,1}     │
│ {1,0}     │
└───────────┘

select_seq( *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr, ) → Self[source]

Select columns from this LazyFrame.

This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.

Parameters:

*exprs: Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
**named_exprs: Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.

See also

select

serialize(file: None = None) → str[source]

serialize(file: IOBase | str | Path) → None

Serialize the logical plan of this LazyFrame to a file or string in JSON format.

Parameters:

file: File path to which the result should be written. If set to None (default), the output is returned as a string instead.

See also

LazyFrame.deserialize

Examples

Serialize the logical plan into a JSON string.

>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum()
>>> json = lf.serialize()
>>> json
'{"LocalProjection":{"expr":[{"Agg":{"Sum":{"Column":"a"}}}],"input":{"DataFrameScan":{"df":{"columns":[{"name":"a","datatype":"Int64","bit_settings":"","values":[1,2,3]}]},"schema":{"inner":{"a":"Int64"}},"output_schema":null,"projection":null,"selection":null}},"schema":{"inner":{"a":"Int64"}}}}'

The logical plan can later be deserialized back into a LazyFrame.

>>> import io
>>> pl.LazyFrame.deserialize(io.StringIO(json)).collect()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 6   │
└─────┘

set_sorted( column: str | Iterable[str], *more_columns: str, descending: bool = False, ) → Self[source]

Indicate that one or multiple columns are sorted.

Parameters:

column: Column(s) that are sorted.
more_columns: Additional columns that are sorted, specified as positional arguments.
descending: Whether the columns are sorted in descending order.
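
The method only records a claim about the data; as far as this page states, it does not sort or verify anything. The generic sketch below (plain Python, not polars internals) illustrates why such a claim must be true: an algorithm that relies on sortedness, here a binary search, silently returns a wrong answer on unsorted input. `contains_sorted` is a hypothetical helper.

```python
from bisect import bisect_left


def contains_sorted(values, target):
    # Binary search: only valid if `values` really is in ascending order.
    i = bisect_left(values, target)
    return i < len(values) and values[i] == target


truly_sorted = [1, 3, 5, 7, 9]
not_sorted = [9, 1, 7, 3, 5]

found_in_sorted = contains_sorted(truly_sorted, 7)
found_in_unsorted = contains_sorted(not_sorted, 9)  # 9 is present, yet missed
print(found_in_sorted, found_in_unsorted)
```

The practical takeaway is to flag a column as sorted only when the data is known to be in that order, for example directly after an explicit sort.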

shift(periods: int) → Self[source]

Shift the values by a given period.

Parameters:

periods: Number of places to shift (may be negative).

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> lf.shift(periods=1).collect()
shape: (3, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ null ┆ null │
│ 1    ┆ 2    │
│ 3    ┆ 4    │
└──────┴──────┘
>>> lf.shift(periods=-1).collect()
shape: (3, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 3    ┆ 4    │
│ 5    ┆ 6    │
│ null ┆ null │
└──────┴──────┘

shift_and_fill( fill_value: Expr | int | str | float, *, periods: int = 1, ) → Self[source]

Shift the values by a given period and fill the resulting null values.

Parameters:

fill_value: Fill None values with the result of this expression.
periods: Number of places to shift (may be negative).

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> lf.shift_and_fill(fill_value=0, periods=1).collect()
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0   ┆ 0   │
│ 1   ┆ 2   │
│ 3   ┆ 4   │
└─────┴─────┘
>>> lf.shift_and_fill(periods=-1, fill_value=0).collect()
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 4   │
│ 5   ┆ 6   │
│ 0   ┆ 0   │
└─────┴─────┘
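
The shifting rule for a single column can be modelled on a Python list. This is a sketch for illustration only; `shift_values` is a hypothetical helper, with `fill_value=None` standing in for the null produced by shift.

```python
def shift_values(values, periods, fill_value=None):
    """Shift a list by `periods` places, padding the vacated slots."""
    n = len(values)
    k = min(abs(periods), n)  # never pad more slots than there are rows
    pad = [fill_value] * k
    if periods >= 0:
        # Positive periods move values towards the end.
        return pad + list(values[: n - k])
    # Negative periods move values towards the start.
    return list(values[k:]) + pad


a = [1, 3, 5]
print(shift_values(a, 1))
print(shift_values(a, -1))
print(shift_values(a, 1, fill_value=0))
print(shift_values(a, -1, fill_value=0))
```

The four calls mirror column `a` in the shift and shift_and_fill examples: the length of the column never changes, only the positions of its values.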

show_graph( *, optimized: bool = True, show: bool = True, output_path: str | Path | None = None, raw_output: bool = False, figsize: tuple[float, float] = (16.0, 12.0), type_coercion: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, slice_pushdown: bool = True, comm_subplan_elim: bool = True, comm_subexpr_elim: bool = True, streaming: bool = False, ) → str | None[source]

Show a plot of the query plan. Note that you should have graphviz installed.

Parameters:

optimized: Optimize the query plan.
show: Show the figure.
output_path: Write the figure to disk.
raw_output: Return dot syntax. This cannot be combined with show and/or output_path.
figsize: Passed to matplotlib if show == True.
type_coercion: Do type coercion optimization.
predicate_pushdown: Do predicate pushdown optimization.
projection_pushdown: Do projection pushdown optimization.
simplify_expression: Run simplify expressions optimization.
slice_pushdown: Slice pushdown optimization.
comm_subplan_elim: Will try to cache branching subplans that occur on self-joins or unions.
comm_subexpr_elim: Common subexpressions will be cached and reused.
streaming: Run parts of the query in a streaming fashion (this is in an alpha state).

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.groupby("a", maintain_order=True).agg(pl.all().sum()).sort(
...     "a"
... ).show_graph()  

sink_ipc( path: str | Path, *, compression: str | None = 'zstd', maintain_order: bool = True, type_coercion: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, no_optimization: bool = False, slice_pushdown: bool = True, ) → DataFrame[source]

Persists a LazyFrame at the provided path.

This allows streaming results that are larger than RAM to be written to disk.

Parameters:

path: File path to which the file should be written.
compression{‘lz4’, ‘zstd’}: Choose “zstd” for good compression performance. Choose “lz4” for fast compression/decompression.
maintain_order: Maintain the order in which data is processed. Setting this to False will be slightly faster.
type_coercion: Do type coercion optimization.
predicate_pushdown: Do predicate pushdown optimization.
projection_pushdown: Do projection pushdown optimization.
simplify_expression: Run simplify expressions optimization.
no_optimization: Turn off (certain) optimizations.
slice_pushdown: Slice pushdown optimization.

Returns:

DataFrame

Examples

>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")  
>>> lf.sink_ipc("out.arrow")  

sink_parquet( path: str | Path, *, compression: str = 'zstd', compression_level: int | None = None, statistics: bool = False, row_group_size: int | None = None, data_pagesize_limit: int | None = None, maintain_order: bool = True, type_coercion: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, no_optimization: bool = False, slice_pushdown: bool = True, ) → DataFrame[source]

Persists a LazyFrame at the provided path.

This allows streaming results that are larger than RAM to be written to disk.

Parameters:

path

File path to which the file should be written.

compression{‘lz4’, ‘uncompressed’, ‘snappy’, ‘gzip’, ‘lzo’, ‘brotli’, ‘zstd’}

Choose “zstd” for good compression performance. Choose “lz4” for fast compression/decompression. Choose “snappy” for more backwards compatibility guarantees when you deal with older parquet readers.

compression_level

The level of compression to use. Higher compression means smaller files on disk.

“gzip” : min-level: 0, max-level: 10.
“brotli” : min-level: 0, max-level: 11.
“zstd” : min-level: 1, max-level: 22.

statistics

Write statistics to the parquet headers. This requires extra compute.

row_group_size

Size of the row groups in number of rows. If None (default), the chunks of the DataFrame are used. Writing in smaller chunks may reduce memory pressure and improve writing speeds. If None and use_pyarrow=True, the row group size will be the minimum of the DataFrame size and 64 * 1024 * 1024.

data_pagesize_limit

Size limit of individual data pages. If not set, defaults to 1024 * 1024 bytes.

maintain_order

Maintain the order in which data is processed. Setting this to False will be slightly faster.

type_coercion

Do type coercion optimization.

predicate_pushdown

Do predicate pushdown optimization.

projection_pushdown

Do projection pushdown optimization.

simplify_expression

Run simplify expressions optimization.

no_optimization

Turn off (certain) optimizations.

slice_pushdown

Slice pushdown optimization.

Returns:

DataFrame

Examples

>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")  
>>> lf.sink_parquet("out.parquet")  

slice(offset: int, length: int | None = None) → Self[source]

Get a slice of this DataFrame.

Parameters:

offset: Start index. Negative indexing is supported.
length: Length of the slice. If set to None, all rows starting at the offset will be selected.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": ["x", "y", "z"],
...         "b": [1, 3, 5],
...         "c": [2, 4, 6],
...     }
... )
>>> lf.slice(1, 2).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ y   ┆ 3   ┆ 4   │
│ z   ┆ 5   ┆ 6   │
└─────┴─────┴─────┘
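
A negative offset counts from the end of the frame. The sketch below models this on a list of rows; `slice_rows` is a hypothetical helper, and its clamping of out-of-range values is an assumption rather than documented polars behaviour.

```python
def slice_rows(rows, offset, length=None):
    n = len(rows)
    # A negative offset is measured back from the last row.
    start = offset + n if offset < 0 else offset
    stop = n if length is None else start + length
    # Assumption: out-of-range bounds are clamped to the frame.
    start = max(0, min(start, n))
    stop = max(0, min(stop, n))
    return rows[start:stop]


rows = [("x", 1, 2), ("y", 3, 4), ("z", 5, 6)]
print(slice_rows(rows, 1, 2))
print(slice_rows(rows, -2, 1))
print(slice_rows(rows, -1))
```

The first call reproduces the example above; the other two select the second-to-last row and the final row respectively.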

sort( by: IntoExpr | Iterable[IntoExpr], *more_by: IntoExpr, descending: bool | Sequence[bool] = False, nulls_last: bool = False, maintain_order: bool = False, ) → Self[source]

Sort the dataframe by the given columns.

Parameters:

by: Column(s) to sort by. Accepts expression input. Strings are parsed as column names.
*more_by: Additional columns to sort by, specified as positional arguments.
descending: Sort in descending order. When sorting by multiple columns, can be specified per column by passing a sequence of booleans.
nulls_last: Place null values last.
maintain_order: Whether the order should be maintained if elements are equal. Note that if True, streaming is not possible and performance might be worse, since this requires a stable sort.

Examples

Pass a single column name to sort by that column.

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, None],
...         "b": [6.0, 5.0, 4.0],
...         "c": ["a", "c", "b"],
...     }
... )
>>> lf.sort("a").collect()
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ null ┆ 4.0 ┆ b   │
│ 1    ┆ 6.0 ┆ a   │
│ 2    ┆ 5.0 ┆ c   │
└──────┴─────┴─────┘

Sorting by expressions is also supported.

>>> lf.sort(pl.col("a") + pl.col("b") * 2, nulls_last=True).collect()
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ 2    ┆ 5.0 ┆ c   │
│ 1    ┆ 6.0 ┆ a   │
│ null ┆ 4.0 ┆ b   │
└──────┴─────┴─────┘

Sort by multiple columns by passing a list of columns.

>>> lf.sort(["c", "a"], descending=True).collect()
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ 2    ┆ 5.0 ┆ c   │
│ null ┆ 4.0 ┆ b   │
│ 1    ┆ 6.0 ┆ a   │
└──────┴─────┴─────┘

Or use positional arguments to sort by multiple columns in the same way.

>>> lf.sort("c", "a", descending=[False, True]).collect()
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ 1    ┆ 6.0 ┆ a   │
│ null ┆ 4.0 ┆ b   │
│ 2    ┆ 5.0 ┆ c   │
└──────┴─────┴─────┘
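
Per-column sort directions and the idea behind maintain_order can both be illustrated with Python's stable `sorted`. This is a sketch on plain dict rows without nulls, not a description of how polars sorts; `sort_rows` and the column names `c` and `n` are made up for the example.

```python
def sort_rows(rows, by, descending):
    """Sort dict rows by several keys, each with its own direction."""
    result = list(rows)
    # Apply the keys from last to first. Because sorted() is stable,
    # earlier keys take priority and ties keep their previous order.
    for key, desc in reversed(list(zip(by, descending))):
        result = sorted(result, key=lambda row: row[key], reverse=desc)
    return result


rows = [
    {"c": "b", "n": 1},
    {"c": "a", "n": 2},
    {"c": "b", "n": 3},
    {"c": "a", "n": 4},
]
ordered = sort_rows(rows, by=["c", "n"], descending=[False, True])
pairs = [(row["c"], row["n"]) for row in ordered]
print(pairs)
```

Rows are grouped by ascending `c`, and within each group `n` runs in descending order. A stable sort is what guarantees that rows with equal keys are not reshuffled.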

std(ddof: int = 1) → Self[source]

Aggregate the columns in the LazyFrame to their standard deviation value.

Parameters:

ddof: “Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.std().collect()
shape: (1, 2)
┌──────────┬─────┐
│ a        ┆ b   │
│ ---      ┆ --- │
│ f64      ┆ f64 │
╞══════════╪═════╡
│ 1.290994 ┆ 0.5 │
└──────────┴─────┘
>>> lf.std(ddof=0).collect()
shape: (1, 2)
┌──────────┬──────────┐
│ a        ┆ b        │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
│ 1.118034 ┆ 0.433013 │
└──────────┴──────────┘
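
The effect of ddof is easiest to see by writing out the divisor N - ddof. The following plain-Python sketch recomputes the values shown above; `std_sketch` is a hypothetical helper.

```python
import math


def std_sketch(values, ddof=1):
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    # The divisor is N - ddof: 1 gives the sample estimate, 0 the population one.
    return math.sqrt(squared_deviations / (n - ddof))


a = [1, 2, 3, 4]
b = [1, 2, 1, 1]
print(round(std_sketch(a), 6), round(std_sketch(b), 6))
print(round(std_sketch(a, ddof=0), 6), round(std_sketch(b, ddof=0), 6))
```

With ddof=0 the divisor is larger, so the result is always the smaller of the two.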

sum() → Self[source]

Aggregate the columns in the LazyFrame to their sum value.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.sum().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 10  ┆ 5   │
└─────┴─────┘

tail(n: int = 5) → Self[source]

Get the last n rows.

Parameters:

n: Number of rows to return.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [7, 8, 9, 10, 11, 12],
...     }
... )
>>> lf.tail().collect()
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 8   │
│ 3   ┆ 9   │
│ 4   ┆ 10  │
│ 5   ┆ 11  │
│ 6   ┆ 12  │
└─────┴─────┘
>>> lf.tail(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 5   ┆ 11  │
│ 6   ┆ 12  │
└─────┴─────┘

take_every(n: int) → Self[source]

Take every nth row in the LazyFrame and return as a new LazyFrame.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [5, 6, 7, 8],
...     }
... )
>>> lf.take_every(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 5   │
│ 3   ┆ 7   │
└─────┴─────┘
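
In list terms, the example above behaves like a stride-n selection that starts at the first row. A minimal sketch, with `take_every_sketch` as a hypothetical helper:

```python
def take_every_sketch(rows, n):
    # Keep rows 0, n, 2n, ... of the input.
    return rows[::n]


rows = list(zip([1, 2, 3, 4], [5, 6, 7, 8]))
picked = take_every_sketch(rows, 2)
print(picked)
```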

top_k( k: int, *, by: IntoExpr | Iterable[IntoExpr], descending: bool | Sequence[bool] = False, nulls_last: bool = False, maintain_order: bool = False, ) → Self[source]

Return the k largest elements.

If descending=True, the smallest elements will be given.

Parameters:

k: Number of rows to return.
by: Column(s) included in sort order. Accepts expression input. Strings are parsed as column names.
descending: Return the ‘k’ smallest. Top-k by multiple columns can be specified per column by passing a sequence of booleans.
nulls_last: Place null values last.
maintain_order: Whether the order should be maintained if elements are equal. Note that if True, streaming is not possible and performance might be worse, since this requires a stable search.

See also

bottom_k

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [2, 1, 1, 3, 2, 1],
...     }
... )

Get the rows which contain the 4 largest values in column b.

>>> lf.top_k(4, by="b").collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ b   ┆ 3   │
│ a   ┆ 2   │
│ b   ┆ 2   │
│ b   ┆ 1   │
└─────┴─────┘

Get the rows which contain the 4 largest values when sorting on column b and a.

>>> lf.top_k(4, by=["b", "a"]).collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ b   ┆ 3   │
│ b   ┆ 2   │
│ a   ┆ 2   │
│ c   ┆ 1   │
└─────┴─────┘
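
The same selections can be expressed with `heapq.nlargest` from the Python standard library, which helps to see how a second `by` column breaks ties. This is an analogy on plain tuples, not the algorithm polars uses.

```python
import heapq

# (a, b) rows from the example above.
rows = list(zip(["a", "b", "a", "b", "b", "c"], [2, 1, 1, 3, 2, 1]))

# Top 4 by column b alone. Three rows tie on b == 1, so which of them
# fills the last slot should not be relied upon.
top_by_b = heapq.nlargest(4, rows, key=lambda row: row[1])

# Top 4 by (b, a): ties on b are broken by comparing a.
top_by_b_a = heapq.nlargest(4, rows, key=lambda row: (row[1], row[0]))

print([b for _, b in top_by_b])
print(top_by_b_a)
```

With `(b, a)` as the key, the tie among the rows with `b == 1` is resolved in favour of the largest `a`, which is why the row with `a == "c"` appears in the second result.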

unique( subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None, *, keep: UniqueKeepStrategy = 'any', maintain_order: bool = False, ) → Self[source]

Drop duplicate rows from this dataframe.

Parameters:

subset

Column name(s) or selector(s), to consider when identifying duplicate rows. If set to None (default), use all columns.

keep{‘first’, ‘last’, ‘any’, ‘none’}

Which of the duplicate rows to keep.

‘any’: Does not give any guarantee of which row is kept. This allows more optimizations.
‘none’: Don’t keep duplicate rows.
‘first’: Keep first unique row.
‘last’: Keep last unique row.

maintain_order

Keep the same order as the original DataFrame. This is more expensive to compute. Setting this to True blocks the possibility to run on the streaming engine.

Returns:

LazyFrame: LazyFrame with unique rows.

Warning

This method will fail if there is a column of type List in the DataFrame or subset.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3, 1],
...         "bar": ["a", "a", "a", "a"],
...         "ham": ["b", "b", "b", "b"],
...     }
... )
>>> lf.unique(maintain_order=True).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ a   ┆ b   │
│ 2   ┆ a   ┆ b   │
│ 3   ┆ a   ┆ b   │
└─────┴─────┴─────┘
>>> lf.unique(subset=["bar", "ham"], maintain_order=True).collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ a   ┆ b   │
└─────┴─────┴─────┘
>>> lf.unique(keep="last", maintain_order=True).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 2   ┆ a   ┆ b   │
│ 3   ┆ a   ┆ b   │
│ 1   ┆ a   ┆ b   │
└─────┴─────┴─────┘
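
The ‘first’, ‘last’ and ‘none’ strategies can be modelled on a list of row tuples, with the output kept in original order as with maintain_order=True. This is a plain-Python sketch; `unique_rows` is a hypothetical helper, and ‘any’ is left out because it makes no promise about which row survives.

```python
from collections import Counter


def unique_rows(rows, keep="first"):
    """De-duplicate hashable rows, preserving the original order."""
    if keep == "none":
        counts = Counter(rows)
        # Drop every row that occurs more than once.
        return [row for row in rows if counts[row] == 1]
    if keep in ("first", "last"):
        chosen = {}
        for i, row in enumerate(rows):
            # "first" records only the first index, "last" keeps overwriting.
            if keep == "last" or row not in chosen:
                chosen[row] = i
        return [row for i, row in enumerate(rows) if chosen[row] == i]
    raise ValueError(f"unsupported keep strategy: {keep!r}")


rows = [(1, "a", "b"), (2, "a", "b"), (3, "a", "b"), (1, "a", "b")]
print([row[0] for row in unique_rows(rows, "first")])
print([row[0] for row in unique_rows(rows, "last")])
print([row[0] for row in unique_rows(rows, "none")])
```

With keep="last" the surviving copy of the duplicated row is the one at the end, which explains the row order in the last example above.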

unnest(columns: str | Sequence[str], *more_columns: str) → Self[source]

Decompose struct columns into separate columns for each of their fields.

The new columns will be inserted into the dataframe at the location of the struct column.

Parameters:

columns: Name of the struct column(s) that should be unnested.
*more_columns: Additional columns to unnest, specified as positional arguments.

Examples

>>> df = pl.LazyFrame(
...     {
...         "before": ["foo", "bar"],
...         "t_a": [1, 2],
...         "t_b": ["a", "b"],
...         "t_c": [True, None],
...         "t_d": [[1, 2], [3]],
...         "after": ["baz", "womp"],
...     }
... ).select("before", pl.struct(pl.col("^t_.$")).alias("t_struct"), "after")
>>> df.collect()
shape: (2, 3)
┌────────┬─────────────────────┬───────┐
│ before ┆ t_struct            ┆ after │
│ ---    ┆ ---                 ┆ ---   │
│ str    ┆ struct[4]           ┆ str   │
╞════════╪═════════════════════╪═══════╡
│ foo    ┆ {1,"a",true,[1, 2]} ┆ baz   │
│ bar    ┆ {2,"b",null,[3]}    ┆ womp  │
└────────┴─────────────────────┴───────┘
>>> df.unnest("t_struct").collect()
shape: (2, 6)
┌────────┬─────┬─────┬──────┬───────────┬───────┐
│ before ┆ t_a ┆ t_b ┆ t_c  ┆ t_d       ┆ after │
│ ---    ┆ --- ┆ --- ┆ ---  ┆ ---       ┆ ---   │
│ str    ┆ i64 ┆ str ┆ bool ┆ list[i64] ┆ str   │
╞════════╪═════╪═════╪══════╪═══════════╪═══════╡
│ foo    ┆ 1   ┆ a   ┆ true ┆ [1, 2]    ┆ baz   │
│ bar    ┆ 2   ┆ b   ┆ null ┆ [3]       ┆ womp  │
└────────┴─────┴─────┴──────┴───────────┴───────┘
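
The placement rule (fields appear where the struct column was) can be sketched for a single row using a dict, which preserves insertion order. `unnest_row` is a hypothetical helper, and a nested dict stands in for the struct value.

```python
def unnest_row(row, struct_column):
    """Replace one dict-valued entry by its fields, keeping its position."""
    out = {}
    for name, value in row.items():
        if name == struct_column:
            out.update(value)  # the fields land where the struct column was
        else:
            out[name] = value
    return out


row = {
    "before": "foo",
    "t_struct": {"t_a": 1, "t_b": "a", "t_c": True, "t_d": [1, 2]},
    "after": "baz",
}
flat = unnest_row(row, "t_struct")
print(list(flat))
```

The resulting column order matches the header of the unnested frame above.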

update( other: LazyFrame, on: str | Sequence[str] | None = None, how: Literal['left', 'inner'] = 'left', ) → Self[source]

Update the values in this LazyFrame with the non-null values in other.

Parameters:

other: LazyFrame that will be used to update the values.
on: Column names that will be joined on. If None is given, the row count is used.
how{‘left’, ‘inner’}: ‘left’ will keep the left table rows as is. ‘inner’ will remove rows that are not found in other.

Warning

This functionality is experimental and may change without it being considered a breaking change.

Notes

This is syntactic sugar for a left/inner join followed by a coalesce.

Examples

>>> df = pl.DataFrame(
...     {
...         "A": [1, 2, 3, 4],
...         "B": [400, 500, 600, 700],
...     }
... )
>>> df
shape: (4, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 400 │
│ 2   ┆ 500 │
│ 3   ┆ 600 │
│ 4   ┆ 700 │
└─────┴─────┘
>>> new_df = pl.DataFrame(
...     {
...         "B": [4, None, 6],
...         "C": [7, 8, 9],
...     }
... )
>>> new_df
shape: (3, 2)
┌──────┬─────┐
│ B    ┆ C   │
│ ---  ┆ --- │
│ i64  ┆ i64 │
╞══════╪═════╡
│ 4    ┆ 7   │
│ null ┆ 8   │
│ 6    ┆ 9   │
└──────┴─────┘
>>> df.update(new_df)
shape: (4, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 2   ┆ 500 │
│ 3   ┆ 6   │
│ 4   ┆ 700 │
└─────┴─────┘
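
The "join + coalesce" description can be spelled out for the default case (how='left', no `on`, so rows are matched by position). This is a plain-Python sketch over lists of dicts; `update_rows` is a hypothetical helper, not the polars implementation.

```python
def update_rows(left, right):
    """Left-style update by row position, taking right's non-null values."""
    updated = []
    for i, row in enumerate(left):
        # Left join on the row number: unmatched left rows see no partner.
        other = right[i] if i < len(right) else {}
        new_row = dict(row)
        for name in row:
            # Coalesce: prefer the other frame's value unless it is null.
            if other.get(name) is not None:
                new_row[name] = other[name]
        updated.append(new_row)
    return updated


left = [{"A": 1, "B": 400}, {"A": 2, "B": 500}, {"A": 3, "B": 600}, {"A": 4, "B": 700}]
right = [{"B": 4, "C": 7}, {"B": None, "C": 8}, {"B": 6, "C": 9}]
result = update_rows(left, right)
print([row["B"] for row in result])
```

Only columns that already exist on the left are touched, so column `C` is not added and column `A` is unchanged, as in the output above.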

var(ddof: int = 1) → Self[source]

Aggregate the columns in the LazyFrame to their variance value.

Parameters:

ddof: “Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.var().collect()
shape: (1, 2)
┌──────────┬──────┐
│ a        ┆ b    │
│ ---      ┆ ---  │
│ f64      ┆ f64  │
╞══════════╪══════╡
│ 1.666667 ┆ 0.25 │
└──────────┴──────┘
>>> lf.var(ddof=0).collect()
shape: (1, 2)
┌──────┬────────┐
│ a    ┆ b      │
│ ---  ┆ ---    │
│ f64  ┆ f64    │
╞══════╪════════╡
│ 1.25 ┆ 0.1875 │
└──────┴────────┘

property width: int[source]

Get the width of the LazyFrame.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [4, 5, 6],
...     }
... )
>>> lf.width
2

with_columns( *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr, ) → Self[source]

Add columns to this DataFrame.

Added columns will replace existing columns with the same name.

Parameters:

*exprs: Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
**named_exprs: Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:

LazyFrame: A new LazyFrame with the columns added.

Notes

Creating a new LazyFrame using this method does not create a new copy of existing data.

Examples

Pass an expression to add it as a new column.

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [0.5, 4, 10, 13],
...         "c": [True, True, False, True],
...     }
... )
>>> lf.with_columns((pl.col("a") ** 2).alias("a^2")).collect()
shape: (4, 4)
┌─────┬──────┬───────┬──────┐
│ a   ┆ b    ┆ c     ┆ a^2  │
│ --- ┆ ---  ┆ ---   ┆ ---  │
│ i64 ┆ f64  ┆ bool  ┆ f64  │
╞═════╪══════╪═══════╪══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 1.0  │
│ 2   ┆ 4.0  ┆ true  ┆ 4.0  │
│ 3   ┆ 10.0 ┆ false ┆ 9.0  │
│ 4   ┆ 13.0 ┆ true  ┆ 16.0 │
└─────┴──────┴───────┴──────┘

Added columns will replace existing columns with the same name.

>>> lf.with_columns(pl.col("a").cast(pl.Float64)).collect()
shape: (4, 3)
┌─────┬──────┬───────┐
│ a   ┆ b    ┆ c     │
│ --- ┆ ---  ┆ ---   │
│ f64 ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╡
│ 1.0 ┆ 0.5  ┆ true  │
│ 2.0 ┆ 4.0  ┆ true  │
│ 3.0 ┆ 10.0 ┆ false │
│ 4.0 ┆ 13.0 ┆ true  │
└─────┴──────┴───────┘

Multiple columns can be added by passing a list of expressions.

>>> lf.with_columns(
...     [
...         (pl.col("a") ** 2).alias("a^2"),
...         (pl.col("b") / 2).alias("b/2"),
...         (pl.col("c").is_not()).alias("not c"),
...     ]
... ).collect()
shape: (4, 6)
┌─────┬──────┬───────┬──────┬──────┬───────┐
│ a   ┆ b    ┆ c     ┆ a^2  ┆ b/2  ┆ not c │
│ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  ┆ f64  ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╪══════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 1.0  ┆ 0.25 ┆ false │
│ 2   ┆ 4.0  ┆ true  ┆ 4.0  ┆ 2.0  ┆ false │
│ 3   ┆ 10.0 ┆ false ┆ 9.0  ┆ 5.0  ┆ true  │
│ 4   ┆ 13.0 ┆ true  ┆ 16.0 ┆ 6.5  ┆ false │
└─────┴──────┴───────┴──────┴──────┴───────┘

Multiple columns can also be added using positional arguments instead of a list.

>>> lf.with_columns(
...     (pl.col("a") ** 2).alias("a^2"),
...     (pl.col("b") / 2).alias("b/2"),
...     (pl.col("c").is_not()).alias("not c"),
... ).collect()
shape: (4, 6)
┌─────┬──────┬───────┬──────┬──────┬───────┐
│ a   ┆ b    ┆ c     ┆ a^2  ┆ b/2  ┆ not c │
│ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  ┆ f64  ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╪══════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 1.0  ┆ 0.25 ┆ false │
│ 2   ┆ 4.0  ┆ true  ┆ 4.0  ┆ 2.0  ┆ false │
│ 3   ┆ 10.0 ┆ false ┆ 9.0  ┆ 5.0  ┆ true  │
│ 4   ┆ 13.0 ┆ true  ┆ 16.0 ┆ 6.5  ┆ false │
└─────┴──────┴───────┴──────┴──────┴───────┘

Use keyword arguments to easily name your expression inputs.

>>> lf.with_columns(
...     ab=pl.col("a") * pl.col("b"),
...     not_c=pl.col("c").is_not(),
... ).collect()
shape: (4, 5)
┌─────┬──────┬───────┬──────┬───────┐
│ a   ┆ b    ┆ c     ┆ ab   ┆ not_c │
│ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 0.5  ┆ false │
│ 2   ┆ 4.0  ┆ true  ┆ 8.0  ┆ false │
│ 3   ┆ 10.0 ┆ false ┆ 30.0 ┆ true  │
│ 4   ┆ 13.0 ┆ true  ┆ 52.0 ┆ false │
└─────┴──────┴───────┴──────┴───────┘

Expressions with multiple outputs can be automatically instantiated as Structs by enabling the experimental setting Config.set_auto_structify(True):

>>> with pl.Config(auto_structify=True):
...     lf.drop("c").with_columns(
...         diffs=pl.col(["a", "b"]).diff().suffix("_diff"),
...     ).collect()
...
shape: (4, 3)
┌─────┬──────┬─────────────┐
│ a   ┆ b    ┆ diffs       │
│ --- ┆ ---  ┆ ---         │
│ i64 ┆ f64  ┆ struct[2]   │
╞═════╪══════╪═════════════╡
│ 1   ┆ 0.5  ┆ {null,null} │
│ 2   ┆ 4.0  ┆ {1,3.5}     │
│ 3   ┆ 10.0 ┆ {1,6.0}     │
│ 4   ┆ 13.0 ┆ {1,3.0}     │
└─────┴──────┴─────────────┘

with_columns_seq( *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr, ) → Self[source]

Add columns to this DataFrame.

Added columns will replace existing columns with the same name.

This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.

Parameters:

*exprs: Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
**named_exprs: Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:

LazyFrame: A new LazyFrame with the columns added.

See also

with_columns

with_context(other: Self | list[Self]) → Self[source]

Add an external context to the computation graph.

This allows expressions to also access columns from DataFrames that are not part of this one.

Parameters:

other: Lazy DataFrame to join with.

Examples

>>> lf = pl.LazyFrame({"a": [1, 2, 3], "b": ["a", "c", None]})
>>> lf_other = pl.LazyFrame({"c": ["foo", "ham"]})
>>> lf.with_context(lf_other).select(
...     pl.col("b") + pl.col("c").first()
... ).collect()
shape: (3, 1)
┌──────┐
│ b    │
│ ---  │
│ str  │
╞══════╡
│ afoo │
│ cfoo │
│ null │
└──────┘

Fill nulls with the median from another dataframe:

>>> train_lf = pl.LazyFrame(
...     {"feature_0": [-1.0, 0, 1], "feature_1": [-1.0, 0, 1]}
... )
>>> test_lf = pl.LazyFrame(
...     {"feature_0": [-1.0, None, 1], "feature_1": [-1.0, 0, 1]}
... )
>>> test_lf.with_context(train_lf.select(pl.all().suffix("_train"))).select(
...     pl.col("feature_0").fill_null(pl.col("feature_0_train").median())
... ).collect()
shape: (3, 1)
┌───────────┐
│ feature_0 │
│ ---       │
│ f64       │
╞═══════════╡
│ -1.0      │
│ 0.0       │
│ 1.0       │
└───────────┘

with_row_count(name: str = 'row_nr', offset: int = 0) → Self[source]

Add a column at index 0 that counts the rows.

Parameters:

name: Name of the column to add.
offset: Start the row count at this offset.

Warning

This can have a negative effect on query performance. This may, for instance, block predicate pushdown optimization.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> lf.with_row_count().collect()
shape: (3, 3)
┌────────┬─────┬─────┐
│ row_nr ┆ a   ┆ b   │
│ ---    ┆ --- ┆ --- │
│ u32    ┆ i64 ┆ i64 │
╞════════╪═════╪═════╡
│ 0      ┆ 1   ┆ 2   │
│ 1      ┆ 3   ┆ 4   │
│ 2      ┆ 5   ┆ 6   │
└────────┴─────┴─────┘
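
The example above uses the default offset of 0. The effect of a non-zero offset can be sketched with `enumerate` on a list of dict rows; `with_row_count_sketch` is a hypothetical helper that mirrors the parameter names of the method.

```python
def with_row_count_sketch(rows, name="row_nr", offset=0):
    # The counter column is placed first, followed by the original columns.
    return [{name: i, **row} for i, row in enumerate(rows, start=offset)]


rows = [{"a": 1, "b": 2}, {"a": 3, "b": 4}, {"a": 5, "b": 6}]
counted = with_row_count_sketch(rows, offset=10)
print(counted)
```

Here the counter starts at 10 instead of 0, and the new column sits at index 0 as described above.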