LazyFrame

This page gives an overview of all public LazyFrame methods.

class polars.LazyFrame( data: FrameInitTypes | None = None, schema: SchemaDefinition | None = None, *, schema_overrides: SchemaDict | None = None, strict: bool = True, orient: Orientation | None = None, infer_schema_length: int | None = 100, nan_to_null: bool = False, height: int | None = None, )[source]

Representation of a Lazy computation graph/query against a DataFrame.

This allows for whole-query optimisation in addition to parallelism, and is the preferred (and highest-performance) mode of operation for polars.

Parameters:

data: dict, Sequence, ndarray, Series, or pandas.DataFrame

Two-dimensional data in various forms; dict input must contain Sequences, Generators, or a range. Sequence may contain Series or other Sequences.

schema: Sequence of str, (str, DataType) pairs, or a {str: DataType} dict

The LazyFrame schema may be declared in several ways:

As a dict of {name:type} pairs; if type is None, it will be auto-inferred.
As a list of column names; in this case types are automatically inferred.
As a list of (name,type) pairs; this is equivalent to the dictionary form.

The order of the schema determines the column order of the frame. When passing a dict, its insertion order is respected. To override specific column data types by name without changing column order, use schema_overrides instead.

If you supply a list of column names that does not match the names in the underlying data, the names given here will overwrite them. The number of names given in the schema should match the underlying data dimensions.

schema_overrides: dict, default None

Support type specification or override of one or more columns; note that any dtypes inferred from the schema param will be overridden.

The number of entries in the schema should match the underlying data dimensions, unless a sequence of dictionaries is being passed, in which case a partial schema can be declared to prevent specific fields from being loaded.

strict: bool, default True

Throw an error if any data value does not exactly match the given or inferred data type for that column. If set to False, values that do not match the data type are cast to that data type or, if casting is not possible, set to null instead.

orient: {‘col’, ‘row’}, default None

Whether to interpret two-dimensional data as columns or as rows. If None, the orientation is inferred by matching the columns and data dimensions. If this does not yield conclusive results, column orientation is used.

infer_schema_length: int or None

The maximum number of rows to scan for schema inference. If set to None, the full data may be scanned (this can be slow). This parameter only applies if the input data is a sequence or generator of rows; other input is read as-is.

nan_to_null: bool, default False

If the data comes from one or more numpy arrays, can optionally convert input data np.nan values to null instead. This is a no-op for all other input data.

height: int or None, default None

Allows constructing LazyFrames with 0 width and a specified height. If passed with data, ensures the resulting LazyFrame has this height.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Notes

Initialising LazyFrame(...) directly is equivalent to DataFrame(...).lazy().

Examples

Constructing a LazyFrame directly from a dictionary:

>>> data = {"a": [1, 2], "b": [3, 4]}
>>> lf = pl.LazyFrame(data)
>>> lf.collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘

Notice that the dtypes are automatically inferred as Polars Int64:

>>> lf.collect_schema().dtypes()
[Int64, Int64]

To specify a more detailed/specific frame schema you can supply the schema parameter with a dictionary of (name,dtype) pairs…

>>> data = {"col1": [0, 2], "col2": [3, 7]}
>>> lf2 = pl.LazyFrame(data, schema={"col1": pl.Float32, "col2": pl.Int64})
>>> lf2.collect()
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f32  ┆ i64  │
╞══════╪══════╡
│ 0.0  ┆ 3    │
│ 2.0  ┆ 7    │
└──────┴──────┘

…a sequence of (name,dtype) pairs…

>>> data = {"col1": [1, 2], "col2": [3, 4]}
>>> lf3 = pl.LazyFrame(data, schema=[("col1", pl.Float32), ("col2", pl.Int64)])
>>> lf3.collect()
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f32  ┆ i64  │
╞══════╪══════╡
│ 1.0  ┆ 3    │
│ 2.0  ┆ 4    │
└──────┴──────┘

…or a list of typed Series.

>>> data = [
...     pl.Series("col1", [1, 2], dtype=pl.Float32),
...     pl.Series("col2", [3, 4], dtype=pl.Int64),
... ]
>>> lf4 = pl.LazyFrame(data)
>>> lf4.collect()
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f32  ┆ i64  │
╞══════╪══════╡
│ 1.0  ┆ 3    │
│ 2.0  ┆ 4    │
└──────┴──────┘

Constructing a LazyFrame from a numpy ndarray, specifying column names:

>>> import numpy as np
>>> data = np.array([(1, 2), (3, 4)], dtype=np.int64)
>>> lf5 = pl.LazyFrame(data, schema=["a", "b"], orient="col")
>>> lf5.collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘

Constructing a LazyFrame from a list of lists, row orientation specified:

>>> data = [[1, 2, 3], [4, 5, 6]]
>>> lf6 = pl.LazyFrame(data, schema=["a", "b", "c"], orient="row")
>>> lf6.collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
│ 4   ┆ 5   ┆ 6   │
└─────┴─────┴─────┘

Methods:

`approx_n_unique`	Approximate count of unique values.
`bottom_k`	Return the `k` smallest rows.
`cache`	Cache the result once the execution of the physical plan hits this node.
`cast`	Cast LazyFrame column(s) to the specified dtype(s).
`clear`	Create an empty copy of the current LazyFrame, with zero to 'n' rows.
`clone`	Create a copy of this LazyFrame.
`collect`	Materialize this `LazyFrame` into a `DataFrame`.
`collect_async`	Collect DataFrame asynchronously in thread pool.
`collect_batches`	Evaluate the query in streaming mode and get a generator that returns chunks.
`collect_schema`	Resolve the schema of this LazyFrame.
`count`	Return the number of non-null elements for each column.
`describe`	Creates a summary of statistics for a LazyFrame, returning a DataFrame.
`deserialize`	Read a logical plan from a file to construct a LazyFrame.
`drop`	Remove columns from the DataFrame.
`drop_nans`	Drop all rows that contain one or more NaN values.
`drop_nulls`	Drop all rows that contain one or more null values.
`execute`	Execute the query into a `QueryResult`.
`explain`	Create a string representation of the query plan.
`explode`	Explode the DataFrame to long format by exploding the given columns.
`fetch`	Collect a small number of rows for debugging purposes.
`fill_nan`	Fill floating point NaN values.
`fill_null`	Fill null values using the specified value or strategy.
`filter`	Filter rows in the LazyFrame based on a predicate expression.
`first`	Get the first row of the DataFrame.
`gather`	Selects rows from this LazyFrame at the given indices.
`gather_every`	Take every nth row in the LazyFrame and return as a new LazyFrame.
`group_by`	Start a group by operation.
`group_by_dynamic`	Group based on a time value (or index value of type Int32, Int64).
`head`	Get the first `n` rows.
`inspect`	Inspect a node in the computation graph.
`interpolate`	Interpolate intermediate values.
`join`	Add a join operation to the Logical Plan.
`join_asof`	Perform an asof join.
`join_where`	Perform a join based on one or multiple (in)equality predicates.
`last`	Get the last row of the DataFrame.
`lazy`	Return lazy representation, i.e. itself.
`limit`	Get the first `n` rows.
`map_batches`	Apply a custom function.
`match_to_schema`	Match or evolve the schema of a LazyFrame into a specific schema.
`max`	Aggregate the columns in the LazyFrame to their maximum value.
`mean`	Aggregate the columns in the LazyFrame to their mean value.
`median`	Aggregate the columns in the LazyFrame to their median value.
`melt`	Unpivot a DataFrame from wide to long format.
`merge_sorted`	Take two sorted DataFrames and merge them by the sorted key.
`min`	Aggregate the columns in the LazyFrame to their minimum value.
`null_count`	Aggregate the columns in the LazyFrame as the sum of their null value count.
`pipe`	Offers a structured way to apply a sequence of user-defined functions (UDFs).
`pipe_with_schema`	Allows to alter the lazy frame during the plan stage with the resolved schema.
`pivot`	Create a spreadsheet-style pivot table as a DataFrame.
`profile`	Profile a LazyFrame.
`quantile`	Aggregate the columns in the LazyFrame to their quantile value.
`remote`	Run a query remotely on Polars Cloud.
`remove`	Remove rows, dropping those that match the given predicate expression(s).
`rename`	Rename column names.
`reverse`	Reverse the DataFrame.
`rolling`	Create rolling groups based on a temporal or integer column.
`select`	Select columns from this LazyFrame.
`select_seq`	Select columns from this LazyFrame.
`serialize`	Serialize the logical plan of this LazyFrame to a file or string in JSON format.
`set_sorted`	Flag a column as sorted.
`shift`	Shift values by the given number of indices.
`show`	Show the first `n` rows.
`show_graph`	Show a plot of the query plan.
`sink_batches`	Evaluate the query and call a user-defined function for every ready batch.
`sink_csv`	Evaluate the query in streaming mode and write to a CSV file.
`sink_delta`	Sink DataFrame as delta table.
`sink_iceberg`	Sink a LazyFrame to an Iceberg table.
`sink_ipc`	Evaluate the query in streaming mode and write to an IPC file.
`sink_ndjson`	Evaluate the query in streaming mode and write to an NDJSON file.
`sink_parquet`	Evaluate the query in streaming mode and write to a Parquet file.
`slice`	Get a slice of this DataFrame.
`sort`	Sort the LazyFrame by the given columns.
`sql`	Execute a SQL query against the LazyFrame.
`std`	Aggregate the columns in the LazyFrame to their standard deviation value.
`sum`	Aggregate the columns in the LazyFrame to their sum value.
`tail`	Get the last `n` rows.
`top_k`	Return the `k` largest rows.
`unique`	Drop duplicate rows from this LazyFrame.
`unnest`	Decompose struct columns into separate columns for each of their fields.
`unpivot`	Unpivot a DataFrame from wide to long format.
`update`	Update the values in this `LazyFrame` with the values in `other`.
`var`	Aggregate the columns in the LazyFrame to their variance value.
`with_columns`	Add columns to this LazyFrame.
`with_columns_seq`	Add columns to this LazyFrame.
`with_context`	Add an external context to the computation graph.
`with_row_count`	Add a column at index 0 that counts the rows.
`with_row_index`	Add a row index as the first column in the LazyFrame.

Attributes:

`columns`	Get the column names.
`dtypes`	Get the column data types.
`schema`	Get an ordered mapping of column names to their data type.
`width`	Get the number of columns.

approx_n_unique() → LazyFrame[source]

Approximate count of unique values.

Deprecated since version 0.20.11: Use select(pl.all().approx_n_unique()) instead.

This is done using the HyperLogLog++ algorithm for cardinality estimation.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.approx_n_unique().collect()  
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 4   ┆ 2   │
└─────┴─────┘

bottom_k( k: int, *, by: IntoExpr | Iterable[IntoExpr], reverse: bool | Sequence[bool] = False, ) → LazyFrame[source]

Return the k smallest rows.

Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort() after this function if you need sorted output.

Changed in version 1.0.0: The descending parameter was renamed reverse.

Parameters:

k: Number of rows to return.
by: Column(s) used to determine the bottom rows. Accepts expression input. Strings are parsed as column names.
reverse: Consider the k largest elements of the by column(s) (instead of the k smallest). This can be specified per column by passing a sequence of booleans.

See also

top_k

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [2, 1, 1, 3, 2, 1],
...     }
... )

Get the rows which contain the 4 smallest values in column b.

>>> lf.bottom_k(4, by="b").collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ b   ┆ 1   │
│ a   ┆ 1   │
│ c   ┆ 1   │
│ a   ┆ 2   │
└─────┴─────┘

Get the rows which contain the 4 smallest values when sorting on columns a and b.

>>> lf.bottom_k(4, by=["a", "b"]).collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 1   │
│ a   ┆ 2   │
│ b   ┆ 1   │
│ b   ┆ 2   │
└─────┴─────┘

cache() → LazyFrame[source]

Cache the result once the execution of the physical plan hits this node.

Using this is not recommended, as the optimizer can likely do a better job.

cast(dtypes, *, strict: bool = True) → LazyFrame

Cast LazyFrame column(s) to the specified dtype(s).

Parameters:

dtypes: Mapping of column names (or selector) to dtypes, or a single dtype to which all columns will be cast.
strict: Throw an error if a cast could not be done (for instance, due to an overflow).

Examples

>>> from datetime import date
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": [date(2020, 1, 2), date(2021, 3, 4), date(2022, 5, 6)],
...     }
... )

Cast specific frame columns to the specified dtypes:

>>> lf.cast({"foo": pl.Float32, "bar": pl.UInt8}).collect()
shape: (3, 3)
┌─────┬─────┬────────────┐
│ foo ┆ bar ┆ ham        │
│ --- ┆ --- ┆ ---        │
│ f32 ┆ u8  ┆ date       │
╞═════╪═════╪════════════╡
│ 1.0 ┆ 6   ┆ 2020-01-02 │
│ 2.0 ┆ 7   ┆ 2021-03-04 │
│ 3.0 ┆ 8   ┆ 2022-05-06 │
└─────┴─────┴────────────┘

Cast all frame columns matching one dtype (or dtype group) to another dtype:

>>> lf.cast({pl.Date: pl.Datetime}).collect()
shape: (3, 3)
┌─────┬─────┬─────────────────────┐
│ foo ┆ bar ┆ ham                 │
│ --- ┆ --- ┆ ---                 │
│ i64 ┆ f64 ┆ datetime[μs]        │
╞═════╪═════╪═════════════════════╡
│ 1   ┆ 6.0 ┆ 2020-01-02 00:00:00 │
│ 2   ┆ 7.0 ┆ 2021-03-04 00:00:00 │
│ 3   ┆ 8.0 ┆ 2022-05-06 00:00:00 │
└─────┴─────┴─────────────────────┘

Use selectors to define the columns being cast:

>>> import polars.selectors as cs
>>> lf.cast({cs.numeric(): pl.UInt32, cs.temporal(): pl.String}).collect()
shape: (3, 3)
┌─────┬─────┬────────────┐
│ foo ┆ bar ┆ ham        │
│ --- ┆ --- ┆ ---        │
│ u32 ┆ u32 ┆ str        │
╞═════╪═════╪════════════╡
│ 1   ┆ 6   ┆ 2020-01-02 │
│ 2   ┆ 7   ┆ 2021-03-04 │
│ 3   ┆ 8   ┆ 2022-05-06 │
└─────┴─────┴────────────┘

Cast all frame columns to the specified dtype:

>>> lf.cast(pl.String).collect().to_dict(as_series=False)
{'foo': ['1', '2', '3'],
 'bar': ['6.0', '7.0', '8.0'],
 'ham': ['2020-01-02', '2021-03-04', '2022-05-06']}

clear(n: int = 0) → LazyFrame[source]

Create an empty copy of the current LazyFrame, with zero to ‘n’ rows.

Returns a copy with an identical schema but no data.

Parameters:

n: Number of (empty) rows to return in the cleared frame.

See also

clone: Cheap deepcopy/clone.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [None, 2, 3, 4],
...         "b": [0.5, None, 2.5, 13],
...         "c": [True, True, False, None],
...     }
... )
>>> lf.clear().collect()
shape: (0, 3)
┌─────┬─────┬──────┐
│ a   ┆ b   ┆ c    │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ f64 ┆ bool │
╞═════╪═════╪══════╡
└─────┴─────┴──────┘

>>> lf.clear(2).collect()
shape: (2, 3)
┌──────┬──────┬──────┐
│ a    ┆ b    ┆ c    │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ f64  ┆ bool │
╞══════╪══════╪══════╡
│ null ┆ null ┆ null │
│ null ┆ null ┆ null │
└──────┴──────┴──────┘

clone() → LazyFrame[source]

Create a copy of this LazyFrame.

This is a cheap operation that does not copy data.

See also

clear: Create an empty copy of the current LazyFrame, with identical schema but no data.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [None, 2, 3, 4],
...         "b": [0.5, None, 2.5, 13],
...         "c": [True, True, False, None],
...     }
... )
>>> lf.clone()
<LazyFrame at ...>

collect( *, type_coercion: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, slice_pushdown: bool = True, comm_subplan_elim: bool = True, comm_subexpr_elim: bool = True, cluster_with_columns: bool = True, collapse_joins: bool = True, no_optimization: bool = False, engine: EngineType = 'auto', background: bool = False, optimizations: QueryOptFlags = (), **_kwargs: Any, ) → DataFrame | InProcessQuery[source]

Materialize this LazyFrame into a DataFrame.

By default, all query optimizations are enabled. Individual optimizations may be disabled by setting the corresponding parameter to False.

Parameters:

type_coercion

Do type coercion optimization.

Deprecated since version 1.30.0: Use the optimizations parameter.

predicate_pushdown

Do predicate pushdown optimization.

Deprecated since version 1.30.0: Use the optimizations parameter.

projection_pushdown

Do projection pushdown optimization.

Deprecated since version 1.30.0: Use the optimizations parameter.

simplify_expression

Run simplify expressions optimization.

Deprecated since version 1.30.0: Use the optimizations parameter.

slice_pushdown

Do slice pushdown optimization.

Deprecated since version 1.30.0: Use the optimizations parameter.

comm_subplan_elim

Try to cache branching subplans that occur on self-joins or unions.

Deprecated since version 1.30.0: Use the optimizations parameter.

comm_subexpr_elim

Cache and reuse common subexpressions.

Deprecated since version 1.30.0: Use the optimizations parameter.

cluster_with_columns

Combine sequential independent calls to with_columns.

Deprecated since version 1.30.0: Use the optimizations parameter.

collapse_joins

Collapse a join and filters into a faster join.

Deprecated since version 1.30.0: Use the optimizations parameter.

no_optimization

Turn off (certain) optimizations.

Deprecated since version 1.30.0: Use the optimizations parameter.

engine

Select the engine used to process the query (default "auto"):

"auto": use the engine set by Config.set_engine_affinity or the POLARS_ENGINE_AFFINITY environment variable, falling back to "in-memory" if unset (this default may change in a future release).
"in-memory": use the in-memory engine, this is the default engine.
"streaming": use the streaming engine, which processes queries in batches, reducing memory pressure and often outperforming the in-memory engine. This will soon become the default engine of Polars.
"gpu": use the CUDA GPU engine (requires an Nvidia GPU and cudf-polars). Pass a GPUEngine object for fine-grained control (e.g. device selection on multi-GPU systems).

If the selected engine cannot run the query, Polars falls back to the in-memory engine.

Note

GPU mode is considered unstable. Not all queries will run successfully on the GPU; those that are not supported should fall back transparently to the default engine.

Running with POLARS_VERBOSE=1 will provide information if a query falls back (and why).

background

Run the query in the background and get a handle to the query. This handle can be used to fetch the result or cancel the query.

Warning

Background mode is considered unstable. It may be changed at any point without it being considered a breaking change.

optimizations

The optimization passes done during query optimization.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Returns:

DataFrame

See also

explain: Print the query plan that is evaluated with collect.
profile: Collect the LazyFrame and time each node in the computation graph.
polars.collect_all: Collect multiple LazyFrames at the same time.
polars.Config.set_streaming_chunk_size: Set the size of streaming batches.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a").agg(pl.all().sum()).collect()  
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 4   ┆ 10  │
│ b   ┆ 11  ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘

Collect in streaming mode:

>>> lf.group_by("a").agg(pl.all().sum()).collect(
...     engine="streaming"
... )  
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 4   ┆ 10  │
│ b   ┆ 11  ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘

Collect in GPU mode:

>>> lf.group_by("a").agg(pl.all().sum()).collect(engine="gpu")  
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ b   ┆ 11  ┆ 10  │
│ a   ┆ 4   ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘

With control over the device used:

>>> lf.group_by("a").agg(pl.all().sum()).collect(
...     engine=pl.GPUEngine(device=1)
... )  
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ b   ┆ 11  ┆ 10  │
│ a   ┆ 4   ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘

collect_async( *, gevent: bool = False, engine: EngineType = 'auto', optimizations: QueryOptFlags = (), ) → Awaitable[DataFrame] | _GeventDataFrameResult[DataFrame][source]

Collect DataFrame asynchronously in a thread pool.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Collects into a DataFrame (like collect()) but, instead of returning a DataFrame directly, it is scheduled to be collected inside a thread pool, while this method returns almost instantly.

This can be useful if you use gevent or asyncio and want to release control to other greenlets/tasks while LazyFrames are being collected.

Parameters:

gevent

Return a wrapper around gevent.event.AsyncResult instead of an Awaitable.

engine

Select the engine used to process the query (default "auto"):

"auto": use the engine set by Config.set_engine_affinity or the POLARS_ENGINE_AFFINITY environment variable, falling back to "in-memory" if unset (this default may change in a future release).
"in-memory": use the in-memory engine, this is the default engine.
"streaming": use the streaming engine, which processes queries in batches, reducing memory pressure and often outperforming the in-memory engine. This will soon become the default engine of Polars.
"gpu": use the CUDA GPU engine (requires an Nvidia GPU and cudf-polars). Pass a GPUEngine object for fine-grained control (e.g. device selection on multi-GPU systems).

If the selected engine cannot run the query, Polars falls back to the in-memory engine.

Note

The GPU engine does not support async or running in the background. If either is enabled, GPU execution is switched off.

optimizations

The optimization passes done during query optimization.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Returns:

If gevent=False (default), returns an awaitable.
If gevent=True, returns a wrapper that has a .get(block=True, timeout=None) method.

See also

polars.collect_all: Collect multiple LazyFrames at the same time.
polars.collect_all_async: Collect multiple LazyFrames at the same time lazily.

Notes

If an error occurs, it is set on the asyncio.Future/gevent.event.AsyncResult via set_exception and is re-raised when the result is awaited or retrieved.

Examples

>>> import asyncio
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> async def main():
...     return await (
...         lf.group_by("a", maintain_order=True)
...         .agg(pl.all().sum())
...         .collect_async()
...     )
>>> asyncio.run(main())
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 4   ┆ 10  │
│ b   ┆ 11  ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘

collect_batches( *, chunk_size: int | None = None, maintain_order: bool = True, lazy: bool = False, engine: EngineType = 'auto', optimizations: QueryOptFlags = (), ) → Iterator[DataFrame][source]

Evaluate the query in streaming mode and get a generator that returns chunks.

This allows streaming results that are larger than RAM to be written to disk.

The query will always be fully executed unless stop is called, so you should call next until all chunks have been seen.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Warning

This method is much slower than native sinks. Only use it if you cannot implement your logic otherwise.

Parameters:

chunk_size

The number of rows that are buffered before a chunk is given.

maintain_order

Maintain the order in which data is processed. Setting this to False will be slightly faster.

lazy

Start the query when first requesting a batch.

engine

Select the engine used to process the query (default "auto"):

"auto": use the engine set by Config.set_engine_affinity or the POLARS_ENGINE_AFFINITY environment variable, falling back to "streaming" if unset.
"in-memory": use the in-memory engine before writing, this is the default engine.
"streaming": use the streaming engine, which processes queries in batches, reducing memory pressure and often outperforming the in-memory engine. This will soon become the default engine of Polars.
"gpu": use the CUDA GPU engine (requires an Nvidia GPU and cudf-polars). Pass a GPUEngine object for fine-grained control.

If the selected engine cannot run the query, Polars falls back to the streaming engine.

optimizations

The optimization passes done during query optimization.

Examples

>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")  
>>> for df in lf.collect_batches():
...     print(df)  

collect_schema() → Schema[source]

Resolve the schema of this LazyFrame.

Caution

Computing the schema of a LazyFrame is a potentially expensive operation, as it may involve reading metadata from (slow) disk storage, or performing network requests if the data is remote.

Examples

Determine the schema.

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.collect_schema()
Schema({'foo': Int64, 'bar': Float64, 'ham': String})

Access various properties of the schema.

>>> schema = lf.collect_schema()
>>> schema["bar"]
Float64
>>> schema.names()
['foo', 'bar', 'ham']
>>> schema.dtypes()
[Int64, Float64, String]
>>> schema.len()
3

property columns: list[str][source]

Get the column names.

Returns:

list of str: A list containing the name of each column in order.

Warning

Determining the column names of a LazyFrame requires resolving its schema, which is a potentially expensive operation. Using collect_schema() is the idiomatic way of resolving the schema. This property exists only for symmetry with the DataFrame class.

See also

collect_schema
Schema.names

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... ).select("foo", "bar")
>>> lf.columns  
['foo', 'bar']

count() → LazyFrame[source]

Return the number of non-null elements for each column.

Examples

>>> lf = pl.LazyFrame(
...     {"a": [1, 2, 3, 4], "b": [1, 2, 1, None], "c": [None, None, None, None]}
... )
>>> lf.count().collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 │
╞═════╪═════╪═════╡
│ 4   ┆ 3   ┆ 0   │
└─────┴─────┴─────┘

describe( percentiles: Sequence[float] | float | None = (0.25, 0.5, 0.75), *, interpolation: QuantileMethod = 'nearest', ) → DataFrame[source]

Creates a summary of statistics for a LazyFrame, returning a DataFrame.

Parameters:

percentiles: One or more percentiles to include in the summary statistics. All values must be in the range [0, 1].
interpolation: {‘nearest’, ‘higher’, ‘lower’, ‘midpoint’, ‘linear’, ‘equiprobable’}. Interpolation method used when calculating percentiles.

Returns:

DataFrame

Warning

This method does not maintain the laziness of the frame, and will collect the final result. This could potentially be an expensive operation.
We do not guarantee the output of describe to be stable. It will show statistics that we deem informative, and may be updated in the future. Using describe programmatically (versus interactive exploration) is not recommended for this reason.

Notes

The median is included by default as the 50th percentile (the 50% row).

Examples

>>> from datetime import date, time
>>> lf = pl.LazyFrame(
...     {
...         "float": [1.0, 2.8, 3.0],
...         "int": [40, 50, None],
...         "bool": [True, False, True],
...         "str": ["zz", "xx", "yy"],
...         "date": [date(2020, 1, 1), date(2021, 7, 5), date(2022, 12, 31)],
...         "time": [time(10, 20, 30), time(14, 45, 50), time(23, 15, 10)],
...     }
... )

Show default frame statistics:

>>> lf.describe()
shape: (9, 7)
┌────────────┬──────────┬──────────┬──────────┬──────┬─────────────────────┬──────────┐
│ statistic  ┆ float    ┆ int      ┆ bool     ┆ str  ┆ date                ┆ time     │
│ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---  ┆ ---                 ┆ ---      │
│ str        ┆ f64      ┆ f64      ┆ f64      ┆ str  ┆ str                 ┆ str      │
╞════════════╪══════════╪══════════╪══════════╪══════╪═════════════════════╪══════════╡
│ count      ┆ 3.0      ┆ 2.0      ┆ 3.0      ┆ 3    ┆ 3                   ┆ 3        │
│ null_count ┆ 0.0      ┆ 1.0      ┆ 0.0      ┆ 0    ┆ 0                   ┆ 0        │
│ mean       ┆ 2.266667 ┆ 45.0     ┆ 0.666667 ┆ null ┆ 2021-07-02 16:00:00 ┆ 16:07:10 │
│ std        ┆ 1.101514 ┆ 7.071068 ┆ null     ┆ null ┆ null                ┆ null     │
│ min        ┆ 1.0      ┆ 40.0     ┆ 0.0      ┆ xx   ┆ 2020-01-01          ┆ 10:20:30 │
│ 25%        ┆ 2.8      ┆ 40.0     ┆ null     ┆ null ┆ 2021-07-05          ┆ 14:45:50 │
│ 50%        ┆ 2.8      ┆ 50.0     ┆ null     ┆ null ┆ 2021-07-05          ┆ 14:45:50 │
│ 75%        ┆ 3.0      ┆ 50.0     ┆ null     ┆ null ┆ 2022-12-31          ┆ 23:15:10 │
│ max        ┆ 3.0      ┆ 50.0     ┆ 1.0      ┆ zz   ┆ 2022-12-31          ┆ 23:15:10 │
└────────────┴──────────┴──────────┴──────────┴──────┴─────────────────────┴──────────┘

Customize which percentiles are displayed, applying linear interpolation:

>>> with pl.Config(tbl_rows=12):
...     lf.describe(
...         percentiles=[0.1, 0.3, 0.5, 0.7, 0.9],
...         interpolation="linear",
...     )
shape: (11, 7)
┌────────────┬──────────┬──────────┬──────────┬──────┬─────────────────────┬──────────┐
│ statistic  ┆ float    ┆ int      ┆ bool     ┆ str  ┆ date                ┆ time     │
│ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---  ┆ ---                 ┆ ---      │
│ str        ┆ f64      ┆ f64      ┆ f64      ┆ str  ┆ str                 ┆ str      │
╞════════════╪══════════╪══════════╪══════════╪══════╪═════════════════════╪══════════╡
│ count      ┆ 3.0      ┆ 2.0      ┆ 3.0      ┆ 3    ┆ 3                   ┆ 3        │
│ null_count ┆ 0.0      ┆ 1.0      ┆ 0.0      ┆ 0    ┆ 0                   ┆ 0        │
│ mean       ┆ 2.266667 ┆ 45.0     ┆ 0.666667 ┆ null ┆ 2021-07-02 16:00:00 ┆ 16:07:10 │
│ std        ┆ 1.101514 ┆ 7.071068 ┆ null     ┆ null ┆ null                ┆ null     │
│ min        ┆ 1.0      ┆ 40.0     ┆ 0.0      ┆ xx   ┆ 2020-01-01          ┆ 10:20:30 │
│ 10%        ┆ 1.36     ┆ 41.0     ┆ null     ┆ null ┆ 2020-04-20          ┆ 11:13:34 │
│ 30%        ┆ 2.08     ┆ 43.0     ┆ null     ┆ null ┆ 2020-11-26          ┆ 12:59:42 │
│ 50%        ┆ 2.8      ┆ 45.0     ┆ null     ┆ null ┆ 2021-07-05          ┆ 14:45:50 │
│ 70%        ┆ 2.88     ┆ 47.0     ┆ null     ┆ null ┆ 2022-02-07          ┆ 18:09:34 │
│ 90%        ┆ 2.96     ┆ 49.0     ┆ null     ┆ null ┆ 2022-09-13          ┆ 21:33:18 │
│ max        ┆ 3.0      ┆ 50.0     ┆ 1.0      ┆ zz   ┆ 2022-12-31          ┆ 23:15:10 │
└────────────┴──────────┴──────────┴──────────┴──────┴─────────────────────┴──────────┘

classmethod deserialize( source: str | bytes | Path | IOBase, *, format: SerializationFormat = 'binary', ) → LazyFrame[source]

Read a logical plan from a file to construct a LazyFrame.

Parameters:

source

Path to a file or a file-like object (by file-like object, we refer to objects that have a read() method, such as a file handler (e.g. via builtin open function) or BytesIO).

format

The format with which the LazyFrame was serialized. Options:

"binary": Deserialize from binary format (bytes). This is the default.
"json": Deserialize from JSON format (string).

Warning

This function uses pickle if the logical plan contains Python UDFs, and as such inherits the security implications. Deserializing can execute arbitrary code, so it should only be attempted on trusted data.

See also

LazyFrame.serialize

Notes

Serialization is not stable across Polars versions: a LazyFrame serialized in one Polars version may not be deserializable in another Polars version.

Examples

>>> import io
>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum()
>>> bytes = lf.serialize()
>>> pl.LazyFrame.deserialize(io.BytesIO(bytes)).collect()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 6   │
└─────┘

drop( *columns: ColumnNameOrSelector | Iterable[ColumnNameOrSelector], strict: bool = True, ) → LazyFrame[source]

Remove columns from the LazyFrame.

Parameters:

*columns: Names of the columns that should be removed from the LazyFrame. Accepts column selector input.
strict: Validate that all column names exist in the current schema, and throw an exception if any do not.

Examples

Drop a single column by passing the name of that column.

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.drop("ham").collect()
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 6.0 │
│ 2   ┆ 7.0 │
│ 3   ┆ 8.0 │
└─────┴─────┘

Drop multiple columns by passing a selector.

>>> import polars.selectors as cs
>>> lf.drop(cs.numeric()).collect()
shape: (3, 1)
┌─────┐
│ ham │
│ --- │
│ str │
╞═════╡
│ a   │
│ b   │
│ c   │
└─────┘

Use positional arguments to drop multiple columns.

>>> lf.drop("foo", "ham").collect()
shape: (3, 1)
┌─────┐
│ bar │
│ --- │
│ f64 │
╞═════╡
│ 6.0 │
│ 7.0 │
│ 8.0 │
└─────┘

drop_nans( subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None, ) → LazyFrame[source]

Drop all rows that contain one or more NaN values.

The original order of the remaining rows is preserved.

Parameters:

subset: Column name(s) for which NaN values are considered; if set to None (default), use all columns (note that only floating-point columns can contain NaNs).

See also

drop_nulls

Notes

A NaN value is not the same as a null value. To drop null values, use drop_nulls().

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [-20.5, float("nan"), 80.0],
...         "bar": [float("nan"), 110.0, 25.5],
...         "ham": ["xxx", "yyy", None],
...     }
... )

The default behavior of this method is to drop rows where any single value in the row is NaN:

>>> lf.drop_nans().collect()
shape: (1, 3)
┌──────┬──────┬──────┐
│ foo  ┆ bar  ┆ ham  │
│ ---  ┆ ---  ┆ ---  │
│ f64  ┆ f64  ┆ str  │
╞══════╪══════╪══════╡
│ 80.0 ┆ 25.5 ┆ null │
└──────┴──────┴──────┘

This behaviour can be constrained to consider only a subset of columns, as defined by name, or with a selector. For example, dropping rows only if there is a NaN in the “bar” column:

>>> lf.drop_nans(subset=["bar"]).collect()
shape: (2, 3)
┌──────┬───────┬──────┐
│ foo  ┆ bar   ┆ ham  │
│ ---  ┆ ---   ┆ ---  │
│ f64  ┆ f64   ┆ str  │
╞══════╪═══════╪══════╡
│ NaN  ┆ 110.0 ┆ yyy  │
│ 80.0 ┆ 25.5  ┆ null │
└──────┴───────┴──────┘

Dropping a row only if all values are NaN requires a different formulation:

>>> lf = pl.LazyFrame(
...     {
...         "a": [float("nan"), float("nan"), float("nan"), float("nan")],
...         "b": [10.0, 2.5, float("nan"), 5.25],
...         "c": [65.75, float("nan"), float("nan"), 10.5],
...     }
... )
>>> lf.filter(~pl.all_horizontal(pl.all().is_nan())).collect()
shape: (3, 3)
┌─────┬──────┬───────┐
│ a   ┆ b    ┆ c     │
│ --- ┆ ---  ┆ ---   │
│ f64 ┆ f64  ┆ f64   │
╞═════╪══════╪═══════╡
│ NaN ┆ 10.0 ┆ 65.75 │
│ NaN ┆ 2.5  ┆ NaN   │
│ NaN ┆ 5.25 ┆ 10.5  │
└─────┴──────┴───────┘

drop_nulls( subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None, ) → LazyFrame[source]

Drop all rows that contain one or more null values.

The original order of the remaining rows is preserved.

See also

drop_nans

Notes

A null value is not the same as a NaN value. To drop NaN values, use drop_nans().

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, None, 8],
...         "ham": ["a", "b", None],
...     }
... )

The default behavior of this method is to drop rows where any single value in the row is null:

>>> lf.drop_nulls().collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘

This behaviour can be constrained to consider only a subset of columns, as defined by name or with a selector. For example, dropping rows if there is a null in any of the integer columns:

>>> import polars.selectors as cs
>>> lf.drop_nulls(subset=cs.integer()).collect()
shape: (2, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ ham  │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ i64 ┆ str  │
╞═════╪═════╪══════╡
│ 1   ┆ 6   ┆ a    │
│ 3   ┆ 8   ┆ null │
└─────┴─────┴──────┘

Dropping a row only if all values are null requires a different formulation:

>>> lf = pl.LazyFrame(
...     {
...         "a": [None, None, None, None],
...         "b": [1, 2, None, 1],
...         "c": [1, None, None, 1],
...     }
... )
>>> lf.filter(~pl.all_horizontal(pl.all().is_null())).collect()
shape: (3, 3)
┌──────┬─────┬──────┐
│ a    ┆ b   ┆ c    │
│ ---  ┆ --- ┆ ---  │
│ null ┆ i64 ┆ i64  │
╞══════╪═════╪══════╡
│ null ┆ 1   ┆ 1    │
│ null ┆ 2   ┆ null │
│ null ┆ 1   ┆ 1    │
└──────┴─────┴──────┘

property dtypes: list[DataType][source]

Get the column data types.

Returns:

list of DataType: A list containing the data type of each column in order.

Warning

Determining the data types of a LazyFrame requires resolving its schema, which is a potentially expensive operation. Using collect_schema() is the idiomatic way to resolve the schema. This property exists only for symmetry with the DataFrame class.

See also

collect_schema
Schema.dtypes

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.dtypes  
[Int64, Float64, String]

execute(

*,

optimizations: QueryOptFlags = (),

engine: EngineType = 'auto',

**_kwargs: Any,

) → QueryResult[source]

Execute the query into a QueryResult.

This method of materializing a LazyFrame makes no guarantees as to where the result is materialized. The result may reside on the GPU for the GPU engine, or on the cluster or in remote storage for the distributed engine; the streaming engine may spill the result if it needs to.

The QueryResult can always be consumed as a new LazyFrame by calling .lazy.

Parameters:

engine: Select the engine used to process the query, optional. At the moment, if set to "auto" (default), the query is run using the polars in-memory engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars in-memory engine. If set to "gpu", the GPU engine is used. Fine-grained control over the GPU engine, for example which device to use on a system with multiple devices, is possible by providing a GPUEngine object with configuration options.

Note

GPU mode is considered unstable. Not all queries will run successfully on the GPU; however, they should fall back transparently to the default engine if execution is not supported.

Running with POLARS_VERBOSE=1 will provide information if a query falls back (and why).

optimizations: The optimization passes done during query optimization.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Returns:

QueryResult

See also

explain: Print the query plan that is evaluated with collect.
profile: Collect the LazyFrame and time each node in the computation graph.
polars.collect_all: Collect multiple LazyFrames at the same time.
polars.Config.set_streaming_chunk_size: Set the size of streaming batches.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a").agg(pl.all().sum()).collect()  
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 4   ┆ 10  │
│ b   ┆ 11  ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘

Collect in streaming mode

>>> lf.group_by("a").agg(pl.all().sum()).collect(
...     engine="streaming"
... )  
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 4   ┆ 10  │
│ b   ┆ 11  ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘

Collect in GPU mode

explain( *, format: ExplainFormat = 'plain', optimized: bool = True, type_coercion: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, slice_pushdown: bool = True, comm_subplan_elim: bool = True, comm_subexpr_elim: bool = True, cluster_with_columns: bool = True, collapse_joins: bool = True, streaming: bool = False, engine: EngineType = 'auto', tree_format: bool | None = None, optimizations: QueryOptFlags = (), ) → str[source]

Create a string representation of the query plan.

Different optimizations can be turned on or off.

Parameters:

format{‘plain’, ‘tree’}

The format to use for displaying the logical plan.

optimized

Return an optimized query plan. Defaults to True. If this is set to True the subsequent optimization flags control which optimizations run.

type_coercion

Do type coercion optimization.

Deprecated since version 1.30.0: Use the optimizations parameter instead.

predicate_pushdown

Do predicate pushdown optimization.

Deprecated since version 1.30.0: Use the optimizations parameter instead.

projection_pushdown

Do projection pushdown optimization.

Deprecated since version 1.30.0: Use the optimizations parameter instead.

simplify_expression

Run simplify expressions optimization.

Deprecated since version 1.30.0: Use the optimizations parameter instead.

slice_pushdown

Do slice pushdown optimization.

Deprecated since version 1.30.0: Use the optimizations parameter instead.

comm_subplan_elim

Will try to cache branching subplans that occur on self-joins or unions.

Deprecated since version 1.30.0: Use the optimizations parameter instead.

comm_subexpr_elim

Common subexpressions will be cached and reused.

Deprecated since version 1.30.0: Use the optimizations parameter instead.

cluster_with_columns

Combine sequential independent calls to with_columns.

Deprecated since version 1.30.0: Use the optimizations parameter instead.

collapse_joins

Collapse a join and filters into a faster join.

Deprecated since version 1.30.0: Use the optimizations parameter instead.

streaming

Unused parameter, kept for backward compatibility.

Deprecated since version 1.30.0: Use the engine parameter instead.

engine

Select the engine used to process the query (default "auto"):

"auto": use the engine set by Config.set_engine_affinity or the POLARS_ENGINE_AFFINITY environment variable, falling back to "in-memory" if unset (this default may change in a future release).
"in-memory": use the in-memory engine; this is the default engine.
"streaming": use the streaming engine, which processes queries in batches, reducing memory pressure and often outperforming the in-memory engine. This will soon become the default engine of Polars.
"gpu": use the CUDA GPU engine (requires an Nvidia GPU and cudf-polars). Pass a GPUEngine object for fine-grained control (e.g. device selection on multi-GPU systems).

If the selected engine cannot run the query, Polars falls back to the in-memory engine.

Note

GPU mode is considered unstable. Not all queries will run successfully on the GPU; however, they should fall back transparently to the default engine if execution is not supported.

Running with POLARS_VERBOSE=1 will provide information if a query falls back (and why).

optimizations

The optimization passes done during query optimization.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

tree_format

Format the output as a tree.

Deprecated since version 0.20.30: Use format="tree" instead.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a", maintain_order=True).agg(pl.all().sum()).sort(
...     "a"
... ).explain()  

explode(columns: ColumnNameOrSelector | Iterable[ColumnNameOrSelector], *more_columns: ColumnNameOrSelector, empty_as_null: bool = <object object>, keep_nulls: bool = True) → LazyFrame[source]

Explode the DataFrame to long format by exploding the given columns.

Parameters:

columns: Column names, expressions, or a selector defining them. The underlying columns being exploded must be of the List or Array data type.
*more_columns: Additional names of columns to explode, specified as positional arguments.
empty_as_null: Explode an empty list/array into a null.
keep_nulls: Explode a null list/array into a null.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "letters": ["a", "a", "b", "c"],
...         "numbers": [[1], [2, 3], [4, 5], [6, 7, 8]],
...     }
... )
>>> lf.explode("numbers", empty_as_null=False).collect()
shape: (8, 2)
┌─────────┬─────────┐
│ letters ┆ numbers │
│ ---     ┆ ---     │
│ str     ┆ i64     │
╞═════════╪═════════╡
│ a       ┆ 1       │
│ a       ┆ 2       │
│ a       ┆ 3       │
│ b       ┆ 4       │
│ b       ┆ 5       │
│ c       ┆ 6       │
│ c       ┆ 7       │
│ c       ┆ 8       │
└─────────┴─────────┘

fetch(n_rows: int = 500, **kwargs: Any) → DataFrame[source]

Collect a small number of rows for debugging purposes.

Deprecated since version 1.0: Use collect() instead, in conjunction with a call to head().

Warning

This is strictly a utility function that can help to debug queries using a smaller number of rows, and should not be used in production code.

Notes

This is similar to a collect() operation, but it overwrites the number of rows read by every scan operation. Be aware that fetch does not guarantee the final number of rows in the DataFrame. Filters, join operations and fewer rows being available in the scanned data will all influence the final number of rows (joins are especially susceptible to this, and may return no data at all if n_rows is too small as the join keys may not be present).

fill_nan(value: int | float | Expr | None) → LazyFrame[source]

Fill floating point NaN values.

Parameters:

value: Value used to fill NaN values.

See also

fill_null

Notes

A NaN value is not the same as a null value. To fill null values, use fill_null().

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1.5, 2, float("nan"), 4],
...         "b": [0.5, 4, float("nan"), 13],
...     }
... )
>>> lf.fill_nan(99).collect()
shape: (4, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ f64  ┆ f64  │
╞══════╪══════╡
│ 1.5  ┆ 0.5  │
│ 2.0  ┆ 4.0  │
│ 99.0 ┆ 99.0 │
│ 4.0  ┆ 13.0 │
└──────┴──────┘

fill_null( value: Any | Expr | None = None, strategy: FillNullStrategy | None = None, limit: int | None = None, *, matches_supertype: bool = True, ) → LazyFrame[source]

Fill null values using the specified value or strategy.

Parameters:

value: Value used to fill null values.
strategy{None, ‘forward’, ‘backward’, ‘min’, ‘max’, ‘mean’, ‘zero’, ‘one’}: Strategy used to fill null values.
limit: Number of consecutive null values to fill when using the ‘forward’ or ‘backward’ strategy.
matches_supertype: Fill all matching supertypes of the fill value literal.

See also

fill_nan

Notes

A null value is not the same as a NaN value. To fill NaN values, use fill_nan().

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, None, 4],
...         "b": [0.5, 4, None, 13],
...     }
... )
>>> lf.fill_null(99).collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 99  ┆ 99.0 │
│ 4   ┆ 13.0 │
└─────┴──────┘
>>> lf.fill_null(strategy="forward").collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 2   ┆ 4.0  │
│ 4   ┆ 13.0 │
└─────┴──────┘

>>> lf.fill_null(strategy="max").collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 4   ┆ 13.0 │
│ 4   ┆ 13.0 │
└─────┴──────┘

>>> lf.fill_null(strategy="zero").collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 0   ┆ 0.0  │
│ 4   ┆ 13.0 │
└─────┴──────┘

filter(

*predicates: IntoExprColumn | Iterable[IntoExprColumn] | bool | list[bool] | np.ndarray[Any, Any],

**constraints: Any,

) → LazyFrame[source]

Filter rows in the LazyFrame based on a predicate expression.

The original order of the remaining rows is preserved.

Rows where the filter predicate does not evaluate to True are discarded (this includes rows where the predicate evaluates as null).

Parameters:

predicates: Expression that evaluates to a boolean Series.
constraints: Column filters; use name = value to filter columns using the supplied value. Each constraint behaves the same as pl.col(name).eq(value), and is implicitly joined with the other filter conditions using &.

See also

remove

Notes

If you are transitioning from Pandas, and performing filter operations based on the comparison of two or more columns, please note that in Polars any comparison involving null values will result in a null result, not boolean True or False. As a result, these rows will not be retained. Ensure that null values are handled appropriately to avoid unexpected behaviour (see examples below).

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3, None, 4, None, 0],
...         "bar": [6, 7, 8, None, None, 9, 0],
...         "ham": ["a", "b", "c", None, "d", "e", "f"],
...     }
... )

Filter on one condition:

>>> lf.filter(pl.col("foo") > 1).collect()
shape: (3, 3)
┌─────┬──────┬─────┐
│ foo ┆ bar  ┆ ham │
│ --- ┆ ---  ┆ --- │
│ i64 ┆ i64  ┆ str │
╞═════╪══════╪═════╡
│ 2   ┆ 7    ┆ b   │
│ 3   ┆ 8    ┆ c   │
│ 4   ┆ null ┆ d   │
└─────┴──────┴─────┘

Filter on multiple conditions:

>>> lf.filter((pl.col("foo") < 3) & (pl.col("ham") == "a")).collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘

Provide multiple filters using *args syntax:

>>> lf.filter(
...     pl.col("foo") == 1,
...     pl.col("ham") == "a",
... ).collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘

Provide multiple filters using **kwargs syntax:

>>> lf.filter(foo=1, ham="a").collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘

Filter on an OR condition:

>>> lf.filter(
...     (pl.col("foo") == 1) | (pl.col("ham") == "c"),
... ).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
│ 3   ┆ 8   ┆ c   │
└─────┴─────┴─────┘

Filter by comparing two columns against each other:

>>> lf.filter(
...     pl.col("foo") == pl.col("bar"),
... ).collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 0   ┆ 0   ┆ f   │
└─────┴─────┴─────┘

>>> lf.filter(
...     pl.col("foo") != pl.col("bar"),
... ).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
│ 2   ┆ 7   ┆ b   │
│ 3   ┆ 8   ┆ c   │
└─────┴─────┴─────┘

Notice how the rows containing None values are filtered out; using ne_missing ensures that null values compare equal, which gives behaviour similar to Pandas:

>>> lf.filter(
...     pl.col("foo").ne_missing(pl.col("bar")),
... ).collect()
shape: (5, 3)
┌──────┬──────┬─────┐
│ foo  ┆ bar  ┆ ham │
│ ---  ┆ ---  ┆ --- │
│ i64  ┆ i64  ┆ str │
╞══════╪══════╪═════╡
│ 1    ┆ 6    ┆ a   │
│ 2    ┆ 7    ┆ b   │
│ 3    ┆ 8    ┆ c   │
│ 4    ┆ null ┆ d   │
│ null ┆ 9    ┆ e   │
└──────┴──────┴─────┘

first() → LazyFrame[source]

Get the first row of the DataFrame.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> lf.first().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘

Selects rows from this LazyFrame at the given indices.

Warning

This functionality is experimental. It may be changed at any point without it being considered a breaking change.

Parameters:

indices

The indices of the rows to select.

Due to the lack of a LazySeries it’s permitted to pass a single-width LazyFrame as indices as well.

null_on_oob

If true when an index is out-of-bounds a null row will be generated instead of raising an error.

Examples

>>> lf = pl.LazyFrame({"x": [2, 1, 0], "s": ["foo", "bar", "baz"]})
>>> lf.gather([2, 0, 0]).collect()
shape: (3, 2)
┌─────┬─────┐
│ x   ┆ s   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 0   ┆ baz │
│ 2   ┆ foo │
│ 2   ┆ foo │
└─────┴─────┘

>>> lf.gather([0, 10, 1], null_on_oob=True).collect()
shape: (3, 2)
┌──────┬──────┐
│ x    ┆ s    │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 2    ┆ foo  │
│ null ┆ null │
│ 1    ┆ bar  │
└──────┴──────┘

>>> idxs = pl.LazyFrame({"i": [1, 10, 0], "b": [True, False, True]})
>>> lf.gather(idxs.filter(pl.col.b).select(pl.col.i)).collect()
shape: (2, 2)
┌─────┬─────┐
│ x   ┆ s   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ bar │
│ 2   ┆ foo │
└─────┴─────┘

gather_every(n: int, offset: int = 0) → LazyFrame[source]

Take every nth row in the LazyFrame and return as a new LazyFrame.

Parameters:

n: Gather every n-th row.
offset: Starting index.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [5, 6, 7, 8],
...     }
... )
>>> lf.gather_every(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 5   │
│ 3   ┆ 7   │
└─────┴─────┘
>>> lf.gather_every(2, offset=1).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 6   │
│ 4   ┆ 8   │
└─────┴─────┘

group_by(

*by: IntoExpr | Iterable[IntoExpr],

maintain_order: bool = False,

**named_by: IntoExpr,

) → LazyGroupBy[source]

Start a group by operation.

Parameters:

*by: Column(s) to group by. Accepts expression input. Strings are parsed as column names.
maintain_order: Ensure that the order of the groups is consistent with the input data. This is slower than a default group by. Setting this to True blocks the possibility to run on the streaming engine.
**named_by: Additional columns to group by, specified as keyword arguments. The columns will be renamed to the keyword used.

Examples

Group by one column and call agg to compute the grouped sum of another column.

>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "c"],
...         "b": [1, 2, 1, 3, 3],
...         "c": [5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a").agg(pl.col("b").sum()).collect()  
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 2   │
│ b   ┆ 5   │
│ c   ┆ 3   │
└─────┴─────┘

Set maintain_order=True to ensure the order of the groups is consistent with the input.

>>> lf.group_by("a", maintain_order=True).agg(pl.col("c")).collect()
shape: (3, 2)
┌─────┬───────────┐
│ a   ┆ c         │
│ --- ┆ ---       │
│ str ┆ list[i64] │
╞═════╪═══════════╡
│ a   ┆ [5, 3]    │
│ b   ┆ [4, 2]    │
│ c   ┆ [1]       │
└─────┴───────────┘

Group by multiple columns by passing a list of column names.

>>> lf.group_by(["a", "b"]).agg(pl.max("c")).collect()  
shape: (4, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 1   ┆ 5   │
│ b   ┆ 2   ┆ 4   │
│ b   ┆ 3   ┆ 2   │
│ c   ┆ 3   ┆ 1   │
└─────┴─────┴─────┘

Or use positional arguments to group by multiple columns in the same way. Expressions are also accepted.

>>> lf.group_by("a", pl.col("b") // 2).agg(
...     pl.col("c").mean()
... ).collect()  
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞═════╪═════╪═════╡
│ a   ┆ 0   ┆ 4.0 │
│ b   ┆ 1   ┆ 3.0 │
│ c   ┆ 1   ┆ 1.0 │
└─────┴─────┴─────┘

group_by_dynamic( index_column: IntoExpr, *, every: str | timedelta, period: str | timedelta | None = None, offset: str | timedelta | None = None, include_boundaries: bool = False, closed: ClosedInterval = 'left', label: Label = 'left', group_by: IntoExpr | Iterable[IntoExpr] | None = None, start_by: StartBy = 'window', ) → LazyGroupBy[source]

Group based on a time value (or index value of type Int32, Int64).

Time windows are calculated and rows are assigned to windows. Unlike in a normal group by, a row can be a member of multiple groups. By default, the windows look like:

[start, start + period)
[start + every, start + every + period)
[start + 2*every, start + 2*every + period)
…

where start is determined by start_by, offset, every, and the earliest datapoint. See the start_by argument description for details.

Warning

The index column must be sorted in ascending order. If group_by is passed, then the index column must be sorted in ascending order within each group.

Changed in version 0.20.14: The by parameter was renamed group_by.

Parameters:

index_column

Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if group_by is specified, then it must be sorted in ascending order within each group).

In case of a dynamic group by on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.

every

Interval of the window.

period

Length of the window; if None, it will equal ‘every’.

offset

Offset of the window; does not take effect if start_by is ‘datapoint’. Defaults to zero.

include_boundaries

Add the lower and upper bound of the window to the “_lower_boundary” and “_upper_boundary” columns. This will impact performance because it’s harder to parallelize.

closed{‘left’, ‘right’, ‘both’, ‘none’}

Define which sides of the temporal interval are closed (inclusive).

label{‘left’, ‘right’, ‘datapoint’}

Define which label to use for the window:

‘left’: lower boundary of the window
‘right’: upper boundary of the window
‘datapoint’: the first value of the index column in the given window. If you don’t need the label to be at one of the boundaries, choose this option for maximum performance

group_by

Also group by this column/these columns.

start_by{‘window’, ‘datapoint’, ‘monday’, ‘tuesday’, ‘wednesday’, ‘thursday’, ‘friday’, ‘saturday’, ‘sunday’}

The strategy to determine the start of the first window by.

‘window’: Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.
‘datapoint’: Start from the first encountered data point.
a day of the week (only takes effect if every contains 'w'):
- ‘monday’: Start the window on the Monday before the first data point.
- ‘tuesday’: Start the window on the Tuesday before the first data point.
- …
- ‘sunday’: Start the window on the Sunday before the first data point.
The resulting window is then shifted back until the earliest datapoint is in or in front of it.

Returns:

LazyGroupBy: Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if group_by columns are passed, it will only be sorted within each group).

See also

rolling

Notes

If you’re coming from pandas, then
```
# polars
df.group_by_dynamic("ts", every="1d").agg(pl.col("value").sum())
```
is equivalent to
```
# pandas
df.set_index("ts").resample("D")["value"].sum().reset_index()
```
though note that, unlike pandas, polars doesn’t add extra rows for empty windows. If you need index_column to be evenly spaced, then please combine with DataFrame.upsample().

The every, period and offset arguments are created with the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 calendar day)
- 1w (1 calendar week)
- 1mo (1 calendar month)
- 1q (1 calendar quarter)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them (except in every): “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds

By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.

In case of a group_by_dynamic on an integer column, the windows are defined by:
- “1i” # length 1
- “10i” # length 10

Examples

>>> from datetime import datetime
>>> lf = pl.LazyFrame(
...     {
...         "time": pl.datetime_range(
...             start=datetime(2021, 12, 16),
...             end=datetime(2021, 12, 16, 3),
...             interval="30m",
...             eager=True,
...         ),
...         "n": range(7),
...     }
... )
>>> lf.collect()
shape: (7, 2)
┌─────────────────────┬─────┐
│ time                ┆ n   │
│ ---                 ┆ --- │
│ datetime[μs]        ┆ i64 │
╞═════════════════════╪═════╡
│ 2021-12-16 00:00:00 ┆ 0   │
│ 2021-12-16 00:30:00 ┆ 1   │
│ 2021-12-16 01:00:00 ┆ 2   │
│ 2021-12-16 01:30:00 ┆ 3   │
│ 2021-12-16 02:00:00 ┆ 4   │
│ 2021-12-16 02:30:00 ┆ 5   │
│ 2021-12-16 03:00:00 ┆ 6   │
└─────────────────────┴─────┘

Group by windows of 1 hour.

>>> lf.group_by_dynamic("time", every="1h", closed="right").agg(
...     pl.col("n")
... ).collect()
shape: (4, 2)
┌─────────────────────┬───────────┐
│ time                ┆ n         │
│ ---                 ┆ ---       │
│ datetime[μs]        ┆ list[i64] │
╞═════════════════════╪═══════════╡
│ 2021-12-15 23:00:00 ┆ [0]       │
│ 2021-12-16 00:00:00 ┆ [1, 2]    │
│ 2021-12-16 01:00:00 ┆ [3, 4]    │
│ 2021-12-16 02:00:00 ┆ [5, 6]    │
└─────────────────────┴───────────┘

The window boundaries can also be added to the aggregation result:

>>> lf.group_by_dynamic(
...     "time", every="1h", include_boundaries=True, closed="right"
... ).agg(pl.col("n").mean()).collect()
shape: (4, 4)
┌─────────────────────┬─────────────────────┬─────────────────────┬─────┐
│ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ n   │
│ ---                 ┆ ---                 ┆ ---                 ┆ --- │
│ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ f64 │
╞═════════════════════╪═════════════════════╪═════════════════════╪═════╡
│ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 0.0 │
│ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 1.5 │
│ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 3.5 │
│ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 5.5 │
└─────────────────────┴─────────────────────┴─────────────────────┴─────┘

When closed="left", the window excludes the right end of the interval: [lower_bound, upper_bound)

>>> lf.group_by_dynamic("time", every="1h", closed="left").agg(
...     pl.col("n")
... ).collect()
shape: (4, 2)
┌─────────────────────┬───────────┐
│ time                ┆ n         │
│ ---                 ┆ ---       │
│ datetime[μs]        ┆ list[i64] │
╞═════════════════════╪═══════════╡
│ 2021-12-16 00:00:00 ┆ [0, 1]    │
│ 2021-12-16 01:00:00 ┆ [2, 3]    │
│ 2021-12-16 02:00:00 ┆ [4, 5]    │
│ 2021-12-16 03:00:00 ┆ [6]       │
└─────────────────────┴───────────┘

When closed="both", the time values at the window boundaries belong to two groups.

>>> lf.group_by_dynamic("time", every="1h", closed="both").agg(
...     pl.col("n")
... ).collect()
shape: (4, 2)
┌─────────────────────┬───────────┐
│ time                ┆ n         │
│ ---                 ┆ ---       │
│ datetime[μs]        ┆ list[i64] │
╞═════════════════════╪═══════════╡
│ 2021-12-16 00:00:00 ┆ [0, 1, 2] │
│ 2021-12-16 01:00:00 ┆ [2, 3, 4] │
│ 2021-12-16 02:00:00 ┆ [4, 5, 6] │
│ 2021-12-16 03:00:00 ┆ [6]       │
└─────────────────────┴───────────┘

Dynamic group bys can also be combined with grouping on normal keys.

>>> lf = lf.with_columns(groups=pl.Series(["a", "a", "a", "b", "b", "a", "a"]))
>>> lf.collect()
shape: (7, 3)
┌─────────────────────┬─────┬────────┐
│ time                ┆ n   ┆ groups │
│ ---                 ┆ --- ┆ ---    │
│ datetime[μs]        ┆ i64 ┆ str    │
╞═════════════════════╪═════╪════════╡
│ 2021-12-16 00:00:00 ┆ 0   ┆ a      │
│ 2021-12-16 00:30:00 ┆ 1   ┆ a      │
│ 2021-12-16 01:00:00 ┆ 2   ┆ a      │
│ 2021-12-16 01:30:00 ┆ 3   ┆ b      │
│ 2021-12-16 02:00:00 ┆ 4   ┆ b      │
│ 2021-12-16 02:30:00 ┆ 5   ┆ a      │
│ 2021-12-16 03:00:00 ┆ 6   ┆ a      │
└─────────────────────┴─────┴────────┘
>>> lf.group_by_dynamic(
...     "time",
...     every="1h",
...     closed="both",
...     group_by="groups",
...     include_boundaries=True,
... ).agg(pl.col("n")).collect()
shape: (6, 5)
┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────────┐
│ groups ┆ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ n         │
│ ---    ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---       │
│ str    ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ list[i64] │
╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪═══════════╡
│ a      ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ [0, 1, 2] │
│ a      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ [2]       │
│ a      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ [5, 6]    │
│ a      ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ [6]       │
│ b      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ [3, 4]    │
│ b      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ [4]       │
└────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────────┘

Dynamic group by on an index column

>>> lf = pl.LazyFrame(
...     {
...         "idx": pl.int_range(0, 6, eager=True),
...         "A": ["A", "A", "B", "B", "B", "C"],
...     }
... )
>>> lf.group_by_dynamic(
...     "idx",
...     every="2i",
...     period="3i",
...     include_boundaries=True,
...     closed="right",
... ).agg(pl.col("A").alias("A_agg_list")).collect()
shape: (4, 4)
┌─────────────────┬─────────────────┬─────┬─────────────────┐
│ _lower_boundary ┆ _upper_boundary ┆ idx ┆ A_agg_list      │
│ ---             ┆ ---             ┆ --- ┆ ---             │
│ i64             ┆ i64             ┆ i64 ┆ list[str]       │
╞═════════════════╪═════════════════╪═════╪═════════════════╡
│ -2              ┆ 1               ┆ -2  ┆ ["A", "A"]      │
│ 0               ┆ 3               ┆ 0   ┆ ["A", "B", "B"] │
│ 2               ┆ 5               ┆ 2   ┆ ["B", "B", "C"] │
│ 4               ┆ 7               ┆ 4   ┆ ["C"]           │
└─────────────────┴─────────────────┴─────┴─────────────────┘

head(n: int = 5) → LazyFrame[source]

Get the first n rows.

Parameters:

n: Number of rows to return.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [7, 8, 9, 10, 11, 12],
...     }
... )
>>> lf.head().collect()
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
│ 3   ┆ 9   │
│ 4   ┆ 10  │
│ 5   ┆ 11  │
└─────┴─────┘
>>> lf.head(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
└─────┴─────┘

inspect(fmt: str = '{}') → LazyFrame[source]

Inspect a node in the computation graph.

Print the value that this node in the computation graph evaluates to and pass on the value.

Examples

>>> lf = pl.LazyFrame({"foo": [1, 1, -2, 3]})
>>> (
...     lf.with_columns(pl.col("foo").cum_sum().alias("bar"))
...     .inspect()  # print the node before the filter
...     .filter(pl.col("bar") == pl.col("foo"))
... )
<LazyFrame at ...>

interpolate() → LazyFrame[source]

Interpolate intermediate values. The interpolation method is linear.

Nulls at the beginning and end of the series remain null.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, None, 9, 10],
...         "bar": [6, 7, 9, None],
...         "baz": [1, None, None, 9],
...     }
... )
>>> lf.interpolate().collect()
shape: (4, 3)
┌──────┬──────┬──────────┐
│ foo  ┆ bar  ┆ baz      │
│ ---  ┆ ---  ┆ ---      │
│ f64  ┆ f64  ┆ f64      │
╞══════╪══════╪══════════╡
│ 1.0  ┆ 6.0  ┆ 1.0      │
│ 5.0  ┆ 7.0  ┆ 3.666667 │
│ 9.0  ┆ 9.0  ┆ 6.333333 │
│ 10.0 ┆ null ┆ 9.0      │
└──────┴──────┴──────────┘

Add a join operation to the Logical Plan.

Changed in version 1.24: The join_nulls parameter was renamed nulls_equal.

Parameters:

other

Lazy DataFrame to join with.

on

Name(s) of the join columns in both DataFrames. If set, left_on and right_on should be None. This should not be specified if how='cross'.

how: {‘inner’, ‘left’, ‘right’, ‘full’, ‘semi’, ‘anti’, ‘cross’}

Join strategy.

inner	(Default) Returns rows that have matching values in both tables.
left	Returns all rows from the left table, and the matched rows from the right table.
right	Returns all rows from the right table, and the matched rows from the left table.
full	Returns all rows from both tables, joining matching rows and filling non-matches with null values.
cross	Returns the Cartesian product of rows from both tables
semi	Returns rows from the left table that have a match in the right table. Does not return columns from the right table.
anti	Returns rows from the left table that have no match in the right table. Does not return columns from the right table.

left_on

Join column of the left DataFrame.

right_on

Join column of the right DataFrame.

suffix

Suffix to append to columns with a duplicate name.

validate: {‘m:m’, ‘m:1’, ‘1:m’, ‘1:1’}

Checks if join is of specified type.

m:m	(Default) Many-to-many. Does not result in checks.
1:1	One-to-one. Checks if join keys are unique in both left and right datasets.
1:m	One-to-many. Checks if join keys are unique in left dataset.
m:1	Many-to-one. Checks if join keys are unique in right dataset.

nulls_equal

Join on null values. By default null values will never produce matches.

coalesce

Coalescing behavior (merging of join columns).

None	(Default) Coalesce unless `how='full'` is specified.
True	Always coalesce join columns.
False	Never coalesce join columns.

Note

Joining on any other expressions than col will turn off coalescing.

maintain_order: {‘none’, ‘left’, ‘right’, ‘left_right’, ‘right_left’}

Which DataFrame row order to preserve, if any. Do not rely on any observed ordering without explicitly setting this parameter, as your code may break in a future release. Not specifying any ordering can improve performance.

none	(Default) No specific ordering is desired. The ordering might differ across Polars versions or even between different runs.
left	Preserves the order of the left DataFrame.
right	Preserves the order of the right DataFrame.
left_right	First preserves the order of the left DataFrame, then the right.
right_left	First preserves the order of the right DataFrame, then the left.

build_side: {‘auto’, ‘prefer_left’, ‘prefer_right’, ‘force_left’, ‘force_right’}

Which side of the join will be used as the build side. This side will likely be held in memory as a hash table. Note that unless a force_ variant is chosen, the chosen side might differ across Polars versions or even between different runs.

auto	(Default) Let Polars figure out the build side.
prefer_left	Unless there’s a very good reason to believe that the right side is smaller, use the left side.
prefer_right	Unless there’s a very good reason to believe that the left side is smaller, use the right side.
force_left	Always use the left side.
force_right	Always use the right side.

Warning

This functionality is considered experimental. It may be removed or changed at any point without it being considered a breaking change.

allow_parallel

Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.

force_parallel

Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

See also

join_asof

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> other_lf = pl.LazyFrame(
...     {
...         "apple": ["x", "y", "z"],
...         "ham": ["a", "b", "d"],
...     }
... )
>>> lf.join(other_lf, on="ham").collect()
shape: (2, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     │
│ 2   ┆ 7.0 ┆ b   ┆ y     │
└─────┴─────┴─────┴───────┘
>>> lf.join(other_lf, on="ham", how="full").collect()
shape: (4, 5)
┌──────┬──────┬──────┬───────┬───────────┐
│ foo  ┆ bar  ┆ ham  ┆ apple ┆ ham_right │
│ ---  ┆ ---  ┆ ---  ┆ ---   ┆ ---       │
│ i64  ┆ f64  ┆ str  ┆ str   ┆ str       │
╞══════╪══════╪══════╪═══════╪═══════════╡
│ 1    ┆ 6.0  ┆ a    ┆ x     ┆ a         │
│ 2    ┆ 7.0  ┆ b    ┆ y     ┆ b         │
│ null ┆ null ┆ null ┆ z     ┆ d         │
│ 3    ┆ 8.0  ┆ c    ┆ null  ┆ null      │
└──────┴──────┴──────┴───────┴───────────┘
>>> lf.join(other_lf, on="ham", how="left", coalesce=True).collect()
shape: (3, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     │
│ 2   ┆ 7.0 ┆ b   ┆ y     │
│ 3   ┆ 8.0 ┆ c   ┆ null  │
└─────┴─────┴─────┴───────┘
>>> lf.join(other_lf, on="ham", how="semi").collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6.0 ┆ a   │
│ 2   ┆ 7.0 ┆ b   │
└─────┴─────┴─────┘
>>> lf.join(other_lf, on="ham", how="anti").collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8.0 ┆ c   │
└─────┴─────┴─────┘

>>> lf.join(other_lf, how="cross").collect()
shape: (9, 5)
┌─────┬─────┬─────┬───────┬───────────┐
│ foo ┆ bar ┆ ham ┆ apple ┆ ham_right │
│ --- ┆ --- ┆ --- ┆ ---   ┆ ---       │
│ i64 ┆ f64 ┆ str ┆ str   ┆ str       │
╞═════╪═════╪═════╪═══════╪═══════════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     ┆ a         │
│ 1   ┆ 6.0 ┆ a   ┆ y     ┆ b         │
│ 1   ┆ 6.0 ┆ a   ┆ z     ┆ d         │
│ 2   ┆ 7.0 ┆ b   ┆ x     ┆ a         │
│ 2   ┆ 7.0 ┆ b   ┆ y     ┆ b         │
│ 2   ┆ 7.0 ┆ b   ┆ z     ┆ d         │
│ 3   ┆ 8.0 ┆ c   ┆ x     ┆ a         │
│ 3   ┆ 8.0 ┆ c   ┆ y     ┆ b         │
│ 3   ┆ 8.0 ┆ c   ┆ z     ┆ d         │
└─────┴─────┴─────┴───────┴───────────┘

Perform an asof join.

This is similar to a left-join except that we match on nearest key rather than equal keys.

Both DataFrames must be sorted by the on key (within each by group, if specified).

For each row in the left DataFrame:

A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.

A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.

A “nearest” search selects the last row in the right DataFrame whose value is nearest to the left’s key. String keys are not currently supported for a nearest search.

The default is “backward”.

Parameters:

other

Lazy DataFrame to join with.

left_on

Join column of the left DataFrame.

right_on

Join column of the right DataFrame.

on

Join column of both DataFrames. If set, left_on and right_on should be None.

by_left

Join on these columns before doing asof join.

by_right

Join on these columns before doing asof join.

by

Join on these columns before doing asof join.

strategy: {‘backward’, ‘forward’, ‘nearest’}

Join strategy.

suffix

Suffix to append to columns with a duplicate name.

tolerance

Numeric tolerance. By setting this the join will only be done if the near keys are within this distance. If an asof join is done on columns of dtype “Date”, “Datetime”, “Duration” or “Time”, use either a datetime.timedelta object or the following string language:

1ns (1 nanosecond)

1us (1 microsecond)

1ms (1 millisecond)

1s (1 second)

1m (1 minute)

1h (1 hour)

1d (1 calendar day)

1w (1 calendar week)

1mo (1 calendar month)

1q (1 calendar quarter)

1y (1 calendar year)

Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds

By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings - in cases of ambiguity, we follow RFC-5545 and preserve the DST fold of the original datetime). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.

allow_parallel

Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.

force_parallel

Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

coalesce

Coalescing behavior (merging of on / left_on / right_on columns):

True: -> Always coalesce join columns.
False: -> Never coalesce join columns.

Note that joining on any other expressions than col will turn off coalescing.

allow_exact_matches

Whether exact matches are valid join predicates.

If True, allow matching with the same on value
(i.e. less-than-or-equal-to / greater-than-or-equal-to)
If False, don’t match the same on value
(i.e., strictly less-than / strictly greater-than).

check_sortedness

Check the sortedness of the asof keys. If the keys are not sorted Polars will error. Currently, the in-memory engine cannot check the sortedness if ‘by’ groups are provided. The streaming engine will only check the sortedness of the rows it processes.

Notes

If ‘by’ is set, the implementation will compute the asof join over all of the groups concurrently. This can potentially lead to high memory usage if there are many groups.

This can be mitigated by sorting (via .sort()) both of the input LazyFrames by the ‘by’ keys (or using .set_sorted() if the columns are already sorted) before computing the join operation; and using the streaming engine to collect the results. For example:

>>> # Compute streaming asof join with 'by' groups
>>> result = (
...     left.sort("by", "on").join_asof(  # Sort left manually
...         right.set_sorted("by", "on"),  # Set right as already sorted
...         on="on",
...         by="by",
...     )
... ).collect(engine="streaming")

Examples

>>> from datetime import date
>>> gdp = pl.LazyFrame(
...     {
...         "date": pl.date_range(
...             date(2016, 1, 1),
...             date(2020, 1, 1),
...             "1y",
...             eager=True,
...         ),
...         "gdp": [4164, 4411, 4566, 4696, 4827],
...     }
... )
>>> gdp.collect()
shape: (5, 2)
┌────────────┬──────┐
│ date       ┆ gdp  │
│ ---        ┆ ---  │
│ date       ┆ i64  │
╞════════════╪══════╡
│ 2016-01-01 ┆ 4164 │
│ 2017-01-01 ┆ 4411 │
│ 2018-01-01 ┆ 4566 │
│ 2019-01-01 ┆ 4696 │
│ 2020-01-01 ┆ 4827 │
└────────────┴──────┘

>>> population = pl.LazyFrame(
...     {
...         "date": [date(2016, 3, 1), date(2018, 8, 1), date(2019, 1, 1)],
...         "population": [82.19, 82.66, 83.12],
...     }
... ).sort("date")
>>> population.collect()
shape: (3, 2)
┌────────────┬────────────┐
│ date       ┆ population │
│ ---        ┆ ---        │
│ date       ┆ f64        │
╞════════════╪════════════╡
│ 2016-03-01 ┆ 82.19      │
│ 2018-08-01 ┆ 82.66      │
│ 2019-01-01 ┆ 83.12      │
└────────────┴────────────┘

Note how the dates don’t quite match. If we join them using join_asof and strategy='backward', then each date from population which doesn’t have an exact match is matched with the closest earlier date from gdp:

>>> population.join_asof(gdp, on="date", strategy="backward").collect()
shape: (3, 3)
┌────────────┬────────────┬──────┐
│ date       ┆ population ┆ gdp  │
│ ---        ┆ ---        ┆ ---  │
│ date       ┆ f64        ┆ i64  │
╞════════════╪════════════╪══════╡
│ 2016-03-01 ┆ 82.19      ┆ 4164 │
│ 2018-08-01 ┆ 82.66      ┆ 4566 │
│ 2019-01-01 ┆ 83.12      ┆ 4696 │
└────────────┴────────────┴──────┘

Note how:

date 2016-03-01 from population is matched with 2016-01-01 from gdp;
date 2018-08-01 from population is matched with 2018-01-01 from gdp.

You can verify this by passing coalesce=False:

>>> population.join_asof(
...     gdp, on="date", strategy="backward", coalesce=False
... ).collect()
shape: (3, 4)
┌────────────┬────────────┬────────────┬──────┐
│ date       ┆ population ┆ date_right ┆ gdp  │
│ ---        ┆ ---        ┆ ---        ┆ ---  │
│ date       ┆ f64        ┆ date       ┆ i64  │
╞════════════╪════════════╪════════════╪══════╡
│ 2016-03-01 ┆ 82.19      ┆ 2016-01-01 ┆ 4164 │
│ 2018-08-01 ┆ 82.66      ┆ 2018-01-01 ┆ 4566 │
│ 2019-01-01 ┆ 83.12      ┆ 2019-01-01 ┆ 4696 │
└────────────┴────────────┴────────────┴──────┘

If we instead use strategy='forward', then each date from population which doesn’t have an exact match is matched with the closest later date from gdp:

>>> population.join_asof(gdp, on="date", strategy="forward").collect()
shape: (3, 3)
┌────────────┬────────────┬──────┐
│ date       ┆ population ┆ gdp  │
│ ---        ┆ ---        ┆ ---  │
│ date       ┆ f64        ┆ i64  │
╞════════════╪════════════╪══════╡
│ 2016-03-01 ┆ 82.19      ┆ 4411 │
│ 2018-08-01 ┆ 82.66      ┆ 4696 │
│ 2019-01-01 ┆ 83.12      ┆ 4696 │
└────────────┴────────────┴──────┘

Note how:

date 2016-03-01 from population is matched with 2017-01-01 from gdp;
date 2018-08-01 from population is matched with 2019-01-01 from gdp.

Finally, strategy='nearest' gives us a mix of the two results above, as each date from population which doesn’t have an exact match is matched with the closest date from gdp, regardless of whether it’s earlier or later:

>>> population.join_asof(gdp, on="date", strategy="nearest").collect()
shape: (3, 3)
┌────────────┬────────────┬──────┐
│ date       ┆ population ┆ gdp  │
│ ---        ┆ ---        ┆ ---  │
│ date       ┆ f64        ┆ i64  │
╞════════════╪════════════╪══════╡
│ 2016-03-01 ┆ 82.19      ┆ 4164 │
│ 2018-08-01 ┆ 82.66      ┆ 4696 │
│ 2019-01-01 ┆ 83.12      ┆ 4696 │
└────────────┴────────────┴──────┘

Note how:

date 2016-03-01 from population is matched with 2016-01-01 from gdp;
date 2018-08-01 from population is matched with 2019-01-01 from gdp.

The by argument allows joining on another column first, before the asof join. In this example we join by country first, then asof join by date, as above.

>>> gdp_dates = pl.date_range(  # fmt: skip
...     date(2016, 1, 1), date(2020, 1, 1), "1y", eager=True
... )
>>> gdp2 = pl.LazyFrame(
...     {
...         "country": ["Germany"] * 5 + ["Netherlands"] * 5,
...         "date": pl.concat([gdp_dates, gdp_dates]),
...         "gdp": [4164, 4411, 4566, 4696, 4827, 784, 833, 914, 910, 909],
...     }
... ).sort("country", "date")
>>>
>>> gdp2.collect()
shape: (10, 3)
┌─────────────┬────────────┬──────┐
│ country     ┆ date       ┆ gdp  │
│ ---         ┆ ---        ┆ ---  │
│ str         ┆ date       ┆ i64  │
╞═════════════╪════════════╪══════╡
│ Germany     ┆ 2016-01-01 ┆ 4164 │
│ Germany     ┆ 2017-01-01 ┆ 4411 │
│ Germany     ┆ 2018-01-01 ┆ 4566 │
│ Germany     ┆ 2019-01-01 ┆ 4696 │
│ Germany     ┆ 2020-01-01 ┆ 4827 │
│ Netherlands ┆ 2016-01-01 ┆ 784  │
│ Netherlands ┆ 2017-01-01 ┆ 833  │
│ Netherlands ┆ 2018-01-01 ┆ 914  │
│ Netherlands ┆ 2019-01-01 ┆ 910  │
│ Netherlands ┆ 2020-01-01 ┆ 909  │
└─────────────┴────────────┴──────┘
>>> pop2 = pl.LazyFrame(
...     {
...         "country": ["Germany"] * 3 + ["Netherlands"] * 3,
...         "date": [
...             date(2016, 3, 1),
...             date(2018, 8, 1),
...             date(2019, 1, 1),
...             date(2016, 3, 1),
...             date(2018, 8, 1),
...             date(2019, 1, 1),
...         ],
...         "population": [82.19, 82.66, 83.12, 17.11, 17.32, 17.40],
...     }
... ).sort("country", "date")
>>>
>>> pop2.collect()
shape: (6, 3)
┌─────────────┬────────────┬────────────┐
│ country     ┆ date       ┆ population │
│ ---         ┆ ---        ┆ ---        │
│ str         ┆ date       ┆ f64        │
╞═════════════╪════════════╪════════════╡
│ Germany     ┆ 2016-03-01 ┆ 82.19      │
│ Germany     ┆ 2018-08-01 ┆ 82.66      │
│ Germany     ┆ 2019-01-01 ┆ 83.12      │
│ Netherlands ┆ 2016-03-01 ┆ 17.11      │
│ Netherlands ┆ 2018-08-01 ┆ 17.32      │
│ Netherlands ┆ 2019-01-01 ┆ 17.4       │
└─────────────┴────────────┴────────────┘
>>> pop2.join_asof(gdp2, by="country", on="date", strategy="nearest").collect()
shape: (6, 4)
┌─────────────┬────────────┬────────────┬──────┐
│ country     ┆ date       ┆ population ┆ gdp  │
│ ---         ┆ ---        ┆ ---        ┆ ---  │
│ str         ┆ date       ┆ f64        ┆ i64  │
╞═════════════╪════════════╪════════════╪══════╡
│ Germany     ┆ 2016-03-01 ┆ 82.19      ┆ 4164 │
│ Germany     ┆ 2018-08-01 ┆ 82.66      ┆ 4696 │
│ Germany     ┆ 2019-01-01 ┆ 83.12      ┆ 4696 │
│ Netherlands ┆ 2016-03-01 ┆ 17.11      ┆ 784  │
│ Netherlands ┆ 2018-08-01 ┆ 17.32      ┆ 910  │
│ Netherlands ┆ 2019-01-01 ┆ 17.4       ┆ 910  │
└─────────────┴────────────┴────────────┴──────┘

join_where( other: LazyFrame, *predicates: Expr | Iterable[Expr], suffix: str = '_right', ) → LazyFrame[source]

Perform a join based on one or multiple (in)equality predicates.

This performs an inner join, so only rows where all predicates are true are included in the result, and a row from either DataFrame may be included multiple times in the result.

Note

The row order of the input DataFrames is not preserved.

Warning

This functionality is experimental. It may be changed at any point without it being considered a breaking change.

Parameters:

other: LazyFrame to join with.
*predicates: (In)Equality condition to join the two tables on. When a column name occurs in both tables, the proper suffix must be applied in the predicate.
suffix: Suffix to append to columns with a duplicate name.

Examples

Join two LazyFrames together based on two predicates which get AND-ed together.

>>> east = pl.LazyFrame(
...     {
...         "id": [100, 101, 102],
...         "dur": [120, 140, 160],
...         "rev": [12, 14, 16],
...         "cores": [2, 8, 4],
...     }
... )
>>> west = pl.LazyFrame(
...     {
...         "t_id": [404, 498, 676, 742],
...         "time": [90, 130, 150, 170],
...         "cost": [9, 13, 15, 16],
...         "cores": [4, 2, 1, 4],
...     }
... )
>>> east.join_where(
...     west,
...     pl.col("dur") < pl.col("time"),
...     pl.col("rev") < pl.col("cost"),
... ).collect()
shape: (5, 8)
┌─────┬─────┬─────┬───────┬──────┬──────┬──────┬─────────────┐
│ id  ┆ dur ┆ rev ┆ cores ┆ t_id ┆ time ┆ cost ┆ cores_right │
│ --- ┆ --- ┆ --- ┆ ---   ┆ ---  ┆ ---  ┆ ---  ┆ ---         │
│ i64 ┆ i64 ┆ i64 ┆ i64   ┆ i64  ┆ i64  ┆ i64  ┆ i64         │
╞═════╪═════╪═════╪═══════╪══════╪══════╪══════╪═════════════╡
│ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 498  ┆ 130  ┆ 13   ┆ 2           │
│ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 676  ┆ 150  ┆ 15   ┆ 1           │
│ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
│ 101 ┆ 140 ┆ 14  ┆ 8     ┆ 676  ┆ 150  ┆ 15   ┆ 1           │
│ 101 ┆ 140 ┆ 14  ┆ 8     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
└─────┴─────┴─────┴───────┴──────┴──────┴──────┴─────────────┘

To OR them together, use a single expression and the | operator.

>>> east.join_where(
...     west,
...     (pl.col("dur") < pl.col("time")) | (pl.col("rev") < pl.col("cost")),
... ).collect()
shape: (6, 8)
┌─────┬─────┬─────┬───────┬──────┬──────┬──────┬─────────────┐
│ id  ┆ dur ┆ rev ┆ cores ┆ t_id ┆ time ┆ cost ┆ cores_right │
│ --- ┆ --- ┆ --- ┆ ---   ┆ ---  ┆ ---  ┆ ---  ┆ ---         │
│ i64 ┆ i64 ┆ i64 ┆ i64   ┆ i64  ┆ i64  ┆ i64  ┆ i64         │
╞═════╪═════╪═════╪═══════╪══════╪══════╪══════╪═════════════╡
│ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 498  ┆ 130  ┆ 13   ┆ 2           │
│ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 676  ┆ 150  ┆ 15   ┆ 1           │
│ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
│ 101 ┆ 140 ┆ 14  ┆ 8     ┆ 676  ┆ 150  ┆ 15   ┆ 1           │
│ 101 ┆ 140 ┆ 14  ┆ 8     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
│ 102 ┆ 160 ┆ 16  ┆ 4     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
└─────┴─────┴─────┴───────┴──────┴──────┴──────┴─────────────┘

last() → LazyFrame[source]

Get the last row of the DataFrame.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 5, 3],
...         "b": [2, 4, 6],
...     }
... )
>>> lf.last().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 6   │
└─────┴─────┘

lazy() → LazyFrame[source]

Return lazy representation, i.e. itself.

Useful for writing code that expects either a DataFrame or LazyFrame. On LazyFrame this is a no-op, and returns the same object.

Returns:

LazyFrame

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [None, 2, 3, 4],
...         "b": [0.5, None, 2.5, 13],
...         "c": [True, True, False, None],
...     }
... )
>>> lf.lazy()
<LazyFrame at ...>

limit(n: int = 5) → LazyFrame[source]

Get the first n rows.

Alias for LazyFrame.head().

Parameters:

n: Number of rows to return.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [7, 8, 9, 10, 11, 12],
...     }
... )
>>> lf.limit().collect()
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
│ 3   ┆ 9   │
│ 4   ┆ 10  │
│ 5   ┆ 11  │
└─────┴─────┘
>>> lf.limit(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
└─────┴─────┘

map_batches( function: Callable[[DataFrame], DataFrame], *, predicate_pushdown: bool = False, projection_pushdown: bool = False, slice_pushdown: bool = False, no_optimizations: bool | None = None, schema: None | SchemaDict = None, validate_output_schema: bool = True, streamable: bool = False, ) → LazyFrame[source]

Apply a custom function.

It is important that the function returns a Polars DataFrame.

Parameters:

function: Lambda/function to apply.
predicate_pushdown: Allow predicate pushdown optimization to pass this node.
projection_pushdown: Allow projection pushdown optimization to pass this node.
slice_pushdown: Allow slice pushdown optimization to pass this node.
no_optimizations: Deprecated since version 1.30.0: This parameter is deprecated and will be removed in a future version. The _pushdown parameters now default to False, so this parameter is no longer needed.
schema: Output schema of the function, if set to None we assume that the schema will remain unchanged by the applied function.
validate_output_schema: It is paramount that polars’ schema is correct. This flag ensures that the output schema of this function is checked against the expected schema. Setting this to False skips the check, but may lead to hard-to-debug bugs.
streamable: Whether the given function is eligible to run with the streaming engine. That means that the function must produce the same result when it is executed in batches or when it is executed on the full dataset.

Warning

The schema of a LazyFrame must always be correct. It is up to the caller of this function to ensure that this invariant is upheld.

It is important that the optimization flags are correct. If the custom function for instance does an aggregation of a column, predicate_pushdown should not be allowed, as this prunes rows and will influence your aggregation results.

Notes

A UDF passed to map_batches must be pure, meaning that it cannot modify or depend on state other than its arguments.

Examples

>>> (
...     pl.LazyFrame(
...         {
...             "a": pl.int_range(-100_000, 0, eager=True),
...             "b": pl.int_range(0, 100_000, eager=True),
...         }
...     )
...     .map_batches(lambda x: 2 * x, streamable=True)
...     .collect(engine="streaming")
... )
shape: (100_000, 2)
┌─────────┬────────┐
│ a       ┆ b      │
│ ---     ┆ ---    │
│ i64     ┆ i64    │
╞═════════╪════════╡
│ -200000 ┆ 0      │
│ -199998 ┆ 2      │
│ -199996 ┆ 4      │
│ -199994 ┆ 6      │
│ …       ┆ …      │
│ -8      ┆ 199992 │
│ -6      ┆ 199994 │
│ -4      ┆ 199996 │
│ -2      ┆ 199998 │
└─────────┴────────┘

match_to_schema( schema: SchemaDict | Schema, *, missing_columns: Literal['insert', 'raise'] | Mapping[str, Literal['insert', 'raise'] | Expr] | Expr = 'raise', missing_struct_fields: Literal['insert', 'raise'] | Mapping[str, Literal['insert', 'raise']] = 'raise', extra_columns: Literal['ignore', 'raise'] = 'raise', extra_struct_fields: Literal['ignore', 'raise'] | Mapping[str, Literal['ignore', 'raise']] = 'raise', integer_cast: Literal['upcast', 'forbid'] | Mapping[str, Literal['upcast', 'forbid']] = 'forbid', float_cast: Literal['upcast', 'forbid'] | Mapping[str, Literal['upcast', 'forbid']] = 'forbid', ) → LazyFrame[source]

Match or evolve the schema of a LazyFrame into a specific schema.

By default, match_to_schema returns an error if the input schema does not exactly match the target schema. It also allows columns to be freely reordered, with additional coercion rules available through optional parameters.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Parameters:

schema

Target schema to match or evolve to.

missing_columns

Raise or insert missing columns from the input with respect to the schema.

This can also be an expression per column with what to insert if it is missing.

missing_struct_fields

Raise or insert missing struct fields from the input with respect to the schema.

extra_columns

Raise or ignore extra columns from the input with respect to the schema.

extra_struct_fields

Raise or ignore extra struct fields from the input with respect to the schema.

integer_cast

Forbid or upcast integer columns from the input to the respective column in schema.

float_cast

Forbid or upcast float columns from the input to the respective column in schema.

Examples

Ensuring the schema matches

>>> lf = pl.LazyFrame({"a": [1, 2, 3], "b": ["A", "B", "C"]})
>>> lf.match_to_schema({"a": pl.Int64, "b": pl.String}).collect()
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ A   │
│ 2   ┆ B   │
│ 3   ┆ C   │
└─────┴─────┘
>>> (lf.match_to_schema({"a": pl.Int64}).collect())  
polars.exceptions.SchemaError: extra columns in `match_to_schema`: "b"

Adding missing columns

>>> (
...     pl.LazyFrame({"a": [1, 2, 3]})
...     .match_to_schema(
...         {"a": pl.Int64, "b": pl.String},
...         missing_columns="insert",
...     )
...     .collect()
... )
shape: (3, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ str  │
╞═════╪══════╡
│ 1   ┆ null │
│ 2   ┆ null │
│ 3   ┆ null │
└─────┴──────┘
>>> (
...     pl.LazyFrame({"a": [1, 2, 3]})
...     .match_to_schema(
...         {"a": pl.Int64, "b": pl.String},
...         missing_columns={"b": pl.col.a.cast(pl.String)},
...     )
...     .collect()
... )
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ 1   │
│ 2   ┆ 2   │
│ 3   ┆ 3   │
└─────┴─────┘

Removing extra columns

>>> (
...     pl.LazyFrame({"a": [1, 2, 3], "b": ["A", "B", "C"]})
...     .match_to_schema(
...         {"a": pl.Int64},
...         extra_columns="ignore",
...     )
...     .collect()
... )
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘

Upcasting integers and floats

>>> (
...     pl.LazyFrame(
...         {"a": [1, 2, 3], "b": [1.0, 2.0, 3.0]},
...         schema={"a": pl.Int32, "b": pl.Float32},
...     )
...     .match_to_schema(
...         {"a": pl.Int64, "b": pl.Float64},
...         integer_cast="upcast",
...         float_cast="upcast",
...     )
...     .collect()
... )
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 1.0 │
│ 2   ┆ 2.0 │
│ 3   ┆ 3.0 │
└─────┴─────┘
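The column-level rules described above can be modelled in a few lines of plain Python. This is only a sketch of the documented missing_columns and extra_columns behaviour, working on column names alone (no dtypes, struct fields, or per-column mappings). The helper plan_columns is hypothetical and not part of Polars, and emitting columns in target-schema order is an assumption.

```python
def plan_columns(
    input_columns, target_columns, missing_columns="raise", extra_columns="raise"
):
    """Return the output columns as (name, source) pairs.

    source is "input" for columns taken from the input and "null" for
    columns that had to be inserted. Output order follows the target
    schema (an assumption of this sketch).
    """
    extra = [name for name in input_columns if name not in target_columns]
    if extra and extra_columns == "raise":
        raise ValueError(f"extra columns: {extra}")

    plan = []
    for name in target_columns:
        if name in input_columns:
            plan.append((name, "input"))
        elif missing_columns == "insert":
            plan.append((name, "null"))
        else:
            raise ValueError(f"missing column: {name}")
    return plan


# Input order may differ from the target order.
reordered = plan_columns(["b", "a"], ["a", "b"])
inserted = plan_columns(["a"], ["a", "b"], missing_columns="insert")
dropped = plan_columns(["a", "b"], ["a"], extra_columns="ignore")
```

With both options left at "raise", any mismatch in column names raises, which mirrors the strict default described above.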

max() → LazyFrame[source]

Aggregate the columns in the LazyFrame to their maximum value.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.max().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 4   ┆ 2   │
└─────┴─────┘

mean() → LazyFrame[source]

Aggregate the columns in the LazyFrame to their mean value.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.mean().collect()
shape: (1, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ f64 ┆ f64  │
╞═════╪══════╡
│ 2.5 ┆ 1.25 │
└─────┴──────┘

median() → LazyFrame[source]

Aggregate the columns in the LazyFrame to their median value.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.median().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 2.5 ┆ 1.0 │
└─────┴─────┘

Unpivot a DataFrame from wide to long format.

Optionally leaves identifiers set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars) while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis leaving just two non-identifier columns, ‘variable’ and ‘value’.

Deprecated since version 1.0.0: Use the unpivot() method instead.

Parameters:

id_vars: Column(s) or selector(s) to use as identifier variables.
value_vars: Column(s) or selector(s) to use as value variables; if value_vars is empty, all columns that are not in id_vars will be used.
variable_name: Name to give to the variable column. Defaults to “variable”.
value_name: Name to give to the value column. Defaults to “value”.
streamable: Allow this node to run in the streaming engine. If this runs in streaming, the output of the unpivot operation will not have a stable ordering.
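To make the wide-to-long reshaping concrete, here is a minimal pure-Python sketch over a list of row dicts. The helper unpivot_rows is hypothetical; it only mirrors the id_vars / value_vars / variable_name / value_name semantics described above, and the column-by-column row order it produces is a choice of this sketch, not a guarantee about Polars.

```python
def unpivot_rows(
    rows, id_vars, value_vars=None, variable_name="variable", value_name="value"
):
    """Turn wide row dicts into long row dicts."""
    if not rows:
        return []
    if not value_vars:
        # Empty value_vars: use every column that is not an identifier.
        value_vars = [name for name in rows[0] if name not in id_vars]

    long_rows = []
    for column in value_vars:
        for row in rows:
            long_row = {name: row[name] for name in id_vars}
            long_row[variable_name] = column  # which column the value came from
            long_row[value_name] = row[column]  # the value itself
            long_rows.append(long_row)
    return long_rows


wide = [{"id": "x", "a": 1, "b": 2}, {"id": "y", "a": 3, "b": 4}]
long = unpivot_rows(wide, id_vars=["id"])
```

Each input row contributes one output row per value column, so two rows with two value columns become four rows.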

merge_sorted( other: LazyFrame, key: str | Sequence[str], *, maintain_order: bool = False, ) → LazyFrame[source]

Take two sorted LazyFrames and merge them by the sorted key.

The output of this operation will also be sorted. It is the caller's responsibility to ensure that the frames are sorted in ascending order by the key(s), with null keys at the end; otherwise the order of the output will not make sense.

The schemas of both LazyFrames must be equal.

Parameters:

other: The other LazyFrame to merge with.
key: Key column(s) that the frames are sorted by. A single column name or a sequence of column names can be passed. When multiple keys are given the frames are merged as if sorted by those keys in order.
maintain_order: If True, the output is guaranteed to have left-biased ordering for equal keys: rows from the left frame appear before rows from the right frame when their keys are equal.

Notes

Unless maintain_order=True, no guarantee is given about the output row order when the key is equal in both frames.

The key(s) must be sorted in ascending order.

Examples

>>> df0 = pl.LazyFrame(
...     {"name": ["steve", "elise", "bob"], "age": [42, 44, 18]}
... ).sort("age")
>>> df0.collect()
shape: (3, 2)
┌───────┬─────┐
│ name  ┆ age │
│ ---   ┆ --- │
│ str   ┆ i64 │
╞═══════╪═════╡
│ bob   ┆ 18  │
│ steve ┆ 42  │
│ elise ┆ 44  │
└───────┴─────┘
>>> df1 = pl.LazyFrame(
...     {"name": ["anna", "megan", "steve", "thomas"], "age": [21, 33, 42, 20]}
... ).sort("age")
>>> df1.collect()
shape: (4, 2)
┌────────┬─────┐
│ name   ┆ age │
│ ---    ┆ --- │
│ str    ┆ i64 │
╞════════╪═════╡
│ thomas ┆ 20  │
│ anna   ┆ 21  │
│ megan  ┆ 33  │
│ steve  ┆ 42  │
└────────┴─────┘
>>> df0.merge_sorted(df1, key="age").collect()
shape: (7, 2)
┌────────┬─────┐
│ name   ┆ age │
│ ---    ┆ --- │
│ str    ┆ i64 │
╞════════╪═════╡
│ bob    ┆ 18  │
│ thomas ┆ 20  │
│ anna   ┆ 21  │
│ megan  ┆ 33  │
│ steve  ┆ 42  │
│ steve  ┆ 42  │
│ elise  ┆ 44  │
└────────┴─────┘

Multiple keys can be passed to merge frames sorted by a composite key. The frames are merged as if sorted by key_1, then key_2.

>>> df0 = pl.LazyFrame({"key_1": [1, 1, 3], "key_2": [1, 4, 2]})
>>> df1 = pl.LazyFrame({"key_1": [1, 2, 3], "key_2": [2, 1, 1]})
>>> df0.merge_sorted(df1, key=["key_1", "key_2"]).collect()
shape: (6, 2)
┌───────┬───────┐
│ key_1 ┆ key_2 │
│ ---   ┆ ---   │
│ i64   ┆ i64   │
╞═══════╪═══════╡
│ 1     ┆ 1     │
│ 1     ┆ 2     │
│ 1     ┆ 4     │
│ 2     ┆ 1     │
│ 3     ┆ 1     │
│ 3     ┆ 2     │
└───────┴───────┘
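The merge itself is the classic two-way merge of pre-sorted inputs. As a rough illustration in plain Python (not Polars's implementation), the standard library's heapq.merge does the same job on lists of row tuples. The helper merge_sorted_rows and the third "source" field are hypothetical additions for this sketch. In CPython, heapq.merge resolves ties in favour of the earlier iterable, which resembles the left-biased order described for maintain_order=True.

```python
import heapq


def merge_sorted_rows(left, right, key):
    """Merge two lists of rows that are each already sorted by key."""
    return list(heapq.merge(left, right, key=key))


# Rows are (name, age, source); both inputs are sorted by age.
left = [("bob", 18, "df0"), ("steve", 42, "df0"), ("elise", 44, "df0")]
right = [
    ("thomas", 20, "df1"),
    ("anna", 21, "df1"),
    ("megan", 33, "df1"),
    ("steve", 42, "df1"),
]

merged = merge_sorted_rows(left, right, key=lambda row: row[1])
ages = [row[1] for row in merged]
tie_sources = [row[2] for row in merged if row[1] == 42]
```

If either input were not sorted by the key, the merged result would simply interleave the rows incorrectly; nothing in the merge step detects that.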

min() → LazyFrame[source]

Aggregate the columns in the LazyFrame to their minimum value.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.min().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 1   │
└─────┴─────┘

null_count() → LazyFrame[source]

Aggregate the columns in the LazyFrame as the sum of their null value count.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, None, 3],
...         "bar": [6, 7, None],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.null_count().collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 0   │
└─────┴─────┴─────┘

pipe( function: Callable[Concatenate[LazyFrame, P], T], *args: P.args, **kwargs: P.kwargs, ) → T[source]

Offers a structured way to apply a sequence of user-defined functions (UDFs).

Parameters:

function: Callable; will receive the frame as the first parameter, followed by any given args/kwargs.
*args: Arguments to pass to the UDF.
**kwargs: Keyword arguments to pass to the UDF.

See also

pipe_with_schema

Examples

>>> def cast_str_to_int(lf: pl.LazyFrame, col_name: str) -> pl.LazyFrame:
...     return lf.with_columns(pl.col(col_name).cast(pl.Int64))
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": ["10", "20", "30", "40"],
...     }
... )
>>> lf.pipe(cast_str_to_int, col_name="b").collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 10  │
│ 2   ┆ 20  │
│ 3   ┆ 30  │
│ 4   ┆ 40  │
└─────┴─────┘

>>> lf = pl.LazyFrame(
...     {
...         "b": [1, 2],
...         "a": [3, 4],
...     }
... )
>>> lf.collect()
shape: (2, 2)
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘
>>> lf.pipe(lambda lf: lf.select(sorted(lf.collect_schema()))).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 1   │
│ 4   ┆ 2   │
└─────┴─────┘
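The calling convention is simple enough to sketch without Polars: pipe passes the frame as the first argument, followed by any extra positional and keyword arguments. MiniFrame below is a hypothetical stand-in class used only to show how chained pipe calls compare with nested function calls.

```python
class MiniFrame:
    """Hypothetical stand-in for a frame; holds a list of row dicts."""

    def __init__(self, rows):
        self.rows = rows

    def pipe(self, function, *args, **kwargs):
        # The frame goes first, then whatever else the caller supplied.
        return function(self, *args, **kwargs)


def add_constant(frame, column, amount):
    return MiniFrame([{**row, column: row[column] + amount} for row in frame.rows])


def keep_at_least(frame, column, minimum):
    return MiniFrame([row for row in frame.rows if row[column] >= minimum])


frame = MiniFrame([{"a": 1}, {"a": 2}, {"a": 3}])

# Chained form: reads top to bottom in the order the steps run.
piped = frame.pipe(add_constant, "a", amount=10).pipe(keep_at_least, "a", minimum=12)

# Equivalent nested form: reads inside out.
nested = keep_at_least(add_constant(frame, "a", amount=10), "a", minimum=12)
```

Both forms produce the same rows; the chained form keeps a sequence of user-defined steps readable as the sequence grows.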

pipe_with_schema( function: Callable[[LazyFrame, Schema], LazyFrame], ) → LazyFrame[source]

Allows altering the lazy frame during the plan stage, using the resolved schema.

In contrast to pipe, this method does not execute function immediately but only during the plan stage. This allows using the resolved schema of the input to dynamically alter the lazy frame. This also means that any exceptions raised by function will only be emitted during the plan stage.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Parameters:

function: Callable; will receive the frame as the first parameter and the resolved schema as the second parameter.

See also

pipe

Examples

>>> def cast_to_float_if_necessary(
...     lf: pl.LazyFrame, schema: pl.Schema
... ) -> pl.LazyFrame:
...     required_casts = [
...         pl.col(name).cast(pl.Float64)
...         for name, dtype in schema.items()
...         if not dtype.is_float()
...     ]
...     return lf.with_columns(required_casts)
>>> lf = pl.LazyFrame(
...     {"a": [1.0, 2.0], "b": ["1.0", "2.5"], "c": [2.0, 3.0]},
...     schema={"a": pl.Float64, "b": pl.String, "c": pl.Float32},
... )
>>> lf.pipe_with_schema(cast_to_float_if_necessary).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f32 │
╞═════╪═════╪═════╡
│ 1.0 ┆ 1.0 ┆ 2.0 │
│ 2.0 ┆ 2.5 ┆ 3.0 │
└─────┴─────┴─────┘

pivot( on: ColumnNameOrSelector | Sequence[ColumnNameOrSelector], on_columns: Sequence[Any] | Series | DataFrame, *, index: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None, values: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None, aggregate_function: PivotAgg | Expr | None = None, maintain_order: bool = False, separator: str = '_', column_naming: Literal['auto', 'combine'] = 'auto', ) → LazyFrame[source]

Create a spreadsheet-style pivot table as a LazyFrame.

Parameters:

on

The column(s) whose values will be used as the new columns of the output DataFrame.

on_columns

What value combinations will be considered for the output table.

index

The column(s) that remain from the input to the output. The output DataFrame will have one row for each unique combination of the index’s values. If None, all remaining columns not specified in on and values will be used. At least one of index and values must be specified.

values

The existing column(s) of values which will be moved under the new columns from index. If an aggregation is specified, these are the values on which the aggregation will be computed. If None, all remaining columns not specified in on and index will be used. At least one of index and values must be specified.

aggregate_function

Choose from:

None: no aggregation takes place, will raise error if multiple values are in group.
A predefined aggregate function string, one of {‘min’, ‘max’, ‘first’, ‘last’, ‘sum’, ‘mean’, ‘median’, ‘len’, ‘item’}
An expression to do the aggregation. The expression can only access data from the respective ‘values’ columns as generated by pivot, through pl.element().

maintain_order

Ensure the values of index are sorted by discovery order.

separator

Used as separator/delimiter in generated column names in case of multiple values columns.

column_naming{‘auto’, ‘combine’}

How resulting column names will be constructed.

‘auto’: The default; combine with separator if there are multiple values columns, otherwise just use the on_columns names.
‘combine’: Always combine the values columns’ names with the on_columns names.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Returns:

LazyFrame

Notes

In some other frameworks, you might know this operation as pivot_wider.

Examples

You can use pivot to reshape a dataframe from “long” to “wide” format.

For example, suppose we have a dataframe of test scores achieved by some students, where each row represents a distinct test.

>>> df = pl.DataFrame(
...     {
...         "name": ["Cady", "Cady", "Karen", "Karen"],
...         "subject": ["maths", "physics", "maths", "physics"],
...         "test_1": [98, 99, 61, 58],
...         "test_2": [100, 100, 60, 60],
...     }
... )
>>> df
shape: (4, 4)
┌───────┬─────────┬────────┬────────┐
│ name  ┆ subject ┆ test_1 ┆ test_2 │
│ ---   ┆ ---     ┆ ---    ┆ ---    │
│ str   ┆ str     ┆ i64    ┆ i64    │
╞═══════╪═════════╪════════╪════════╡
│ Cady  ┆ maths   ┆ 98     ┆ 100    │
│ Cady  ┆ physics ┆ 99     ┆ 100    │
│ Karen ┆ maths   ┆ 61     ┆ 60     │
│ Karen ┆ physics ┆ 58     ┆ 60     │
└───────┴─────────┴────────┴────────┘

Using pivot, we can reshape so we have one row per student, with different subjects as columns, and their test_1 scores as values:

>>> df.lazy().pivot(
...     "subject",
...     on_columns=["maths", "physics"],
...     index="name",
...     values="test_1",
... ).collect()  
shape: (2, 3)
┌───────┬───────┬─────────┐
│ name  ┆ maths ┆ physics │
│ ---   ┆ ---   ┆ ---     │
│ str   ┆ i64   ┆ i64     │
╞═══════╪═══════╪═════════╡
│ Cady  ┆ 98    ┆ 99      │
│ Karen ┆ 61    ┆ 58      │
└───────┴───────┴─────────┘

You can use selectors too - here we include all test scores in the pivoted table:

>>> import polars.selectors as cs
>>> df.lazy().pivot(
...     "subject",
...     on_columns=["maths", "physics"],
...     values=cs.starts_with("test"),
... ).collect()  
shape: (2, 5)
┌───────┬──────────────┬────────────────┬──────────────┬────────────────┐
│ name  ┆ test_1_maths ┆ test_1_physics ┆ test_2_maths ┆ test_2_physics │
│ ---   ┆ ---          ┆ ---            ┆ ---          ┆ ---            │
│ str   ┆ i64          ┆ i64            ┆ i64          ┆ i64            │
╞═══════╪══════════════╪════════════════╪══════════════╪════════════════╡
│ Cady  ┆ 98           ┆ 99             ┆ 100          ┆ 100            │
│ Karen ┆ 61           ┆ 58             ┆ 60           ┆ 60             │
└───────┴──────────────┴────────────────┴──────────────┴────────────────┘

If you end up with multiple values per cell, you can specify how to aggregate them with aggregate_function:

>>> lf = pl.LazyFrame(
...     {
...         "ix": [1, 1, 2, 2, 1, 2],
...         "col": ["a", "a", "a", "a", "b", "b"],
...         "foo": [0, 1, 2, 2, 7, 1],
...         "bar": [0, 2, 0, 0, 9, 4],
...     }
... )
>>> lf.pivot(
...     "col", on_columns=["a", "b"], index="ix", aggregate_function="sum"
... ).collect()  
shape: (2, 5)
┌─────┬───────┬───────┬───────┬───────┐
│ ix  ┆ foo_a ┆ foo_b ┆ bar_a ┆ bar_b │
│ --- ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64 ┆ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═════╪═══════╪═══════╪═══════╪═══════╡
│ 1   ┆ 1     ┆ 7     ┆ 2     ┆ 9     │
│ 2   ┆ 4     ┆ 1     ┆ 0     ┆ 4     │
└─────┴───────┴───────┴───────┴───────┘

You can also pass a custom aggregation function using polars.element():

>>> lf = pl.LazyFrame(
...     {
...         "col1": ["a", "a", "a", "b", "b", "b"],
...         "col2": ["x", "x", "x", "x", "y", "y"],
...         "col3": [6, 7, 3, 2, 5, 7],
...     }
... )
>>> lf.pivot(
...     "col2",
...     on_columns=["x", "y"],
...     index="col1",
...     values="col3",
...     aggregate_function=pl.element().tanh().mean(),
... ).collect()  
shape: (2, 3)
┌──────┬──────────┬──────────┐
│ col1 ┆ x        ┆ y        │
│ ---  ┆ ---      ┆ ---      │
│ str  ┆ f64      ┆ f64      │
╞══════╪══════════╪══════════╡
│ a    ┆ 0.998347 ┆ null     │
│ b    ┆ 0.964028 ┆ 0.999954 │
└──────┴──────────┴──────────┘

profile( *, type_coercion: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, no_optimization: bool = False, slice_pushdown: bool = True, comm_subplan_elim: bool = True, comm_subexpr_elim: bool = True, cluster_with_columns: bool = True, collapse_joins: bool = True, show_plot: bool = False, truncate_nodes: int = 0, figsize: tuple[int, int] = (18, 8), engine: EngineType = 'auto', optimizations: QueryOptFlags = (), **_kwargs: Any, ) → tuple[DataFrame, DataFrame][source]

Profile a LazyFrame.

Deprecated since version 1.43.0: It was made for the older in-memory engine, but from version 2.0, Polars uses a streaming engine by default. Due to the concurrent nature of the streaming engine, the profiling information from this function would be misleading.

This will run the query and return a tuple containing the materialized DataFrame and a DataFrame that contains profiling information of each node that is executed.

The units of the timings are microseconds.

Parameters:

type_coercion

Do type coercion optimization.

Deprecated since version 1.30.0: Use the optimizations parameter.

predicate_pushdown

Do predicate pushdown optimization.

Deprecated since version 1.30.0: Use the optimizations parameter.

projection_pushdown

Do projection pushdown optimization.

Deprecated since version 1.30.0: Use the optimizations parameter.

simplify_expression

Run simplify expressions optimization.

Deprecated since version 1.30.0: Use the optimizations parameter.

no_optimization

Turn off (certain) optimizations.

Deprecated since version 1.30.0: Use the optimizations parameter.

slice_pushdown

Slice pushdown optimization.

Deprecated since version 1.30.0: Use the optimizations parameter.

comm_subplan_elim

Will try to cache branching subplans that occur on self-joins or unions.

Deprecated since version 1.30.0: Use the optimizations parameter.

comm_subexpr_elim

Common subexpressions will be cached and reused.

Deprecated since version 1.30.0: Use the optimizations parameter.

cluster_with_columns

Combine sequential independent calls to with_columns.

Deprecated since version 1.30.0: Use the optimizations parameter.

collapse_joins

Collapse a join and filters into a faster join.

Deprecated since version 1.30.0: Use the optimizations parameter.

show_plot

Show a Gantt chart of the profiling result.

truncate_nodes

Truncate the label lengths in the Gantt chart to this number of characters.

figsize

Matplotlib figsize of the profiling plot.

engine

Select the engine used to process the query (default "auto"):

"auto": use the engine set by Config.set_engine_affinity or the POLARS_ENGINE_AFFINITY environment variable, falling back to "in-memory" if unset (this default may change in a future release).
"in-memory": use the in-memory engine; this is the default engine.
"streaming": use the streaming engine, which processes queries in batches, reducing memory pressure and often outperforming the in-memory engine. This will soon become the default engine of Polars.
"gpu": use the CUDA GPU engine (requires an Nvidia GPU and cudf-polars). Pass a GPUEngine object for fine-grained control (e.g. device selection on multi-GPU systems).

If the selected engine cannot run the query, Polars falls back to the in-memory engine.

Note

GPU mode is considered unstable. Not all queries will run successfully on the GPU; however, they should fall back transparently to the default engine if execution is not supported.

Running with POLARS_VERBOSE=1 will provide information if a query falls back (and why).

optimizations

The optimization passes done during query optimization.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a", maintain_order=True).agg(pl.all().sum()).sort(
...     "a"
... ).profile()  
(shape: (3, 3)
 ┌─────┬─────┬─────┐
 │ a   ┆ b   ┆ c   │
 │ --- ┆ --- ┆ --- │
 │ str ┆ i64 ┆ i64 │
 ╞═════╪═════╪═════╡
 │ a   ┆ 4   ┆ 10  │
 │ b   ┆ 11  ┆ 10  │
 │ c   ┆ 6   ┆ 1   │
 └─────┴─────┴─────┘,
 shape: (3, 3)
 ┌─────────────────────────┬───────┬──────┐
 │ node                    ┆ start ┆ end  │
 │ ---                     ┆ ---   ┆ ---  │
 │ str                     ┆ u64   ┆ u64  │
 ╞═════════════════════════╪═══════╪══════╡
 │ optimization            ┆ 0     ┆ 5    │
 │ group_by_partitioned(a) ┆ 5     ┆ 470  │
 │ sort(a)                 ┆ 475   ┆ 1964 │
 └─────────────────────────┴───────┴──────┘)
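Since start and end are timestamps in microseconds, per-node durations follow by subtraction. The snippet below is a small plain-Python sketch using the numbers copied from the sample output above; real timings differ from run to run, and the variable names are illustrative only.

```python
# (node, start, end) rows copied from the sample profiling output above.
profile_rows = [
    ("optimization", 0, 5),
    ("group_by_partitioned(a)", 5, 470),
    ("sort(a)", 475, 1964),
]

# Duration of each node in microseconds.
durations_us = {node: end - start for node, start, end in profile_rows}

# The node that took the longest.
slowest = max(durations_us, key=durations_us.get)
```

In this sample the sort accounts for most of the elapsed time.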

quantile( quantile: float | Expr, interpolation: QuantileMethod = 'nearest', ) → LazyFrame[source]

Aggregate the columns in the LazyFrame to their quantile value.

Parameters:

quantile: Quantile between 0.0 and 1.0.
interpolation{‘nearest’, ‘higher’, ‘lower’, ‘midpoint’, ‘linear’, ‘equiprobable’}: Interpolation method.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.quantile(0.7).collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 3.0 ┆ 1.0 │
└─────┴─────┘
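The interpolation options can be illustrated with a plain-Python sketch that follows the common textbook convention of locating the quantile at position q * (n - 1) in the sorted data. Whether Polars uses exactly this convention, including how ties are rounded for ‘nearest’, is an assumption here; ‘equiprobable’ and null handling are left out. With the example data and the default ‘nearest’, the sketch gives the same values as the output above.

```python
import math


def quantile(values, q, interpolation="nearest"):
    """Quantile of a non-empty list of numbers, 0.0 <= q <= 1.0."""
    data = sorted(values)
    position = q * (len(data) - 1)  # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)

    if interpolation == "lower":
        return float(data[lower])
    if interpolation == "higher":
        return float(data[upper])
    if interpolation == "midpoint":
        return (data[lower] + data[upper]) / 2
    if interpolation == "linear":
        fraction = position - lower
        return data[lower] + (data[upper] - data[lower]) * fraction
    if interpolation == "nearest":
        # Python's round() resolves exact .5 ties to the even index;
        # Polars may resolve ties differently.
        return float(data[round(position)])
    raise ValueError(f"unsupported interpolation: {interpolation!r}")


a_nearest = quantile([1, 2, 3, 4], 0.7)
b_nearest = quantile([1, 2, 1, 1], 0.7)
a_linear = quantile([1, 2, 3, 4], 0.7, "linear")
```

For column a the fractional position is about 2.1, so ‘lower’ and ‘nearest’ pick the element at index 2, ‘higher’ picks index 3, and ‘linear’ lands between them.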

remote( context: pc.ClientContext | None = None, *, plan_type: pc._typing.PlanTypePreference = 'dot', n_retries: int = 0, engine: pc._typing.Engine = 'auto', scaling_mode: pc._typing.ScalingMode = 'auto', ) → pc.LazyFrameRemote[source]

Run a query remotely on Polars Cloud.

This allows you to run Polars remotely on one or more workers via several strategies for distributed compute.

See also

filter

Notes

If you are transitioning from Pandas, and performing filter operations based on the comparison of two or more columns, please note that in Polars any comparison involving null values will result in a null result, not boolean True or False. As a result, these rows will not be removed. Ensure that null values are handled appropriately to avoid unexpected behaviour (see examples below).

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [2, 3, None, 4, 0],
...         "bar": [5, 6, None, None, 0],
...         "ham": ["a", "b", None, "c", "d"],
...     }
... )

Remove rows matching a condition:

>>> lf.remove(
...     pl.col("bar") >= 5,
... ).collect()
shape: (3, 3)
┌──────┬──────┬──────┐
│ foo  ┆ bar  ┆ ham  │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ str  │
╞══════╪══════╪══════╡
│ null ┆ null ┆ null │
│ 4    ┆ null ┆ c    │
│ 0    ┆ 0    ┆ d    │
└──────┴──────┴──────┘

Discard rows based on multiple conditions, combined with and/or operators:

>>> lf.remove(
...     (pl.col("foo") >= 0) & (pl.col("bar") >= 0),
... ).collect()
shape: (2, 3)
┌──────┬──────┬──────┐
│ foo  ┆ bar  ┆ ham  │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ str  │
╞══════╪══════╪══════╡
│ null ┆ null ┆ null │
│ 4    ┆ null ┆ c    │
└──────┴──────┴──────┘

>>> lf.remove(
...     (pl.col("foo") >= 0) | (pl.col("bar") >= 0),
... ).collect()
shape: (1, 3)
┌──────┬──────┬──────┐
│ foo  ┆ bar  ┆ ham  │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ str  │
╞══════╪══════╪══════╡
│ null ┆ null ┆ null │
└──────┴──────┴──────┘

Provide multiple constraints using *args syntax:

>>> lf.remove(
...     pl.col("ham").is_not_null(),
...     pl.col("bar") >= 0,
... ).collect()
shape: (2, 3)
┌──────┬──────┬──────┐
│ foo  ┆ bar  ┆ ham  │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ str  │
╞══════╪══════╪══════╡
│ null ┆ null ┆ null │
│ 4    ┆ null ┆ c    │
└──────┴──────┴──────┘

Provide constraint(s) using **kwargs syntax:

>>> lf.remove(foo=0, bar=0).collect()
shape: (4, 3)
┌──────┬──────┬──────┐
│ foo  ┆ bar  ┆ ham  │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ str  │
╞══════╪══════╪══════╡
│ 2    ┆ 5    ┆ a    │
│ 3    ┆ 6    ┆ b    │
│ null ┆ null ┆ null │
│ 4    ┆ null ┆ c    │
└──────┴──────┴──────┘

Remove rows by comparing two columns against each other; in this case, we remove rows where the two columns are not equal (using ne_missing to ensure that null values compare equal):

>>> lf.remove(
...     pl.col("foo").ne_missing(pl.col("bar")),
... ).collect()
shape: (2, 3)
┌──────┬──────┬──────┐
│ foo  ┆ bar  ┆ ham  │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ str  │
╞══════╪══════╪══════╡
│ null ┆ null ┆ null │
│ 0    ┆ 0    ┆ d    │
└──────┴──────┴──────┘
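The null behaviour in these examples follows three-valued logic: a predicate evaluates to True, False, or null, and only rows where it is True are removed. The plain-Python sketch below models null as None and reproduces the and/or examples above. All helper names (ge, kleene_and, kleene_or, remove_rows, filter_rows) are hypothetical and exist only for this illustration.

```python
def ge(value, bound):
    """value >= bound, where a null operand gives a null result."""
    return None if value is None else value >= bound


def kleene_and(a, b):
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def kleene_or(a, b):
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def remove_rows(rows, predicate):
    # Drop a row only when the predicate is definitely True.
    return [row for row in rows if predicate(row) is not True]


def filter_rows(rows, predicate):
    # Keep a row only when the predicate is definitely True.
    return [row for row in rows if predicate(row) is True]


rows = [
    {"foo": 2, "bar": 5, "ham": "a"},
    {"foo": 3, "bar": 6, "ham": "b"},
    {"foo": None, "bar": None, "ham": None},
    {"foo": 4, "bar": None, "ham": "c"},
    {"foo": 0, "bar": 0, "ham": "d"},
]


def both(row):
    return kleene_and(ge(row["foo"], 0), ge(row["bar"], 0))


def either(row):
    return kleene_or(ge(row["foo"], 0), ge(row["bar"], 0))


kept_after_and = [row["ham"] for row in remove_rows(rows, both)]
kept_after_or = [row["ham"] for row in remove_rows(rows, either)]
kept_by_filter = [row["ham"] for row in filter_rows(rows, both)]
```

Rows whose predicate is null survive remove and are dropped by filter, so the two operations are complementary only when the predicate has no null results.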

rename( mapping: Mapping[str, str] | Callable[[str], str], *, strict: bool = True, ) → LazyFrame[source]

Rename column names.

Parameters:

mapping: Key value pairs that map from old name to new name, or a function that takes the old name as input and returns the new name.
strict: Validate that all column names exist in the current schema, and throw an exception if any do not. (Note that this parameter is a no-op when passing a function to mapping).

See also

Expr.name.replace

Notes

If existing names are swapped (e.g. ‘A’ points to ‘B’ and ‘B’ points to ‘A’), polars will block projection and predicate pushdowns at this node.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.rename({"foo": "apple"}).collect()
shape: (3, 3)
┌───────┬─────┬─────┐
│ apple ┆ bar ┆ ham │
│ ---   ┆ --- ┆ --- │
│ i64   ┆ i64 ┆ str │
╞═══════╪═════╪═════╡
│ 1     ┆ 6   ┆ a   │
│ 2     ┆ 7   ┆ b   │
│ 3     ┆ 8   ┆ c   │
└───────┴─────┴─────┘
>>> lf.rename(lambda column_name: "c" + column_name[1:]).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ coo ┆ car ┆ cam │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
│ 2   ┆ 7   ┆ b   │
│ 3   ┆ 8   ┆ c   │
└─────┴─────┴─────┘
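The two accepted forms of mapping, and the effect of strict, can be sketched on a plain list of column names. rename_columns is a hypothetical helper, and KeyError merely stands in for whatever exception Polars raises for an unknown column.

```python
def rename_columns(columns, mapping, strict=True):
    """Return the renamed column list for a dict or callable mapping."""
    if callable(mapping):
        # A function is applied to every name; strict has no effect here.
        return [mapping(name) for name in columns]

    if strict:
        unknown = [old for old in mapping if old not in columns]
        if unknown:
            raise KeyError(f"columns not found: {unknown}")

    # Names without an entry in the mapping are kept as they are.
    return [mapping.get(name, name) for name in columns]


columns = ["foo", "bar", "ham"]
by_dict = rename_columns(columns, {"foo": "apple"})
by_function = rename_columns(columns, lambda name: "c" + name[1:])
lenient = rename_columns(columns, {"missing": "x"}, strict=False)
```

With strict=False, entries for columns that do not exist are silently ignored, which is convenient when one mapping is reused across frames with different schemas.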

reverse() → LazyFrame[source]

Reverse the LazyFrame.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "key": ["a", "b", "c"],
...         "val": [1, 2, 3],
...     }
... )
>>> lf.reverse().collect()
shape: (3, 2)
┌─────┬─────┐
│ key ┆ val │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ c   ┆ 3   │
│ b   ┆ 2   │
│ a   ┆ 1   │
└─────┴─────┘

Create rolling groups based on a temporal or integer column.

Unlike group_by_dynamic, the windows are determined by the individual values rather than by constant intervals. For constant intervals use LazyFrame.group_by_dynamic().

If you have a time series <t_0, t_1, ..., t_n>, then by default the windows created will be

(t_0 - period, t_0]

(t_1 - period, t_1]

…

(t_n - period, t_n]

whereas if you pass a non-default offset, then the windows will be

(t_0 + offset, t_0 + offset + period]

(t_1 + offset, t_1 + offset + period]

…

(t_n + offset, t_n + offset + period]

The period and offset arguments are created either from a timedelta, or by using the following string language:

1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)

Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds

By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.
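To show how a combined string such as “3d12h4m25s” breaks down into the units listed above, here is a small plain-Python tokenizer. It is only a sketch: parse_duration is a hypothetical helper, not the parser Polars uses, and it does not handle signs or convert calendar units into fixed lengths.

```python
import re

# Two-letter units are listed before the one-letter units they start with
# ("ms" and "mo" before "m"), so the longer unit is tried first.
_UNITS = ("ns", "us", "ms", "mo", "s", "m", "h", "d", "w", "q", "y", "i")
_TOKEN = re.compile(r"(\d+)(" + "|".join(_UNITS) + ")")


def parse_duration(text):
    """Split a duration string into (count, unit) pairs."""
    parts = []
    position = 0
    while position < len(text):
        match = _TOKEN.match(text, position)
        if match is None:
            raise ValueError(f"cannot parse {text!r} at offset {position}")
        parts.append((int(match.group(1)), match.group(2)))
        position = match.end()
    return parts


combined = parse_duration("3d12h4m25s")
one_month = parse_duration("1mo")
one_minute = parse_duration("1m")
```

Note that “1m” is a minute while “1mo” is a calendar month, which is why the order of alternatives in the pattern matters.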

Changed in version 0.20.14: The by parameter was renamed group_by.

Parameters:

index_column

Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if group_by is specified, then it must be sorted in ascending order within each group).

In case of a rolling group by on indices, dtype needs to be one of {UInt32, UInt64, Int32, Int64}. Note that the first three get temporarily cast to Int64, so if performance matters use an Int64 column.

period

Length of the window - must be non-negative.

offset

Offset of the window. Default is -period.

closed{‘right’, ‘left’, ‘both’, ‘none’}

Define which sides of the temporal interval are closed (inclusive).

group_by

Also group by this column/these columns.

Returns:

LazyGroupBy: Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if group_by columns are passed, it will only be sorted within each group).

See also

group_by_dynamic

Examples

>>> dates = [
...     "2020-01-01 13:45:48",
...     "2020-01-01 16:42:13",
...     "2020-01-01 16:45:09",
...     "2020-01-02 18:12:48",
...     "2020-01-03 19:45:32",
...     "2020-01-08 23:16:43",
... ]
>>> df = pl.LazyFrame({"dt": dates, "a": [3, 7, 5, 9, 2, 1]}).with_columns(
...     pl.col("dt").str.strptime(pl.Datetime).set_sorted()
... )
>>> out = (
...     df.rolling(index_column="dt", period="2d")
...     .agg(
...         pl.sum("a").alias("sum_a"),
...         pl.min("a").alias("min_a"),
...         pl.max("a").alias("max_a"),
...     )
...     .collect()
... )
>>> out
shape: (6, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ dt                  ┆ sum_a ┆ min_a ┆ max_a │
│ ---                 ┆ ---   ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ i64   ┆ i64   ┆ i64   │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 13:45:48 ┆ 3     ┆ 3     ┆ 3     │
│ 2020-01-01 16:42:13 ┆ 10    ┆ 3     ┆ 7     │
│ 2020-01-01 16:45:09 ┆ 15    ┆ 3     ┆ 7     │
│ 2020-01-02 18:12:48 ┆ 24    ┆ 3     ┆ 9     │
│ 2020-01-03 19:45:32 ┆ 11    ┆ 2     ┆ 9     │
│ 2020-01-08 23:16:43 ┆ 1     ┆ 1     ┆ 1     │
└─────────────────────┴───────┴───────┴───────┘
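The window rule can be checked by hand with a short plain-Python sketch. It builds, for each timestamp t, the right-closed window (t + offset, t + offset + period] with offset defaulting to -period, and sums the values that fall inside. rolling_sum is a hypothetical helper; it treats “2d” as a fixed 48 hours, which is adequate for these naive timestamps but ignores the calendar-day subtleties mentioned earlier.

```python
from datetime import datetime, timedelta


def rolling_sum(times, values, period, offset=None):
    """Sum of values inside (t + offset, t + offset + period] for each t."""
    if offset is None:
        offset = -period  # default: the window ends at t itself
    sums = []
    for t in times:
        start = t + offset
        end = start + period
        sums.append(sum(v for ts, v in zip(times, values) if start < ts <= end))
    return sums


times = [
    datetime.fromisoformat(text)
    for text in (
        "2020-01-01 13:45:48",
        "2020-01-01 16:42:13",
        "2020-01-01 16:45:09",
        "2020-01-02 18:12:48",
        "2020-01-03 19:45:32",
        "2020-01-08 23:16:43",
    )
]
values = [3, 7, 5, 9, 2, 1]

sum_a = rolling_sum(times, values, period=timedelta(days=2))
```

The resulting list matches the sum_a column above: the fifth row, for example, only sees the values 9 and 2 because the first three timestamps fall before the window's open left edge.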

property schema: Schema[source]

Get an ordered mapping of column names to their data type.

Warning

Resolving the schema of a LazyFrame is a potentially expensive operation. Using collect_schema() is the idiomatic way to resolve the schema. This property exists only for symmetry with the DataFrame class.

See also

collect_schema
Schema

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.schema  
Schema({'foo': Int64, 'bar': Float64, 'ham': String})

select( *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr, ) → LazyFrame[source]

Select columns from this LazyFrame.

Parameters:

*exprs: Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
**named_exprs: Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.

Examples

Pass the name of a column to select that column.

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.select("foo").collect()
shape: (3, 1)
┌─────┐
│ foo │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘

Multiple columns can be selected by passing a list of column names.

>>> lf.select(["foo", "bar"]).collect()
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 6   │
│ 2   ┆ 7   │
│ 3   ┆ 8   │
└─────┴─────┘

Multiple columns can also be selected using positional arguments instead of a list. Expressions are also accepted.

>>> lf.select(pl.col("foo"), pl.col("bar") + 1).collect()
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
│ 3   ┆ 9   │
└─────┴─────┘

Use keyword arguments to easily name your expression inputs.

>>> lf.select(
...     threshold=pl.when(pl.col("foo") > 2).then(10).otherwise(0)
... ).collect()
shape: (3, 1)
┌───────────┐
│ threshold │
│ ---       │
│ i32       │
╞═══════════╡
│ 0         │
│ 0         │
│ 10        │
└───────────┘

select_seq( *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr, ) → LazyFrame[source]

Select columns from this LazyFrame.

This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.

Parameters:

*exprs: Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
**named_exprs: Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.

See also

select

Serialize the logical plan of this LazyFrame to a file or to an in-memory value: bytes in binary format (the default) or a string in JSON format.

Parameters:

file

File path to which the result should be written. If set to None (default), the output is returned as a string instead.

format

The format in which to serialize. Options:

"binary": Serialize to binary format (bytes). This is the default.
"json": Serialize to JSON format (string) (deprecated).

See also

LazyFrame.deserialize

Notes

Serialization is not stable across Polars versions: a LazyFrame serialized in one Polars version may not be deserializable in another Polars version.

Examples

Serialize the logical plan into a binary representation.

>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum()
>>> bytes = lf.serialize()

The bytes can later be deserialized back into a LazyFrame.

>>> import io
>>> pl.LazyFrame.deserialize(io.BytesIO(bytes)).collect()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 6   │
└─────┘

set_sorted( column: str | list[str], *more_columns: str, descending: bool | list[bool] = False, nulls_last: bool | list[bool] = False, ) → LazyFrame[source]

Flag a column as sorted.

This can speed up future operations.

Parameters:

column: Column(s) that are sorted.
more_columns: Columns that are sorted over after column.
descending: Whether the column is sorted in descending order.
nulls_last: Whether the nulls are at the end.

Warning

This can lead to incorrect results if the data is not actually sorted. Use with care.

shift( n: int | IntoExprColumn = 1, *, fill_value: IntoExpr | None = None, ) → LazyFrame[source]

Shift values by the given number of indices.

Parameters:

n: Number of indices to shift forward. If a negative value is passed, values are shifted in the opposite direction instead.
fill_value: Fill the resulting null values with this value. Accepts scalar expression input. Non-expression inputs are parsed as literals.

Notes

This method is similar to the LAG operation in SQL when the value for n is positive. With a negative value for n, it is similar to LEAD.

Examples

By default, values are shifted forward by one index.

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [5, 6, 7, 8],
...     }
... )
>>> lf.shift().collect()
shape: (4, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ null ┆ null │
│ 1    ┆ 5    │
│ 2    ┆ 6    │
│ 3    ┆ 7    │
└──────┴──────┘

Pass a negative value to shift in the opposite direction instead.

>>> lf.shift(-2).collect()
shape: (4, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 3    ┆ 7    │
│ 4    ┆ 8    │
│ null ┆ null │
│ null ┆ null │
└──────┴──────┘

Specify fill_value to fill the resulting null values.

>>> lf.shift(-2, fill_value=100).collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 7   │
│ 4   ┆ 8   │
│ 100 ┆ 100 │
│ 100 ┆ 100 │
└─────┴─────┘

Show the first n rows.

Parameters:

limitint

Number of rows to show. If None is passed, raises a ValueError. This is done to match the signature of DataFrame.show().

ascii_tablesbool

Use ASCII characters to display table outlines. Set False to revert to the default UTF8_FULL_CONDENSED formatting style. See Config.set_ascii_tables() for more information.

decimal_separatorstr

Set the decimal separator character. See Config.set_decimal_separator() for more information.

thousands_separatorstr, bool

Set the thousands grouping separator character. See Config.set_thousands_separator() for more information.

float_precisionint

Number of decimal places to display for floating point values. See Config.set_float_precision() for more information.

fmt_float{“mixed”, “full”}

Control how floating point values are displayed. See Config.set_fmt_float() for more information. Supported options are:

“mixed”: Limit the number of decimal places and use scientific notation for large/small values.
“full”: Print the full precision of the floating point number.

fmt_str_lengthsint

Number of characters to display for string values. See Config.set_fmt_str_lengths() for more information.

fmt_table_cell_list_lenint

Number of elements to display for List values. See Config.set_fmt_table_cell_list_len() for more information.

tbl_cell_alignmentstr

Set table cell alignment. See Config.set_tbl_cell_alignment() for more information. Supported options are:

“LEFT”: left aligned
“CENTER”: center aligned
“RIGHT”: right aligned

tbl_cell_numeric_alignmentstr

Set table cell alignment for numeric columns. See Config.set_tbl_cell_numeric_alignment() for more information. Supported options are:

“LEFT”: left aligned
“CENTER”: center aligned
“RIGHT”: right aligned

tbl_colsint

Number of columns to display. See Config.set_tbl_cols() for more information.

tbl_column_data_type_inlinebool

Moves the data type inline with the column name (to the right, in parentheses). See Config.set_tbl_column_data_type_inline() for more information.

tbl_dataframe_shape_belowbool

Print the DataFrame shape information below the data when displaying tables. See Config.set_tbl_dataframe_shape_below() for more information.

tbl_formattingstr

Set table formatting style. See Config.set_tbl_formatting() for more information. Supported options are:

“ASCII_FULL”: ASCII, with all borders and lines, including row dividers.
“ASCII_FULL_CONDENSED”: Same as ASCII_FULL, but with dense row spacing.
“ASCII_NO_BORDERS”: ASCII, no borders.
“ASCII_BORDERS_ONLY”: ASCII, borders only.
“ASCII_BORDERS_ONLY_CONDENSED”: ASCII, borders only, dense row spacing.
“ASCII_HORIZONTAL_ONLY”: ASCII, horizontal lines only.
“ASCII_MARKDOWN”: Markdown format (ascii ellipses for truncated values).
“MARKDOWN”: Markdown format (utf8 ellipses for truncated values).
“UTF8_FULL”: UTF8, with all borders and lines, including row dividers.
“UTF8_FULL_CONDENSED”: Same as UTF8_FULL, but with dense row spacing.
“UTF8_NO_BORDERS”: UTF8, no borders.
“UTF8_BORDERS_ONLY”: UTF8, borders only.
“UTF8_HORIZONTAL_ONLY”: UTF8, horizontal lines only.
“NOTHING”: No borders or other lines.

tbl_hide_column_data_typesbool

Hide table column data types (i64, f64, str etc.). See Config.set_tbl_hide_column_data_types() for more information.

tbl_hide_column_namesbool

Hide table column names. See Config.set_tbl_hide_column_names() for more information.

tbl_hide_dtype_separatorbool

Hide the ‘---’ separator between the column names and column types. See Config.set_tbl_hide_dtype_separator() for more information.

tbl_hide_dataframe_shapebool

Hide the DataFrame shape information when displaying tables. See Config.set_tbl_hide_dataframe_shape() for more information.

tbl_width_charsint

Set the maximum width of a table in characters. See Config.set_tbl_width_chars() for more information.

trim_decimal_zerosbool

Strip trailing zeros from Decimal data type values. See Config.set_trim_decimal_zeros() for more information.

Warning

This method does not maintain the laziness of the frame, and will collect the final result. This could potentially be an expensive operation.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [7, 8, 9, 10, 11, 12],
...     }
... )
>>> lf.show()
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
│ 3   ┆ 9   │
│ 4   ┆ 10  │
│ 5   ┆ 11  │
└─────┴─────┘
>>> lf.show(2)
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
└─────┴─────┘

show_graph( *, optimized: bool = True, show: bool = True, output_path: str | Path | None = None, raw_output: bool = False, figsize: tuple[float, float] = (16.0, 12.0), type_coercion: bool = True, _type_check: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, slice_pushdown: bool = True, comm_subplan_elim: bool = True, comm_subexpr_elim: bool = True, cluster_with_columns: bool = True, collapse_joins: bool = True, engine: EngineType = 'auto', plan_stage: PlanStage = 'ir', _check_order: bool = True, optimizations: QueryOptFlags = (), ) → str | None[source]

Show a plot of the query plan.

Note that Graphviz must be installed to render the visualization (if not already present, you can download it here: https://graphviz.org/download).

Parameters:

optimized

Optimize the query plan.

show

Show the figure.

output_path

Write the figure to disk.

raw_output

Return dot syntax. This cannot be combined with show and/or output_path.

figsize

Passed to matplotlib if show == True.

type_coercion

Do type coercion optimization.

Deprecated since version 1.30.0: Use the optimizations parameter instead.

predicate_pushdown

Do predicate pushdown optimization.

Deprecated since version 1.30.0: Use the optimizations parameter instead.

projection_pushdown

Do projection pushdown optimization.

Deprecated since version 1.30.0: Use the optimizations parameter instead.

simplify_expression

Run simplify expressions optimization.

Deprecated since version 1.30.0: Use the optimizations parameter instead.

slice_pushdown

Slice pushdown optimization.

Deprecated since version 1.30.0: Use the optimizations parameter instead.

comm_subplan_elim

Will try to cache branching subplans that occur on self-joins or unions.

Deprecated since version 1.30.0: Use the optimizations parameter instead.

comm_subexpr_elim

Common subexpressions will be cached and reused.

Deprecated since version 1.30.0: Use the optimizations parameter instead.

cluster_with_columns

Combine sequential independent calls to with_columns.

Deprecated since version 1.30.0: Use the optimizations parameter instead.

collapse_joins

Collapse a join and filters into a faster join.

Deprecated since version 1.30.0: Use the optimizations parameter instead.

engine

Select the engine used to process the query (default "auto"):

"auto": use the engine set by Config.set_engine_affinity or the POLARS_ENGINE_AFFINITY environment variable, falling back to "in-memory" if unset (this default may change in a future release).
"in-memory": use the in-memory engine, this is the default engine.
"streaming": use the streaming engine, which processes queries in batches, reducing memory pressure and often outperforming the in-memory engine. This will soon become the default engine of Polars.
"gpu": use the CUDA GPU engine (requires an Nvidia GPU and cudf-polars). Pass a GPUEngine object for fine-grained control (e.g. device selection on multi-GPU systems).

If the selected engine cannot run the query, Polars falls back to the in-memory engine.

Note

GPU mode is considered unstable. Not all queries will run successfully on the GPU, however, they should fall back transparently to the default engine if execution is not supported.

Running with POLARS_VERBOSE=1 will provide information if a query falls back (and why).

plan_stage{‘ir’, ‘physical’}

Select the stage to display. Currently only the streaming engine has a separate physical stage; for the other engines, the IR and physical stages are the same.

optimizations

The set of the optimizations considered during query optimization.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a", maintain_order=True).agg(pl.all().sum()).sort(
...     "a"
... ).show_graph()  

sink_batches( function: Callable[[DataFrame], bool | None], *, chunk_size: int | None = None, maintain_order: bool = True, lazy: bool = False, engine: EngineType = 'auto', optimizations: QueryOptFlags = (), ) → LazyFrame | None[source]

Evaluate the query and call a user-defined function for every ready batch.

This allows streaming results that are larger than RAM in certain cases.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Warning

This method is much slower than native sinks. Only use it if you cannot implement your logic otherwise.

Parameters:

function

Function to run with a batch that is ready. If the function returns True, this signals that no more results are needed, allowing for early stopping.

chunk_size

The number of rows that are buffered before the callback is called.

maintain_order

Maintain the order in which data is processed. Setting this to False will be slightly faster.

lazy: bool

Wait to start execution until collect is called.

engine

Select the engine used to process the query (default "auto"):

"auto": use the engine set by Config.set_engine_affinity or the POLARS_ENGINE_AFFINITY environment variable, falling back to "streaming" if unset.
"in-memory": use the in-memory engine before writing, this is the default engine.
"streaming": use the streaming engine, which processes queries in batches, reducing memory pressure and often outperforming the in-memory engine. This will soon become the default engine of Polars.
"gpu": use the CUDA GPU engine (requires an Nvidia GPU and cudf-polars). Pass a GPUEngine object for fine-grained control.

If the selected engine cannot run the query, Polars falls back to the streaming engine.

optimizations

The optimization passes done during query optimization.

This has no effect if lazy is set to True.

Examples

>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")  
>>> lf.sink_batches(lambda df: print(df))  

sink_csv( path: str | Path | IO[bytes] | IO[str] | PartitionBy, *, include_bom: bool = False, compression: Literal['uncompressed', 'gzip', 'zstd'] = 'uncompressed', compression_level: int | None = None, check_extension: bool = True, include_header: bool = True, separator: str = ',', line_terminator: str = '\n', quote_char: str = '"', batch_size: int = 1024, datetime_format: str | None = None, date_format: str | None = None, time_format: str | None = None, float_scientific: bool | None = None, float_precision: int | None = None, decimal_comma: bool = False, null_value: str | None = None, quote_style: CsvQuoteStyle | None = None, maintain_order: bool = True, storage_options: StorageOptionsDict | None = None, credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto', retries: int | None = None, sync_on_close: SyncOnCloseMethod | None = None, mkdir: bool = False, lazy: bool = False, engine: EngineType = 'auto', optimizations: QueryOptFlags = (), ) → LazyFrame | None[source]

Evaluate the query in streaming mode and write to a CSV file.

This allows streaming results that are larger than RAM to be written to disk.

Parameters:

path

File path to which the file should be written.

include_bom

Whether to include UTF-8 BOM in the CSV output.

compression

What compression format to use.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

compression_level

The compression level to use, typically 0-9 or None to let the engine choose.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

check_extension

Whether to check if the filename matches the compression settings. Will raise an error if compression is set to ‘uncompressed’ and the filename ends in one of (“.gz”, “.zst”, “.zstd”) or if compression != ‘uncompressed’ and the file uses a mismatched extension. Only applies if path is a file path.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

include_header

Whether to include header in the CSV output.

separator

Separate CSV fields with this symbol.

line_terminator

String used to end each row.

quote_char

Byte to use as quoting character.

batch_size

Number of rows that will be processed per thread.

datetime_format

A format string, with the specifiers defined by the chrono Rust crate. If no format is specified, the default fractional-second precision is inferred from the maximum timeunit found in the frame’s Datetime cols (if any).

date_format

A format string, with the specifiers defined by the chrono Rust crate.

time_format

A format string, with the specifiers defined by the chrono Rust crate.

float_scientific

Whether to use scientific form always (true), never (false), or automatically (None) for floating-point datatypes.

float_precision

Number of decimal places to write, applied to both Float32 and Float64 datatypes.

decimal_comma

Use a comma as the decimal separator instead of a point. Floats will be encapsulated in quotes if necessary; set the field separator to override.

null_value

A string representing null values (defaulting to the empty string).

quote_style{‘necessary’, ‘always’, ‘non_numeric’, ‘never’}

Determines the quoting strategy used.

necessary (default): This puts quotes around fields only when necessary. They are necessary when fields contain a quote, delimiter or record terminator. Quotes are also necessary when writing an empty record (which is indistinguishable from a record with one empty field).
always: This puts quotes around every field. Always.
never: This never puts quotes around fields, even if that results in invalid CSV data (e.g.: by not quoting strings containing the separator).
non_numeric: This puts quotes around all fields that are non-numeric. Namely, when writing a field that does not parse as a valid float or integer, then quotes will be used even if they aren't strictly necessary.

maintain_order

Maintain the order in which data is processed. Setting this to False will be slightly faster.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

storage_options

Options that indicate how to connect to a cloud provider.

The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

aws
gcp
azure
Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.

If storage_options is not provided, Polars will try to infer the information from environment variables.

credential_provider

Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

retries

Number of retries if accessing a cloud instance fails.

Deprecated since version 1.37.1: Pass {"max_retries": n} via storage_options instead.

sync_on_close: { None, ‘data’, ‘all’ }

Sync to disk before closing a file.

None does not sync.
data syncs the file contents.
all syncs the file contents and metadata.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

mkdir: bool

Recursively create all the directories in the path.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

lazy: bool

Wait to start execution until collect is called.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

engine

Select the engine used to process the query (default "auto"):

"auto": use the engine set by Config.set_engine_affinity or the POLARS_ENGINE_AFFINITY environment variable, falling back to "streaming" if unset.
"in-memory": use the in-memory engine before writing, this is the default engine.
"streaming": use the streaming engine, which processes queries in batches, reducing memory pressure and often outperforming the in-memory engine. This will soon become the default engine of Polars.
"gpu": use the CUDA GPU engine (requires an Nvidia GPU and cudf-polars). Pass a GPUEngine object for fine-grained control.

If the selected engine cannot run the query, Polars falls back to the streaming engine.

optimizations

The optimization passes done during query optimization.

This has no effect if lazy is set to True.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Returns:

DataFrame

See also

PartitionBy

Examples

>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")  
>>> lf.sink_csv("out.csv")  

Sink to a BytesIO object.

>>> import io
>>> buf = io.BytesIO()  
>>> pl.LazyFrame({"x": [1, 2, 1]}).sink_csv(buf)  

Split into a hive-partitioning style partition:

>>> pl.LazyFrame({"x": [1, 2, 1], "y": [3, 4, 5]}).sink_csv(
...     pl.PartitionBy("./out/", key="x"),
...     mkdir=True
... )  

sink_delta( target: str | Path | deltalake.DeltaTable, *, mode: Literal['error', 'append', 'overwrite', 'ignore', 'merge'] = 'error', storage_options: StorageOptionsDict | None = None, credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto', delta_write_options: dict[str, Any] | None = None, delta_merge_options: dict[str, Any] | None = None, optimizations: QueryOptFlags = (), ) → deltalake.table.TableMerger | None[source]

Sink the LazyFrame to a Delta table.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Parameters:

target

URI of a table or a DeltaTable object.

mode{‘error’, ‘append’, ‘overwrite’, ‘ignore’, ‘merge’}

How to handle existing data.

If ‘error’, throw an error if the table already exists (default).
If ‘append’, will add new data.
If ‘overwrite’, will replace table with new data.
If ‘ignore’, will not write anything if table already exists.
If ‘merge’, return a TableMerger object to merge data from the DataFrame with the existing data.

storage_options

Extra options for the storage backends supported by deltalake. For cloud storages, this may include configurations for authentication etc.

See a list of supported storage options for S3 here.
See a list of supported storage options for GCS here.
See a list of supported storage options for Azure here.

credential_provider

Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

delta_write_options

Additional keyword arguments while writing a Delta lake Table. See a list of supported write options here.

delta_merge_options

Keyword arguments which are required to MERGE a Delta lake Table. See a list of supported merge options here.

engine

Select the engine used to process the query (default "auto"):

"auto": use the engine set by Config.set_engine_affinity or the POLARS_ENGINE_AFFINITY environment variable, falling back to "streaming" if unset.
"in-memory": use the in-memory engine before writing, this is the default engine.
"streaming": use the streaming engine, which processes queries in batches, reducing memory pressure and often outperforming the in-memory engine. This will soon become the default engine of Polars.
"gpu": use the CUDA GPU engine (requires an Nvidia GPU and cudf-polars). Pass a GPUEngine object for fine-grained control.

If the selected engine cannot run the query, Polars falls back to the streaming engine.

optimizations

The optimization passes done during query optimization.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Raises:

TypeError: If the DataFrame contains unsupported data types.
ArrowInvalidError: If the DataFrame contains data types that could not be cast to their primitive type.
TableNotFoundError: If the delta table doesn’t exist and MERGE action is triggered.

Notes

The Polars data types Null and Time are not supported by the delta protocol specification and will raise a TypeError. Columns using the Categorical data type will be converted to normal (non-categorical) strings when written.

Polars columns are always nullable. To write data to a delta table with non-nullable columns, a custom pyarrow schema has to be passed to the delta_write_options. See the last example below.

Examples

Sink a larger-than-memory dataset to a Delta Lake table.

>>> lf = pl.scan_parquet(
...     "/path/to/my_larger_than_ram_file.parquet"
... )  
>>> table_path = "/path/to/delta-table/"
>>> lf.sink_delta(table_path)  

Sink a dataframe to the local filesystem as a Delta Lake table.

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 4, 5],
...         "bar": [6, 7, 8, 9, 10],
...         "ham": ["a", "b", "c", "d", "e"],
...     }
... )
>>> table_path = "/path/to/delta-table/"
>>> df.lazy().sink_delta(table_path)  

Append data to an existing Delta Lake table on the local filesystem. Note that this will fail if the schema of the new data does not match the schema of the existing table.

>>> df.lazy().sink_delta(table_path, mode="append")  

Overwrite a Delta Lake table as a new version. If the schemas of the new and old data are the same, specifying the schema_mode is not required.

>>> existing_table_path = "/path/to/delta-table/"
>>> df.lazy().sink_delta(
...     existing_table_path,
...     mode="overwrite",
...     delta_write_options={"schema_mode": "overwrite"},
... )  

Sink a DataFrame as a Delta Lake table to a cloud object store like S3.

>>> table_path = "s3://bucket/prefix/to/delta-table/"
>>> df.lazy().sink_delta(
...     table_path,
...     storage_options={
...         "AWS_REGION": "THE_AWS_REGION",
...         "AWS_ACCESS_KEY_ID": "THE_AWS_ACCESS_KEY_ID",
...         "AWS_SECRET_ACCESS_KEY": "THE_AWS_SECRET_ACCESS_KEY",
...     },
... )  

Sink a DataFrame as a Delta Lake table with non-nullable columns.

>>> import pyarrow as pa
>>> existing_table_path = "/path/to/delta-table/"
>>> df.lazy().sink_delta(
...     existing_table_path,
...     delta_write_options={
...         "schema": pa.schema([pa.field("foo", pa.int64(), nullable=False)])
...     },
... )  

Sink a DataFrame as a Delta Lake table with zstd compression. For all delta_write_options keyword arguments, check the deltalake docs here, and for Writer Properties in particular here.

>>> import deltalake
>>> df.lazy().sink_delta(
...     table_path,
...     delta_write_options={
...         "writer_properties": deltalake.WriterProperties(compression="zstd"),
...     },
... )  

Merge the DataFrame with an existing Delta Lake table. For all TableMerger methods, check the deltalake docs here.

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 4, 5],
...         "bar": [6, 7, 8, 9, 10],
...         "ham": ["a", "b", "c", "d", "e"],
...     }
... )
>>> table_path = "/path/to/delta-table/"
>>> (
...     df.lazy()
...     .sink_delta(
...         table_path,
...         mode="merge",
...         delta_merge_options={
...             "predicate": "s.foo = t.foo",
...             "source_alias": "s",
...             "target_alias": "t",
...         },
...     )
...     .when_matched_update_all()
...     .when_not_matched_insert_all()
...     .execute()
... )  

sink_iceberg( target: str | pyiceberg.table.Table, *, mode: Literal['append', 'overwrite'], catalog: pyiceberg.catalog.Catalog | polars.io.iceberg.IcebergCatalogConfig | None = None, storage_options: StorageOptionsDict | None = None, ) → DataFrame[source]

Sink a LazyFrame to an Iceberg table.

Warning

This functionality is currently considered unstable. It may be changed at any point without it being considered a breaking change.

Parameters:

target

A PyIceberg Table object, or a ‘namespace.table_name’ identifier string.

mode{‘append’, ‘overwrite’}

How to handle existing data.

If ‘append’, will add new data.
If ‘overwrite’, will replace table with new data.

catalog

PyIceberg catalog to load the table from if the provided target was a table identifier.

storage_options

Extra options for the storage backends supported by pyiceberg. For cloud storages, this may include configurations for authentication etc.

More info is available here.

Returns:

DataFrame: Contains the new metadata path.

sink_ipc( path: str | Path | IO[bytes] | PartitionBy, *, compression: IpcCompression | None = 'uncompressed', compat_level: CompatLevel | None = None, record_batch_size: int | None = None, maintain_order: bool = True, storage_options: StorageOptionsDict | None = None, credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto', retries: int | None = None, sync_on_close: SyncOnCloseMethod | None = None, mkdir: bool = False, lazy: bool = False, engine: EngineType = 'auto', optimizations: QueryOptFlags = (), _record_batch_statistics: bool = False, ) → LazyFrame | None[source]

Evaluate the query in streaming mode and write to an IPC file.

This allows streaming results that are larger than RAM to be written to disk.

Parameters:

path

File path to which the file should be written.

compression{‘uncompressed’, ‘lz4’, ‘zstd’}

Choose “zstd” for good compression performance. Choose “lz4” for fast compression/decompression.

compat_level

Compatibility level to use when exporting Polars data structures. The default compatibility level is recommended for most users. Use pl.CompatLevel.oldest() for the most compatible level. pl.CompatLevel.newest() uses the highest supported compatibility level, but is considered unstable and may change without it being considered a breaking change.

record_batch_size

Size of the record batches in number of rows.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

maintain_order

Maintain the order in which data is processed. Setting this to False will be slightly faster.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

storage_options

Options that indicate how to connect to a cloud provider.

The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

aws
gcp
azure
Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.

If storage_options is not provided, Polars will try to infer the information from environment variables.

credential_provider

Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

retries

Number of retries if accessing a cloud instance fails.

Deprecated since version 1.37.1: Pass {"max_retries": n} via storage_options instead.

sync_on_close: { None, ‘data’, ‘all’ }

Sync to disk before closing a file.

None does not sync.
data syncs the file contents.
all syncs the file contents and metadata.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

mkdir: bool

Recursively create all the directories in the path.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

lazy: bool

Wait to start execution until collect is called.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

engine

Select the engine used to process the query (default "auto"):

"auto": use the engine set by Config.set_engine_affinity or the POLARS_ENGINE_AFFINITY environment variable, falling back to "streaming" if unset.
"in-memory": use the in-memory engine before writing, this is the default engine.
"streaming": use the streaming engine, which processes queries in batches, reducing memory pressure and often outperforming the in-memory engine. This will soon become the default engine of Polars.
"gpu": not currently supported for this sink.

If the selected engine cannot run the query, Polars falls back to the streaming engine.

Note

The GPU engine is currently not supported.

optimizations

The optimization passes done during query optimization.

This has no effect if lazy is set to True.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Returns:

LazyFrame if lazy is True, otherwise None.

See also

PartitionBy

Examples

>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")  
>>> lf.sink_ipc("out.arrow")  

Sink to a BytesIO object.

>>> import io
>>> buf = io.BytesIO()  
>>> pl.LazyFrame({"x": [1, 2, 1]}).sink_ipc(buf)  

Split into a hive-partitioning style partition:

>>> pl.LazyFrame({"x": [1, 2, 1], "y": [3, 4, 5]}).sink_ipc(
...     pl.PartitionBy("./out/", key="x"),
...     mkdir=True
... )  

sink_ndjson( path: str | Path | IO[bytes] | IO[str] | PartitionBy, *, compression: Literal['uncompressed', 'gzip', 'zstd'] = 'uncompressed', compression_level: int | None = None, check_extension: bool = True, maintain_order: bool = True, storage_options: StorageOptionsDict | None = None, credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto', retries: int | None = None, sync_on_close: SyncOnCloseMethod | None = None, mkdir: bool = False, lazy: bool = False, engine: EngineType = 'auto', optimizations: QueryOptFlags = (), ) → LazyFrame | None[source]

Evaluate the query in streaming mode and write to an NDJSON file.

This allows streaming results that are larger than RAM to be written to disk.

Parameters:

path

File path to which the file should be written.

compression

What compression format to use.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

compression_level

The compression level to use, typically 0-9 or None to let the engine choose.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

check_extension

Whether to check if the filename matches the compression settings. Will raise an error if compression is set to ‘uncompressed’ and the filename ends in one of (“.gz”, “.zst”, “.zstd”), or if compression != ‘uncompressed’ and the file uses a mismatched extension. Only applies if path is a file path.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

maintain_order

Maintain the order in which data is processed. Setting this to False will be slightly faster.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

storage_options

Options that indicate how to connect to a cloud provider.

The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

aws
gcp
azure
Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.

If storage_options is not provided, Polars will try to infer the information from environment variables.

credential_provider

Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

retries

Number of retries if accessing a cloud instance fails.

Deprecated since version 1.37.1: Pass {“max_retries”: n} via storage_options instead.

sync_on_close: { None, ‘data’, ‘all’ }

Sync to disk before closing a file.

None does not sync.
data syncs the file contents.
all syncs the file contents and metadata.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

mkdir: bool

Recursively create all the directories in the path.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

lazy: bool

Wait to start execution until collect is called.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

engine

Select the engine used to process the query (default "auto"):

"auto": use the engine set by Config.set_engine_affinity or the POLARS_ENGINE_AFFINITY environment variable, falling back to "streaming" if unset.
"in-memory": use the in-memory engine before writing, this is the default engine.
"streaming": use the streaming engine, which processes queries in batches, reducing memory pressure and often outperforming the in-memory engine. This will soon become the default engine of Polars.
"gpu": use the CUDA GPU engine (requires an Nvidia GPU and cudf-polars). Pass a GPUEngine object for fine-grained control.

If the selected engine cannot run the query, Polars falls back to the streaming engine.

optimizations

The optimization passes done during query optimization.

This has no effect if lazy is set to True.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Returns:

LazyFrame if lazy is True, otherwise None.

See also

PartitionBy

Examples

>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")  
>>> lf.sink_ndjson("out.ndjson")  

Sink to a BytesIO object.

>>> import io
>>> buf = io.BytesIO()  
>>> pl.LazyFrame({"x": [1, 2, 1]}).sink_ndjson(buf)  

Split into a hive-partitioning style partition:

>>> pl.LazyFrame({"x": [1, 2, 1], "y": [3, 4, 5]}).sink_ndjson(
...     pl.PartitionBy("./out/", key="x"),
...     mkdir=True
... )  

sink_parquet( path: str | Path | IO[bytes] | PartitionBy, *, compression: str = 'zstd', compression_level: int | None = None, statistics: bool | str | dict[str, bool] = True, row_group_size: int | None = None, data_page_size: int | None = None, maintain_order: bool = True, storage_options: StorageOptionsDict | None = None, credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto', retries: int | None = None, sync_on_close: SyncOnCloseMethod | None = None, metadata: ParquetMetadata | None = None, arrow_schema: ArrowSchemaExportable | None = None, mkdir: bool = False, lazy: bool = False, engine: EngineType = 'auto', optimizations: QueryOptFlags = (), _sinked_paths_callback: SinkedPathsCallback | None = None, ) → LazyFrame | None[source]

Evaluate the query in streaming mode and write to a Parquet file.

This allows streaming results that are larger than RAM to be written to disk.

Parameters:

path

File path to which the file should be written.

compression{‘lz4’, ‘uncompressed’, ‘snappy’, ‘gzip’, ‘brotli’, ‘zstd’}

Choose “zstd” for good compression performance. Choose “lz4” for fast compression/decompression. Choose “snappy” for more backwards compatibility guarantees when you deal with older parquet readers.

compression_level

The level of compression to use. Higher compression means smaller files on disk.

“gzip” : min-level: 0, max-level: 9, default: 6.
“brotli” : min-level: 0, max-level: 11, default: 1.
“zstd” : min-level: 1, max-level: 22, default: 3.

statistics

Write statistics to the parquet headers. This is the default behavior.

Possible values:

True: enable default set of statistics (default). Some statistics may be disabled.
False: disable all statistics
“full”: calculate and write all available statistics.
{ "statistic-key": True / False, ... }. Available keys:
- “min”: column minimum value (default: True)
- “max”: column maximum value (default: True)
- “distinct_count”: number of unique column values (default: False)
- “null_count”: number of null values in column (default: True)

row_group_size

Size of the row groups in number of rows. If None (default), the chunks of the DataFrame are used. Writing in smaller chunks may reduce memory pressure and improve writing speeds.

data_page_size

Size limit of individual data pages. If not set, defaults to 1024 * 1024 bytes.

maintain_order

Maintain the order in which data is processed. Setting this to False will be slightly faster.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

storage_options

Options that indicate how to connect to a cloud provider.

The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

aws
gcp
azure
Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.

If storage_options is not provided, Polars will try to infer the information from environment variables.

credential_provider

Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

retries

Number of retries if accessing a cloud instance fails.

Deprecated since version 1.37.1: Pass {“max_retries”: n} via storage_options instead.

sync_on_close: { None, ‘data’, ‘all’ }

Sync to disk before closing a file.

None does not sync.
data syncs the file contents.
all syncs the file contents and metadata.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

metadata

A dictionary or callback to add key-values to the file-level Parquet metadata.

Warning

This functionality is considered experimental. It may be removed or changed at any point without it being considered a breaking change.

arrow_schema

Provide a custom arrow schema to write to the file. This allows setting custom schema and field-level metadata. Names and dtypes must match.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

mkdir: bool

Recursively create all the directories in the path.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

lazy: bool

Wait to start execution until collect is called.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

engine

Select the engine used to process the query (default "auto"):

"auto": use the engine set by Config.set_engine_affinity or the POLARS_ENGINE_AFFINITY environment variable, falling back to "streaming" if unset.
"in-memory": use the in-memory engine before writing, this is the default engine.
"streaming": use the streaming engine, which processes queries in batches, reducing memory pressure and often outperforming the in-memory engine. This will soon become the default engine of Polars.
"gpu": use the CUDA GPU engine (requires an Nvidia GPU and cudf-polars). Pass a GPUEngine object for fine-grained control.

If the selected engine cannot run the query, Polars falls back to the streaming engine.

optimizations

The optimization passes done during query optimization.

This has no effect if lazy is set to True.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Returns:

LazyFrame if lazy is True, otherwise None.

See also

PartitionBy

Examples

>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")  
>>> lf.sink_parquet("out.parquet")  

Sink to a BytesIO object.

>>> import io
>>> buf = io.BytesIO()  
>>> pl.LazyFrame({"x": [1, 2, 1]}).sink_parquet(buf)  

Split into a hive-partitioning style partition:

>>> pl.LazyFrame({"x": [1, 2, 1], "y": [3, 4, 5]}).sink_parquet(
...     pl.PartitionBy("./out/", key="x"),
...     mkdir=True
... )  

slice( offset: int, length: int | None = None, ) → LazyFrame[source]

Get a slice of this LazyFrame.

Parameters:

offset: Start index. Negative indexing is supported.
length: Length of the slice. If set to None, all rows starting at the offset will be selected.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": ["x", "y", "z"],
...         "b": [1, 3, 5],
...         "c": [2, 4, 6],
...     }
... )
>>> lf.slice(1, 2).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ y   ┆ 3   ┆ 4   │
│ z   ┆ 5   ┆ 6   │
└─────┴─────┴─────┘

sort( by: IntoExpr | Iterable[IntoExpr], *more_by: IntoExpr, descending: bool | Sequence[bool] = False, nulls_last: bool | Sequence[bool] = False, maintain_order: bool = False, multithreaded: bool = True, ) → LazyFrame[source]

Sort the LazyFrame by the given columns.

Parameters:

by: Column(s) to sort by. Accepts expression input, including selectors. Strings are parsed as column names.
*more_by: Additional columns to sort by, specified as positional arguments.
descending: Sort in descending order. When sorting by multiple columns, can be specified per column by passing a sequence of booleans.
nulls_last: Place null values last; can specify a single boolean applying to all columns or a sequence of booleans for per-column control.
maintain_order: Whether the order should be maintained if elements are equal. Note that if True, streaming is not possible and performance might be worse, since this requires a stable sort.
multithreaded: Sort using multiple threads.

Examples

Pass a single column name to sort by that column.

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, None],
...         "b": [6.0, 5.0, 4.0],
...         "c": ["a", "c", "b"],
...     }
... )
>>> lf.sort("a").collect()
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ null ┆ 4.0 ┆ b   │
│ 1    ┆ 6.0 ┆ a   │
│ 2    ┆ 5.0 ┆ c   │
└──────┴─────┴─────┘

Sorting by expressions is also supported.

>>> lf.sort(pl.col("a") + pl.col("b") * 2, nulls_last=True).collect()
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ 2    ┆ 5.0 ┆ c   │
│ 1    ┆ 6.0 ┆ a   │
│ null ┆ 4.0 ┆ b   │
└──────┴─────┴─────┘

Sort by multiple columns by passing a list of columns.

>>> lf.sort(["c", "a"], descending=True).collect()
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ 2    ┆ 5.0 ┆ c   │
│ null ┆ 4.0 ┆ b   │
│ 1    ┆ 6.0 ┆ a   │
└──────┴─────┴─────┘

Or use positional arguments to sort by multiple columns in the same way.

>>> lf.sort("c", "a", descending=[False, True]).collect()
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ 1    ┆ 6.0 ┆ a   │
│ null ┆ 4.0 ┆ b   │
│ 2    ┆ 5.0 ┆ c   │
└──────┴─────┴─────┘

sql( query: str, *, table_name: str = 'self', ) → LazyFrame[source]

Execute a SQL query against the LazyFrame.

Added in version 0.20.23.

Warning

This functionality is considered unstable, although it is close to being considered stable. It may be changed at any point without it being considered a breaking change.

Parameters:

query: SQL query to execute.
table_name: Optionally provide an explicit name for the table that represents the calling frame (defaults to “self”).

See also

SQLContext

Notes

The calling LazyFrame is automatically registered as a table in the SQLContext under the name “self”. If you want access to the DataFrames and LazyFrames found in the current globals, use the top-level pl.sql.
More control over registration and execution behaviour is available by using the SQLContext object.

Examples

>>> lf1 = pl.LazyFrame({"a": [1, 2, 3], "b": [6, 7, 8], "c": ["z", "y", "x"]})
>>> lf2 = pl.LazyFrame({"a": [3, 2, 1], "d": [125, -654, 888]})

Query the LazyFrame using SQL:

>>> lf1.sql("SELECT c, b FROM self WHERE a > 1").collect()
shape: (2, 2)
┌─────┬─────┐
│ c   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ y   ┆ 7   │
│ x   ┆ 8   │
└─────┴─────┘

Apply SQL transforms (aliasing “self” to “frame”) then filter natively (you can freely mix SQL and native operations):

>>> lf1.sql(
...     query='''
...         SELECT
...             a,
...             (a % 2 == 0) AS a_is_even,
...             (b::float4 / 2) AS "b/2",
...             CONCAT_WS(':', c, c, c) AS c_c_c
...         FROM frame
...         ORDER BY a
...     ''',
...     table_name="frame",
... ).filter(~pl.col("c_c_c").str.starts_with("x")).collect()
shape: (2, 4)
┌─────┬───────────┬─────┬───────┐
│ a   ┆ a_is_even ┆ b/2 ┆ c_c_c │
│ --- ┆ ---       ┆ --- ┆ ---   │
│ i64 ┆ bool      ┆ f32 ┆ str   │
╞═════╪═══════════╪═════╪═══════╡
│ 1   ┆ false     ┆ 3.0 ┆ z:z:z │
│ 2   ┆ true      ┆ 3.5 ┆ y:y:y │
└─────┴───────────┴─────┴───────┘

std(ddof: int = 1) → LazyFrame[source]

Aggregate the columns in the LazyFrame to their standard deviation value.

Parameters:

ddof: “Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.std().collect()
shape: (1, 2)
┌──────────┬─────┐
│ a        ┆ b   │
│ ---      ┆ --- │
│ f64      ┆ f64 │
╞══════════╪═════╡
│ 1.290994 ┆ 0.5 │
└──────────┴─────┘
>>> lf.std(ddof=0).collect()
shape: (1, 2)
┌──────────┬──────────┐
│ a        ┆ b        │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
│ 1.118034 ┆ 0.433013 │
└──────────┴──────────┘
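
The role of `ddof` can be checked by hand. This is a plain-Python sketch of the formula described in the parameter text (divisor N - ddof), not the Polars implementation; `std_with_ddof` is a hypothetical helper name.

```python
import math


def std_with_ddof(values, ddof=1):
    """Standard deviation using the divisor N - ddof."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - ddof))


# Column "a" from the example above.
sample_std = std_with_ddof([1, 2, 3, 4])              # ddof=1
population_std = std_with_ddof([1, 2, 3, 4], ddof=0)  # ddof=0

print(round(sample_std, 6), round(population_std, 6))
```

The two values agree with the `a` column in the outputs above (1.290994 and 1.118034).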

sum() → LazyFrame[source]

Aggregate the columns in the LazyFrame to their sum value.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.sum().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 10  ┆ 5   │
└─────┴─────┘

tail(n: int = 5) → LazyFrame[source]

Get the last n rows.

Parameters:

n: Number of rows to return.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [7, 8, 9, 10, 11, 12],
...     }
... )
>>> lf.tail().collect()
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 8   │
│ 3   ┆ 9   │
│ 4   ┆ 10  │
│ 5   ┆ 11  │
│ 6   ┆ 12  │
└─────┴─────┘
>>> lf.tail(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 5   ┆ 11  │
│ 6   ┆ 12  │
└─────┴─────┘

top_k( k: int, *, by: IntoExpr | Iterable[IntoExpr], reverse: bool | Sequence[bool] = False, ) → LazyFrame[source]

Return the k largest rows.

Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort() after this function if you wish the output to be sorted.

Changed in version 1.0.0: The descending parameter was renamed reverse.

Parameters:

k: Number of rows to return.
by: Column(s) used to determine the top rows. Accepts expression input. Strings are parsed as column names.
reverse: Consider the k smallest elements of the by column(s) (instead of the k largest). This can be specified per column by passing a sequence of booleans.

See also

bottom_k

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [2, 1, 1, 3, 2, 1],
...     }
... )

Get the rows which contain the 4 largest values in column b.

>>> lf.top_k(4, by="b").collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ b   ┆ 3   │
│ a   ┆ 2   │
│ b   ┆ 2   │
│ b   ┆ 1   │
└─────┴─────┘

Get the rows which contain the 4 largest values when sorting on column b and a.

>>> lf.top_k(4, by=["b", "a"]).collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ b   ┆ 3   │
│ b   ┆ 2   │
│ a   ┆ 2   │
│ c   ┆ 1   │
└─────┴─────┘

unique( subset: IntoExpr | Collection[IntoExpr] | None = None, *, keep: UniqueKeepStrategy = 'any', maintain_order: bool = False, ) → LazyFrame[source]

Drop duplicate rows from this LazyFrame.

Parameters:

subset

Column name(s), selector(s), or expressions to consider when identifying duplicate rows. If set to None (default), all columns are considered.

keep{‘first’, ‘last’, ‘any’, ‘none’}

Which of the duplicate rows to keep.

‘any’: Does not give any guarantee of which row is kept. This allows more optimizations.
‘none’: Don’t keep duplicate rows.
‘first’: Keep the first unique row.
‘last’: Keep the last unique row.

maintain_order

Keep the same order as the original LazyFrame. This is more expensive to compute. Setting this to True blocks the possibility to run on the streaming engine.

Returns:

LazyFrame: LazyFrame with unique rows.

Notes

If you’re coming from pandas, this is similar to pandas.DataFrame.drop_duplicates.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3, 1, 1],
...         "bar": ["a", "a", "a", "x", "x"],
...         "ham": ["b", "b", "b", "y", "y"],
...     }
... )

By default, all columns are considered when determining which rows are unique:

>>> lf.unique(maintain_order=True).collect()
shape: (4, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ a   ┆ b   │
│ 2   ┆ a   ┆ b   │
│ 3   ┆ a   ┆ b   │
│ 1   ┆ x   ┆ y   │
└─────┴─────┴─────┘

We can also consider only a subset of columns when determining uniqueness, controlling which row we keep when duplicates are found:

>>> lf.unique(subset="foo", keep="first", maintain_order=True).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ a   ┆ b   │
│ 2   ┆ a   ┆ b   │
│ 3   ┆ a   ┆ b   │
└─────┴─────┴─────┘
>>> lf.unique(subset="foo", keep="last", maintain_order=True).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 2   ┆ a   ┆ b   │
│ 3   ┆ a   ┆ b   │
│ 1   ┆ x   ┆ y   │
└─────┴─────┴─────┘
>>> lf.unique(subset="foo", keep="none", maintain_order=True).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 2   ┆ a   ┆ b   │
│ 3   ┆ a   ┆ b   │
└─────┴─────┴─────┘

Selectors can be used to define the “subset” parameter:

>>> import polars.selectors as cs
>>> lf.unique(subset=cs.string(), maintain_order=True).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ a   ┆ b   │
│ 1   ┆ x   ┆ y   │
└─────┴─────┴─────┘

We can also use an arbitrary expression in the “subset” parameter; in this example we use the part of the label in front of “:” to determine uniqueness:

>>> lf = pl.LazyFrame(
...     {
...         "label": ["xx:1", "xx:2", "yy:3", "yy:4"],
...         "value": [100, 200, 300, 400],
...     }
... )
>>> lf.unique(
...     subset=pl.col("label").str.extract(r"^(\w+):"),
...     maintain_order=True,
...     keep="first",
... ).collect()
shape: (2, 2)
┌───────┬───────┐
│ label ┆ value │
│ ---   ┆ ---   │
│ str   ┆ i64   │
╞═══════╪═══════╡
│ xx:1  ┆ 100   │
│ yy:3  ┆ 300   │
└───────┴───────┘

unnest( columns: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None, *more_columns: ColumnNameOrSelector, separator: str | None = None, ) → LazyFrame[source]

Decompose struct columns into separate columns for each of their fields.

The new columns will be inserted into the LazyFrame at the location of the struct column.

If no columns are provided, all struct columns are unnested.

Parameters:

columns: Name of the struct column(s) that should be unnested.
*more_columns: Additional columns to unnest, specified as positional arguments.
separator: Rename output column names as combination of the struct column name, name separator and field name.

Examples

>>> df = pl.LazyFrame(
...     {
...         "before": ["foo", "bar"],
...         "t_a": [1, 2],
...         "t_b": ["a", "b"],
...         "t_c": [True, None],
...         "t_d": [[1, 2], [3]],
...         "after": ["baz", "womp"],
...     }
... ).select("before", pl.struct(pl.col("^t_.$")).alias("t_struct"), "after")
>>> df.collect()
shape: (2, 3)
┌────────┬─────────────────────┬───────┐
│ before ┆ t_struct            ┆ after │
│ ---    ┆ ---                 ┆ ---   │
│ str    ┆ struct[4]           ┆ str   │
╞════════╪═════════════════════╪═══════╡
│ foo    ┆ {1,"a",true,[1, 2]} ┆ baz   │
│ bar    ┆ {2,"b",null,[3]}    ┆ womp  │
└────────┴─────────────────────┴───────┘
>>> df.unnest("t_struct").collect()
shape: (2, 6)
┌────────┬─────┬─────┬──────┬───────────┬───────┐
│ before ┆ t_a ┆ t_b ┆ t_c  ┆ t_d       ┆ after │
│ ---    ┆ --- ┆ --- ┆ ---  ┆ ---       ┆ ---   │
│ str    ┆ i64 ┆ str ┆ bool ┆ list[i64] ┆ str   │
╞════════╪═════╪═════╪══════╪═══════════╪═══════╡
│ foo    ┆ 1   ┆ a   ┆ true ┆ [1, 2]    ┆ baz   │
│ bar    ┆ 2   ┆ b   ┆ null ┆ [3]       ┆ womp  │
└────────┴─────┴─────┴──────┴───────────┴───────┘

Unnest all struct columns by calling without arguments:

>>> df.unnest().collect()
shape: (2, 6)
┌────────┬─────┬─────┬──────┬───────────┬───────┐
│ before ┆ t_a ┆ t_b ┆ t_c  ┆ t_d       ┆ after │
│ ---    ┆ --- ┆ --- ┆ ---  ┆ ---       ┆ ---   │
│ str    ┆ i64 ┆ str ┆ bool ┆ list[i64] ┆ str   │
╞════════╪═════╪═════╪══════╪═══════════╪═══════╡
│ foo    ┆ 1   ┆ a   ┆ true ┆ [1, 2]    ┆ baz   │
│ bar    ┆ 2   ┆ b   ┆ null ┆ [3]       ┆ womp  │
└────────┴─────┴─────┴──────┴───────────┴───────┘

>>> df = pl.LazyFrame(
...     {
...         "before": ["foo", "bar"],
...         "t_a": [1, 2],
...         "t_b": ["a", "b"],
...         "t_c": [True, None],
...         "t_d": [[1, 2], [3]],
...         "after": ["baz", "womp"],
...     }
... ).select(
...     "before",
...     pl.struct(pl.col("^t_.$").name.map(lambda t: t[2:])).alias("t"),
...     "after",
... )
>>> df.unnest("t", separator="::").collect()
shape: (2, 6)
┌────────┬──────┬──────┬──────┬───────────┬───────┐
│ before ┆ t::a ┆ t::b ┆ t::c ┆ t::d      ┆ after │
│ ---    ┆ ---  ┆ ---  ┆ ---  ┆ ---       ┆ ---   │
│ str    ┆ i64  ┆ str  ┆ bool ┆ list[i64] ┆ str   │
╞════════╪══════╪══════╪══════╪═══════════╪═══════╡
│ foo    ┆ 1    ┆ a    ┆ true ┆ [1, 2]    ┆ baz   │
│ bar    ┆ 2    ┆ b    ┆ null ┆ [3]       ┆ womp  │
└────────┴──────┴──────┴──────┴───────────┴───────┘

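unpivot( on: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None, *, index: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None, variable_name: str | None = None, value_name: str | None = None, streamable: bool = True, ) → LazyFrame[source]

Unpivot a LazyFrame from wide to long format.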

Optionally leaves identifiers set.

This function is useful to massage a LazyFrame into a format where one or more columns are identifier variables (index) while all other columns, considered measured variables (on), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.

Parameters:

on: Column(s) or selector(s) to use as value variables; if on is empty, no columns will be used. If set to None (default), all columns that are not in index will be used.
index: Column(s) or selector(s) to use as identifier variables.
variable_name: Name to give to the variable column. Defaults to “variable”
value_name: Name to give to the value column. Defaults to “value”
streamable: deprecated

Notes

If you’re coming from pandas, this is similar to pandas.DataFrame.melt, but with index replacing id_vars and on replacing value_vars. In other frameworks, you might know this operation as pivot_longer.

The resulting row order is unspecified.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": ["x", "y", "z"],
...         "b": [1, 3, 5],
...         "c": [2, 4, 6],
...     }
... )
>>> import polars.selectors as cs
>>> lf.unpivot(cs.numeric(), index="a").collect()
shape: (6, 3)
┌─────┬──────────┬───────┐
│ a   ┆ variable ┆ value │
│ --- ┆ ---      ┆ ---   │
│ str ┆ str      ┆ i64   │
╞═════╪══════════╪═══════╡
│ x   ┆ b        ┆ 1     │
│ y   ┆ b        ┆ 3     │
│ z   ┆ b        ┆ 5     │
│ x   ┆ c        ┆ 2     │
│ y   ┆ c        ┆ 4     │
│ z   ┆ c        ┆ 6     │
└─────┴──────────┴───────┘

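update( other: LazyFrame, on: str | Sequence[str] | None = None, how: Literal['left', 'inner', 'full'] = 'left', *, left_on: str | Sequence[str] | None = None, right_on: str | Sequence[str] | None = None, include_nulls: bool = False, maintain_order: MaintainOrderJoin | None = 'left', ) → LazyFrame[source]

Update the values in this LazyFrame with the values in other.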

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Parameters:

other

LazyFrame that will be used to update the values

on

Column names that will be joined on. If set to None (default), the implicit row index of each frame is used as a join key.

how{‘left’, ‘inner’, ‘full’}

‘left’ will keep all rows from the left table; rows may be duplicated if multiple rows in the right frame match the left row’s key.
‘inner’ keeps only those rows where the key exists in both frames.
‘full’ will update existing rows where the key matches while also adding any new rows contained in the given frame.

left_on

Join column(s) of the left DataFrame.

right_on

Join column(s) of the right DataFrame.

include_nulls

Overwrite values in the left frame with null values from the right frame. If set to False (default), null values in the right frame are ignored.

maintain_order{‘none’, ‘left’, ‘right’, ‘left_right’, ‘right_left’}

Which order of rows from the inputs to preserve. See join() for details. Unlike join this function preserves the left order by default.

Notes

This is syntactic sugar for a left/inner join that preserves the order of the left DataFrame by default, with an optional coalesce when include_nulls = False.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "A": [1, 2, 3, 4],
...         "B": [400, 500, 600, 700],
...     }
... )
>>> lf.collect()
shape: (4, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 400 │
│ 2   ┆ 500 │
│ 3   ┆ 600 │
│ 4   ┆ 700 │
└─────┴─────┘
>>> new_lf = pl.LazyFrame(
...     {
...         "B": [-66, None, -99],
...         "C": [5, 3, 1],
...     }
... )

Update lf values with the non-null values in new_lf, by row index:

>>> lf.update(new_lf).collect()
shape: (4, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ -66 │
│ 2   ┆ 500 │
│ 3   ┆ -99 │
│ 4   ┆ 700 │
└─────┴─────┘

Update lf values with the non-null values in new_lf, by row index, but only keeping those rows that are common to both frames:

>>> lf.update(new_lf, how="inner").collect()
shape: (3, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ -66 │
│ 2   ┆ 500 │
│ 3   ┆ -99 │
└─────┴─────┘

Update lf values with the non-null values in new_lf, using a full outer join strategy that defines explicit join columns in each frame:

>>> lf.update(new_lf, left_on=["A"], right_on=["C"], how="full").collect()
shape: (5, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ -99 │
│ 2   ┆ 500 │
│ 3   ┆ 600 │
│ 4   ┆ 700 │
│ 5   ┆ -66 │
└─────┴─────┘

Update lf values including null values in new_lf, using a full outer join strategy that defines explicit join columns in each frame:

>>> lf.update(
...     new_lf, left_on="A", right_on="C", how="full", include_nulls=True
... ).collect()
shape: (5, 2)
┌─────┬──────┐
│ A   ┆ B    │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 1   ┆ -99  │
│ 2   ┆ 500  │
│ 3   ┆ null │
│ 4   ┆ 700  │
│ 5   ┆ -66  │
└─────┴──────┘

var(ddof: int = 1) → LazyFrame[source]

Aggregate the columns in the LazyFrame to their variance value.

Parameters:

ddof: “Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.var().collect()
shape: (1, 2)
┌──────────┬──────┐
│ a        ┆ b    │
│ ---      ┆ ---  │
│ f64      ┆ f64  │
╞══════════╪══════╡
│ 1.666667 ┆ 0.25 │
└──────────┴──────┘
>>> lf.var(ddof=0).collect()
shape: (1, 2)
┌──────┬────────┐
│ a    ┆ b      │
│ ---  ┆ ---    │
│ f64  ┆ f64    │
╞══════╪════════╡
│ 1.25 ┆ 0.1875 │
└──────┴────────┘

property width: int[source]

Get the number of columns.

Returns:

int

Warning

Determining the width of a LazyFrame requires resolving its schema, which is a potentially expensive operation. Using collect_schema() is the idiomatic way to resolve the schema. This property exists only for symmetry with the DataFrame class.

See also

collect_schema
Schema.len

Examples

>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [4, 5, 6],
...     }
... )
>>> lf.width
2

with_columns( *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr, ) → LazyFrame[source]

Add columns to this LazyFrame.

Added columns will replace existing columns with the same name.

Parameters:

*exprs: Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
**named_exprs: Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:

LazyFrame: A new LazyFrame with the columns added.

Notes

Creating a new LazyFrame using this method does not create a new copy of existing data.

Examples

Pass an expression to add it as a new column.

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [0.5, 4, 10, 13],
...         "c": [True, True, False, True],
...     }
... )
>>> lf.with_columns((pl.col("a") ** 2).alias("a^2")).collect()
shape: (4, 4)
┌─────┬──────┬───────┬─────┐
│ a   ┆ b    ┆ c     ┆ a^2 │
│ --- ┆ ---  ┆ ---   ┆ --- │
│ i64 ┆ f64  ┆ bool  ┆ i64 │
╞═════╪══════╪═══════╪═════╡
│ 1   ┆ 0.5  ┆ true  ┆ 1   │
│ 2   ┆ 4.0  ┆ true  ┆ 4   │
│ 3   ┆ 10.0 ┆ false ┆ 9   │
│ 4   ┆ 13.0 ┆ true  ┆ 16  │
└─────┴──────┴───────┴─────┘

Added columns will replace existing columns with the same name.

>>> lf.with_columns(pl.col("a").cast(pl.Float64)).collect()
shape: (4, 3)
┌─────┬──────┬───────┐
│ a   ┆ b    ┆ c     │
│ --- ┆ ---  ┆ ---   │
│ f64 ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╡
│ 1.0 ┆ 0.5  ┆ true  │
│ 2.0 ┆ 4.0  ┆ true  │
│ 3.0 ┆ 10.0 ┆ false │
│ 4.0 ┆ 13.0 ┆ true  │
└─────┴──────┴───────┘

Multiple columns can be added using positional arguments.

>>> lf.with_columns(
...     (pl.col("a") ** 2).alias("a^2"),
...     (pl.col("b") / 2).alias("b/2"),
...     (pl.col("c").not_()).alias("not c"),
... ).collect()
shape: (4, 6)
┌─────┬──────┬───────┬─────┬──────┬───────┐
│ a   ┆ b    ┆ c     ┆ a^2 ┆ b/2  ┆ not c │
│ --- ┆ ---  ┆ ---   ┆ --- ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  ┆ i64 ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╪═════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 1   ┆ 0.25 ┆ false │
│ 2   ┆ 4.0  ┆ true  ┆ 4   ┆ 2.0  ┆ false │
│ 3   ┆ 10.0 ┆ false ┆ 9   ┆ 5.0  ┆ true  │
│ 4   ┆ 13.0 ┆ true  ┆ 16  ┆ 6.5  ┆ false │
└─────┴──────┴───────┴─────┴──────┴───────┘

Multiple columns can also be added by passing a list of expressions.

>>> lf.with_columns(
...     [
...         (pl.col("a") ** 2).alias("a^2"),
...         (pl.col("b") / 2).alias("b/2"),
...         (pl.col("c").not_()).alias("not c"),
...     ]
... ).collect()
shape: (4, 6)
┌─────┬──────┬───────┬─────┬──────┬───────┐
│ a   ┆ b    ┆ c     ┆ a^2 ┆ b/2  ┆ not c │
│ --- ┆ ---  ┆ ---   ┆ --- ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  ┆ i64 ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╪═════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 1   ┆ 0.25 ┆ false │
│ 2   ┆ 4.0  ┆ true  ┆ 4   ┆ 2.0  ┆ false │
│ 3   ┆ 10.0 ┆ false ┆ 9   ┆ 5.0  ┆ true  │
│ 4   ┆ 13.0 ┆ true  ┆ 16  ┆ 6.5  ┆ false │
└─────┴──────┴───────┴─────┴──────┴───────┘

Use keyword arguments to easily name your expression inputs.

>>> lf.with_columns(
...     ab=pl.col("a") * pl.col("b"),
...     not_c=pl.col("c").not_(),
... ).collect()
shape: (4, 5)
┌─────┬──────┬───────┬──────┬───────┐
│ a   ┆ b    ┆ c     ┆ ab   ┆ not_c │
│ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 0.5  ┆ false │
│ 2   ┆ 4.0  ┆ true  ┆ 8.0  ┆ false │
│ 3   ┆ 10.0 ┆ false ┆ 30.0 ┆ true  │
│ 4   ┆ 13.0 ┆ true  ┆ 52.0 ┆ false │
└─────┴──────┴───────┴──────┴───────┘

with_columns_seq( *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr, ) → LazyFrame[source]

Add columns to this LazyFrame.

Added columns will replace existing columns with the same name.

This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.

Parameters:

*exprs: Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
**named_exprs: Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:

LazyFrame: A new LazyFrame with the columns added.

See also

with_columns

with_context(other: Self | list[Self]) → LazyFrame[source]

Add an external context to the computation graph.

Deprecated since version 1.0.0: Use concat() with how='horizontal' instead.

This allows expressions to also access columns from DataFrames that are not part of this one.

Parameters:

other: LazyFrame (or list of LazyFrames) to add as context.

Examples

>>> lf = pl.LazyFrame({"a": [1, 2, 3], "b": ["a", "c", None]})
>>> lf_other = pl.LazyFrame({"c": ["foo", "ham"]})
>>> lf.with_context(lf_other).select(
...     pl.col("b") + pl.col("c").first()
... ).collect()
shape: (3, 1)
┌──────┐
│ b    │
│ ---  │
│ str  │
╞══════╡
│ afoo │
│ cfoo │
│ null │
└──────┘

Fill nulls with the median from another DataFrame:

>>> train_lf = pl.LazyFrame(
...     {"feature_0": [-1.0, 0, 1], "feature_1": [-1.0, 0, 1]}
... )
>>> test_lf = pl.LazyFrame(
...     {"feature_0": [-1.0, None, 1], "feature_1": [-1.0, 0, 1]}
... )
>>> test_lf.with_context(
...     train_lf.select(pl.all().name.suffix("_train"))
... ).select(
...     pl.col("feature_0").fill_null(pl.col("feature_0_train").median())
... ).collect()
shape: (3, 1)
┌───────────┐
│ feature_0 │
│ ---       │
│ f64       │
╞═══════════╡
│ -1.0      │
│ 0.0       │
│ 1.0       │
└───────────┘

with_row_count( name: str = 'row_nr', offset: int = 0, ) → LazyFrame[source]

Add a column at index 0 that counts the rows.

Deprecated since version 0.20.4: Use the with_row_index() method instead. Note that the default column name has changed from ‘row_nr’ to ‘index’.

Parameters:

name: Name of the column to add.
offset: Start the row count at this offset.

Warning

This can have a negative effect on query performance. This may, for instance, block predicate pushdown optimization.

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> lf.with_row_count().collect()
shape: (3, 3)
┌────────┬─────┬─────┐
│ row_nr ┆ a   ┆ b   │
│ ---    ┆ --- ┆ --- │
│ u32    ┆ i64 ┆ i64 │
╞════════╪═════╪═════╡
│ 0      ┆ 1   ┆ 2   │
│ 1      ┆ 3   ┆ 4   │
│ 2      ┆ 5   ┆ 6   │
└────────┴─────┴─────┘

with_row_index( name: str = 'index', offset: int = 0, ) → LazyFrame[source]

Add a row index as the first column in the LazyFrame.

Parameters:

name: Name of the index column.
offset: Start the index at this offset. Cannot be negative.

Warning

Using this function can have a negative effect on query performance. This may, for instance, block predicate pushdown optimization.

Notes

The resulting column does not have any special properties. It is a regular column of type UInt32 (or UInt64 in polars[rt64]).

Examples

>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> lf.with_row_index().collect()
shape: (3, 3)
┌───────┬─────┬─────┐
│ index ┆ a   ┆ b   │
│ ---   ┆ --- ┆ --- │
│ u32   ┆ i64 ┆ i64 │
╞═══════╪═════╪═════╡
│ 0     ┆ 1   ┆ 2   │
│ 1     ┆ 3   ┆ 4   │
│ 2     ┆ 5   ┆ 6   │
└───────┴─────┴─────┘
>>> lf.with_row_index("id", offset=1000).collect()
shape: (3, 3)
┌──────┬─────┬─────┐
│ id   ┆ a   ┆ b   │
│ ---  ┆ --- ┆ --- │
│ u32  ┆ i64 ┆ i64 │
╞══════╪═════╪═════╡
│ 1000 ┆ 1   ┆ 2   │
│ 1001 ┆ 3   ┆ 4   │
│ 1002 ┆ 5   ┆ 6   │
└──────┴─────┴─────┘

An index column can also be created using the expressions int_range() and len().

>>> lf.select(
...     pl.int_range(pl.len(), dtype=pl.UInt32).alias("index"),
...     pl.all(),
... ).collect()
shape: (3, 3)
┌───────┬─────┬─────┐
│ index ┆ a   ┆ b   │
│ ---   ┆ --- ┆ --- │
│ u32   ┆ i64 ┆ i64 │
╞═══════╪═════╪═════╡
│ 0     ┆ 1   ┆ 2   │
│ 1     ┆ 3   ┆ 4   │
│ 2     ┆ 5   ┆ 6   │
└───────┴─────┴─────┘