LazyFrame#
This page gives an overview of all public LazyFrame methods.
- class polars.LazyFrame(
- data: FrameInitTypes | None = None,
- schema: SchemaDefinition | None = None,
- *,
- schema_overrides: SchemaDict | None = None,
- strict: bool = True,
- orient: Orientation | None = None,
- infer_schema_length: int | None = 100,
- nan_to_null: bool = False,
Representation of a Lazy computation graph/query against a DataFrame.
This allows for whole-query optimisation in addition to parallelism, and is the preferred (and highest-performance) mode of operation for polars.
- Parameters:
- data : dict, Sequence, ndarray, Series, or pandas.DataFrame
Two-dimensional data in various forms; dict input must contain Sequences, Generators, or a range. Sequence may contain Series or other Sequences.
- schema : Sequence of str, (str,DataType) pairs, or a {str:DataType,} dict
The LazyFrame schema may be declared in several ways:
As a dict of {name:type} pairs; if type is None, it will be auto-inferred.
As a list of column names; in this case types are automatically inferred.
As a list of (name,type) pairs; this is equivalent to the dictionary form.
If you supply a list of column names that does not match the names in the underlying data, the names given here will overwrite them. The number of names given in the schema should match the underlying data dimensions.
- schema_overrides : dict, default None
Support type specification or override of one or more columns; note that any dtypes inferred from the schema param will be overridden.
The number of entries in the schema should match the underlying data dimensions, unless a sequence of dictionaries is being passed, in which case a partial schema can be declared to prevent specific fields from being loaded.
- strict : bool, default True
Throw an error if any data value does not exactly match the given or inferred data type for that column. If set to False, values that do not match the data type are cast to that data type or, if casting is not possible, set to null instead.
- orient : {'col', 'row'}, default None
Whether to interpret two-dimensional data as columns or as rows. If None, the orientation is inferred by matching the columns and data dimensions. If this does not yield conclusive results, column orientation is used.
- infer_schema_length : int or None
The maximum number of rows to scan for schema inference. If set to None, the full data may be scanned (this can be slow). This parameter only applies if the input data is a sequence or generator of rows; other input is read as-is.
- nan_to_null : bool, default False
If the data comes from one or more numpy arrays, can optionally convert input data np.nan values to null instead. This is a no-op for all other input data.
Notes
Initialising LazyFrame(...) directly is equivalent to DataFrame(...).lazy().
Examples
Constructing a LazyFrame directly from a dictionary:
>>> data = {"a": [1, 2], "b": [3, 4]}
>>> lf = pl.LazyFrame(data)
>>> lf.collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘
Notice that the dtypes are automatically inferred as polars Int64:
>>> lf.dtypes
[Int64, Int64]
To specify a more detailed/specific frame schema you can supply the schema parameter with a dictionary of (name,dtype) pairs…
>>> data = {"col1": [0, 2], "col2": [3, 7]}
>>> lf2 = pl.LazyFrame(data, schema={"col1": pl.Float32, "col2": pl.Int64})
>>> lf2.collect()
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f32  ┆ i64  │
╞══════╪══════╡
│ 0.0  ┆ 3    │
│ 2.0  ┆ 7    │
└──────┴──────┘
…a sequence of (name,dtype) pairs…
>>> data = {"col1": [1, 2], "col2": [3, 4]}
>>> lf3 = pl.LazyFrame(data, schema=[("col1", pl.Float32), ("col2", pl.Int64)])
>>> lf3.collect()
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f32  ┆ i64  │
╞══════╪══════╡
│ 1.0  ┆ 3    │
│ 2.0  ┆ 4    │
└──────┴──────┘
…or a list of typed Series.
>>> data = [
...     pl.Series("col1", [1, 2], dtype=pl.Float32),
...     pl.Series("col2", [3, 4], dtype=pl.Int64),
... ]
>>> lf4 = pl.LazyFrame(data)
>>> lf4.collect()
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f32  ┆ i64  │
╞══════╪══════╡
│ 1.0  ┆ 3    │
│ 2.0  ┆ 4    │
└──────┴──────┘
Constructing a LazyFrame from a numpy ndarray, specifying column names:
>>> import numpy as np
>>> data = np.array([(1, 2), (3, 4)], dtype=np.int64)
>>> lf5 = pl.LazyFrame(data, schema=["a", "b"], orient="col")
>>> lf5.collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘
Constructing a LazyFrame from a list of lists, row orientation inferred:
>>> data = [[1, 2, 3], [4, 5, 6]]
>>> lf6 = pl.LazyFrame(data, schema=["a", "b", "c"])
>>> lf6.collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
│ 4   ┆ 5   ┆ 6   │
└─────┴─────┴─────┘
Methods:
Approximate count of unique values.
Return the k smallest rows.
Cache the result once the execution of the physical plan hits this node.
Cast LazyFrame column(s) to the specified dtype(s).
Create an empty copy of the current LazyFrame, with zero to 'n' rows.
Create a copy of this LazyFrame.
Materialize this LazyFrame into a DataFrame.
Collect DataFrame asynchronously in thread pool.
Return the number of non-null elements for each column.
Creates a summary of statistics for a LazyFrame, returning a DataFrame.
Read a logical plan from a file to construct a LazyFrame.
Remove columns from the DataFrame.
Drop all rows that contain null values.
Create a string representation of the query plan.
Explode the DataFrame to long format by exploding the given columns.
Collect a small number of rows for debugging purposes.
Fill floating point NaN values.
Fill null values using the specified value or strategy.
Filter the rows in the LazyFrame based on a predicate expression.
Get the first row of the DataFrame.
Take every nth row in the LazyFrame and return as a new LazyFrame.
Start a group by operation.
Group based on a time value (or index value of type Int32, Int64).
Create rolling groups based on a time, Int32, or Int64 column.
Start a group by operation.
Group based on a time value (or index value of type Int32, Int64).
Create rolling groups based on a time, Int32, or Int64 column.
Get the first n rows.
Inspect a node in the computation graph.
Interpolate intermediate values.
Add a join operation to the Logical Plan.
Perform an asof join.
Get the last row of the DataFrame.
Return lazy representation, i.e. itself.
Get the first n rows.
Apply a custom function.
Apply a custom function.
Aggregate the columns in the LazyFrame to their maximum value.
Aggregate the columns in the LazyFrame to their mean value.
Aggregate the columns in the LazyFrame to their median value.
Unpivot a DataFrame from wide to long format.
Take two sorted DataFrames and merge them by the sorted key.
Aggregate the columns in the LazyFrame to their minimum value.
Aggregate the columns in the LazyFrame as the sum of their null value count.
Offers a structured way to apply a sequence of user-defined functions (UDFs).
Profile a LazyFrame.
Aggregate the columns in the LazyFrame to their quantile value.
Rename column names.
Reverse the DataFrame.
Create rolling groups based on a temporal or integer column.
Select columns from this LazyFrame.
Select columns from this LazyFrame.
Serialize the logical plan of this LazyFrame to a file or string in JSON format.
Indicate that one or multiple columns are sorted.
Shift values by the given number of indices.
Shift values by the given number of places and fill the resulting null values.
Show a plot of the query plan.
Evaluate the query in streaming mode and write to a CSV file.
Evaluate the query in streaming mode and write to an IPC file.
Evaluate the query in streaming mode and write to an NDJSON file.
Evaluate the query in streaming mode and write to a Parquet file.
Get a slice of this DataFrame.
Sort the LazyFrame by the given columns.
Execute a SQL query against the LazyFrame.
Aggregate the columns in the LazyFrame to their standard deviation value.
Aggregate the columns in the LazyFrame to their sum value.
Get the last n rows.
Take every nth row in the LazyFrame and return as a new LazyFrame.
Return the k largest rows.
Drop duplicate rows from this DataFrame.
Decompose struct columns into separate columns for each of their fields.
Update the values in this LazyFrame with the non-null values in other.
Aggregate the columns in the LazyFrame to their variance value.
Add columns to this LazyFrame.
Add columns to this LazyFrame.
Add an external context to the computation graph.
Add a column at index 0 that counts the rows.
Add a row index as the first column in the LazyFrame.
Attributes:
Get the column names.
Get the column data types.
Get a mapping of column names to their data type.
Get the number of columns.
- approx_n_unique() Self [source]
Approximate count of unique values.
Deprecated since version 0.20.11: Use select(pl.all().approx_n_unique()) instead.
This is done using the HyperLogLog++ algorithm for cardinality estimation.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.approx_n_unique().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 4   ┆ 2   │
└─────┴─────┘
- bottom_k(
- k: int,
- *,
- by: IntoExpr | Iterable[IntoExpr],
- descending: bool | Sequence[bool] = False,
- nulls_last: bool | Sequence[bool] | None = None,
- maintain_order: bool | None = None,
- multithreaded: bool | None = None,
Return the k smallest rows.
- Parameters:
- k
Number of rows to return.
- by
Column(s) used to determine the bottom rows. Accepts expression input. Strings are parsed as column names.
- descending
Consider the k largest elements of the by column(s) (instead of the k smallest). This can be specified per column by passing a sequence of booleans.
- nulls_last
Place null values last.
Deprecated since version 0.20.31: This parameter will be removed in the next breaking release. Null values will be considered lowest priority and will only be included if k is larger than the number of non-null elements.
- maintain_order
Whether the order should be maintained if elements are equal. Note that if true, streaming is not possible and performance might be worse since this requires a stable search.
Deprecated since version 0.20.31: This parameter will be removed in the next breaking release. There will be no guarantees about the order of the output.
- multithreaded
Sort using multiple threads.
Deprecated since version 0.20.31: This parameter will be removed in the next breaking release. Polars itself will determine whether to use multithreading or not.
See also
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [2, 1, 1, 3, 2, 1],
...     }
... )
Get the rows which contain the 4 smallest values in column b.
>>> lf.bottom_k(4, by="b").collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ b   ┆ 1   │
│ a   ┆ 1   │
│ c   ┆ 1   │
│ a   ┆ 2   │
└─────┴─────┘
Get the rows which contain the 4 smallest values when sorting on column a and b.
>>> lf.bottom_k(4, by=["a", "b"]).collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 1   │
│ a   ┆ 2   │
│ b   ┆ 1   │
│ b   ┆ 2   │
└─────┴─────┘
- cache() Self [source]
Cache the result once the execution of the physical plan hits this node.
Using this is not recommended, as the optimizer can likely do a better job.
- cast(
- dtypes: Mapping[ColumnNameOrSelector | PolarsDataType, PolarsDataType] | PolarsDataType,
- *,
- strict: bool = True,
Cast LazyFrame column(s) to the specified dtype(s).
- Parameters:
- dtypes
Mapping of column names (or selector) to dtypes, or a single dtype to which all columns will be cast.
- strict
Throw an error if a cast could not be done (for instance, due to an overflow).
Examples
>>> from datetime import date
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": [date(2020, 1, 2), date(2021, 3, 4), date(2022, 5, 6)],
...     }
... )
Cast specific frame columns to the specified dtypes:
>>> lf.cast({"foo": pl.Float32, "bar": pl.UInt8}).collect()
shape: (3, 3)
┌─────┬─────┬────────────┐
│ foo ┆ bar ┆ ham        │
│ --- ┆ --- ┆ ---        │
│ f32 ┆ u8  ┆ date       │
╞═════╪═════╪════════════╡
│ 1.0 ┆ 6   ┆ 2020-01-02 │
│ 2.0 ┆ 7   ┆ 2021-03-04 │
│ 3.0 ┆ 8   ┆ 2022-05-06 │
└─────┴─────┴────────────┘
Cast all frame columns matching one dtype (or dtype group) to another dtype:
>>> lf.cast({pl.Date: pl.Datetime}).collect()
shape: (3, 3)
┌─────┬─────┬─────────────────────┐
│ foo ┆ bar ┆ ham                 │
│ --- ┆ --- ┆ ---                 │
│ i64 ┆ f64 ┆ datetime[μs]        │
╞═════╪═════╪═════════════════════╡
│ 1   ┆ 6.0 ┆ 2020-01-02 00:00:00 │
│ 2   ┆ 7.0 ┆ 2021-03-04 00:00:00 │
│ 3   ┆ 8.0 ┆ 2022-05-06 00:00:00 │
└─────┴─────┴─────────────────────┘
Use selectors to define the columns being cast:
>>> import polars.selectors as cs
>>> lf.cast({cs.numeric(): pl.UInt32, cs.temporal(): pl.String}).collect()
shape: (3, 3)
┌─────┬─────┬────────────┐
│ foo ┆ bar ┆ ham        │
│ --- ┆ --- ┆ ---        │
│ u32 ┆ u32 ┆ str        │
╞═════╪═════╪════════════╡
│ 1   ┆ 6   ┆ 2020-01-02 │
│ 2   ┆ 7   ┆ 2021-03-04 │
│ 3   ┆ 8   ┆ 2022-05-06 │
└─────┴─────┴────────────┘
Cast all frame columns to the specified dtype:
>>> lf.cast(pl.String).collect().to_dict(as_series=False)
{'foo': ['1', '2', '3'],
 'bar': ['6.0', '7.0', '8.0'],
 'ham': ['2020-01-02', '2021-03-04', '2022-05-06']}
- clear(n: int = 0) LazyFrame [source]
Create an empty copy of the current LazyFrame, with zero to 'n' rows.
Returns a copy with an identical schema but no data.
- Parameters:
- n
Number of (empty) rows to return in the cleared frame.
See also
clone
Cheap deepcopy/clone.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [None, 2, 3, 4],
...         "b": [0.5, None, 2.5, 13],
...         "c": [True, True, False, None],
...     }
... )
>>> lf.clear().collect()
shape: (0, 3)
┌─────┬─────┬──────┐
│ a   ┆ b   ┆ c    │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ f64 ┆ bool │
╞═════╪═════╪══════╡
└─────┴─────┴──────┘
>>> lf.clear(2).collect()
shape: (2, 3)
┌──────┬──────┬──────┐
│ a    ┆ b    ┆ c    │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ f64  ┆ bool │
╞══════╪══════╪══════╡
│ null ┆ null ┆ null │
│ null ┆ null ┆ null │
└──────┴──────┴──────┘
- clone() Self [source]
Create a copy of this LazyFrame.
This is a cheap operation that does not copy data.
See also
clear
Create an empty copy of the current LazyFrame, with identical schema but no data.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [None, 2, 3, 4],
...         "b": [0.5, None, 2.5, 13],
...         "c": [True, True, False, None],
...     }
... )
>>> lf.clone()
<LazyFrame at ...>
- collect(
- *,
- type_coercion: bool = True,
- predicate_pushdown: bool = True,
- projection_pushdown: bool = True,
- simplify_expression: bool = True,
- slice_pushdown: bool = True,
- comm_subplan_elim: bool = True,
- comm_subexpr_elim: bool = True,
- cluster_with_columns: bool = True,
- no_optimization: bool = False,
- streaming: bool = False,
- background: bool = False,
- _eager: bool = False,
- **_kwargs: Any,
Materialize this LazyFrame into a DataFrame.
By default, all query optimizations are enabled. Individual optimizations may be disabled by setting the corresponding parameter to False.
- Parameters:
- type_coercion
Do type coercion optimization.
- predicate_pushdown
Do predicate pushdown optimization.
- projection_pushdown
Do projection pushdown optimization.
- simplify_expression
Run simplify expressions optimization.
- slice_pushdown
Slice pushdown optimization.
- comm_subplan_elim
Will try to cache branching subplans that occur on self-joins or unions.
- comm_subexpr_elim
Common subexpressions will be cached and reused.
- cluster_with_columns
Combine sequential independent calls to with_columns
- no_optimization
Turn off (certain) optimizations.
- streaming
Process the query in batches to handle larger-than-memory data. If set to False (default), the entire query is processed in a single batch.
Warning
Streaming mode is considered unstable. It may be changed at any point without it being considered a breaking change.
Note
Use explain() to see if Polars can process the query in streaming mode.
- background
Run the query in the background and get a handle to the query. This handle can be used to fetch the result or cancel the query.
- Returns:
- DataFrame
See also
fetch
Run the query on the first n rows only for debugging purposes.
explain
Print the query plan that is evaluated with collect.
profile
Collect the LazyFrame and time each node in the computation graph.
polars.collect_all
Collect multiple LazyFrames at the same time.
polars.Config.set_streaming_chunk_size
Set the size of streaming batches.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a").agg(pl.all().sum()).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 4   ┆ 10  │
│ b   ┆ 11  ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘
Collect in streaming mode:
>>> lf.group_by("a").agg(pl.all().sum()).collect(
...     streaming=True
... )
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 4   ┆ 10  │
│ b   ┆ 11  ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘
- collect_async(
- *,
- gevent: bool = False,
- type_coercion: bool = True,
- predicate_pushdown: bool = True,
- projection_pushdown: bool = True,
- simplify_expression: bool = True,
- no_optimization: bool = False,
- slice_pushdown: bool = True,
- comm_subplan_elim: bool = True,
- comm_subexpr_elim: bool = True,
- cluster_with_columns: bool = True,
- streaming: bool = False,
Collect DataFrame asynchronously in thread pool.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Collects into a DataFrame (like collect()) but, instead of returning a DataFrame directly, it is scheduled to be collected inside a thread pool, while this method returns almost instantly.
This can be useful if you use gevent or asyncio and want to release control to other greenlets/tasks while LazyFrames are being collected.
- Parameters:
- gevent
Return wrapper to gevent.event.AsyncResult instead of Awaitable
- type_coercion
Do type coercion optimization.
- predicate_pushdown
Do predicate pushdown optimization.
- projection_pushdown
Do projection pushdown optimization.
- simplify_expression
Run simplify expressions optimization.
- no_optimization
Turn off (certain) optimizations.
- slice_pushdown
Slice pushdown optimization.
- comm_subplan_elim
Will try to cache branching subplans that occur on self-joins or unions.
- comm_subexpr_elim
Common subexpressions will be cached and reused.
- cluster_with_columns
Combine sequential independent calls to with_columns
- streaming
Process the query in batches to handle larger-than-memory data. If set to False (default), the entire query is processed in a single batch.
Warning
Streaming mode is considered unstable. It may be changed at any point without it being considered a breaking change.
Note
Use explain() to see if Polars can process the query in streaming mode.
- Returns:
- If gevent=False (default) then returns an awaitable.
- If gevent=True then returns a wrapper that has a .get(block=True, timeout=None) method.
See also
polars.collect_all
Collect multiple LazyFrames at the same time.
polars.collect_all_async
Collect multiple LazyFrames at the same time lazily.
Notes
In case of error set_exception is used on asyncio.Future/gevent.event.AsyncResult and will be reraised by them.
Examples
>>> import asyncio
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> async def main():
...     return await (
...         lf.group_by("a", maintain_order=True)
...         .agg(pl.all().sum())
...         .collect_async()
...     )
>>> asyncio.run(main())
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 4   ┆ 10  │
│ b   ┆ 11  ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘
- property columns: list[str][source]
Get the column names.
- Returns:
- list of str
A list containing the name of each column in order.
Warning
Determining the column names of a LazyFrame requires resolving its schema. Resolving the schema of a LazyFrame can be an expensive operation. Avoid accessing this property repeatedly if possible.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... ).select("foo", "bar")
>>> lf.columns
['foo', 'bar']
- count() Self [source]
Return the number of non-null elements for each column.
Examples
>>> lf = pl.LazyFrame(
...     {"a": [1, 2, 3, 4], "b": [1, 2, 1, None], "c": [None, None, None, None]}
... )
>>> lf.count().collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 │
╞═════╪═════╪═════╡
│ 4   ┆ 3   ┆ 0   │
└─────┴─────┴─────┘
- describe(
- percentiles: Sequence[float] | float | None = (0.25, 0.5, 0.75),
- *,
- interpolation: RollingInterpolationMethod = 'nearest',
Creates a summary of statistics for a LazyFrame, returning a DataFrame.
- Parameters:
- percentiles
One or more percentiles to include in the summary statistics. All values must be in the range [0, 1].
- interpolation : {'nearest', 'higher', 'lower', 'midpoint', 'linear'}
Interpolation method used when calculating percentiles.
- Returns:
- DataFrame
Warning
This method does not maintain the laziness of the frame, and will collect the final result. This could potentially be an expensive operation.
We do not guarantee the output of describe to be stable. It will show statistics that we deem informative, and may be updated in the future. Using describe programmatically (versus interactive exploration) is not recommended for this reason.
Notes
The median is included by default as the 50% percentile.
Examples
>>> from datetime import date, time
>>> lf = pl.LazyFrame(
...     {
...         "float": [1.0, 2.8, 3.0],
...         "int": [40, 50, None],
...         "bool": [True, False, True],
...         "str": ["zz", "xx", "yy"],
...         "date": [date(2020, 1, 1), date(2021, 7, 5), date(2022, 12, 31)],
...         "time": [time(10, 20, 30), time(14, 45, 50), time(23, 15, 10)],
...     }
... )
Show default frame statistics:
>>> lf.describe()
shape: (9, 7)
┌────────────┬──────────┬──────────┬──────────┬──────┬────────────┬──────────┐
│ statistic  ┆ float    ┆ int      ┆ bool     ┆ str  ┆ date       ┆ time     │
│ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---  ┆ ---        ┆ ---      │
│ str        ┆ f64      ┆ f64      ┆ f64      ┆ str  ┆ str        ┆ str      │
╞════════════╪══════════╪══════════╪══════════╪══════╪════════════╪══════════╡
│ count      ┆ 3.0      ┆ 2.0      ┆ 3.0      ┆ 3    ┆ 3          ┆ 3        │
│ null_count ┆ 0.0      ┆ 1.0      ┆ 0.0      ┆ 0    ┆ 0          ┆ 0        │
│ mean       ┆ 2.266667 ┆ 45.0     ┆ 0.666667 ┆ null ┆ 2021-07-02 ┆ 16:07:10 │
│ std        ┆ 1.101514 ┆ 7.071068 ┆ null     ┆ null ┆ null       ┆ null     │
│ min        ┆ 1.0      ┆ 40.0     ┆ 0.0      ┆ xx   ┆ 2020-01-01 ┆ 10:20:30 │
│ 25%        ┆ 2.8      ┆ 40.0     ┆ null     ┆ null ┆ 2021-07-05 ┆ 14:45:50 │
│ 50%        ┆ 2.8      ┆ 50.0     ┆ null     ┆ null ┆ 2021-07-05 ┆ 14:45:50 │
│ 75%        ┆ 3.0      ┆ 50.0     ┆ null     ┆ null ┆ 2022-12-31 ┆ 23:15:10 │
│ max        ┆ 3.0      ┆ 50.0     ┆ 1.0      ┆ zz   ┆ 2022-12-31 ┆ 23:15:10 │
└────────────┴──────────┴──────────┴──────────┴──────┴────────────┴──────────┘
Customize which percentiles are displayed, applying linear interpolation:
>>> with pl.Config(tbl_rows=12):
...     lf.describe(
...         percentiles=[0.1, 0.3, 0.5, 0.7, 0.9],
...         interpolation="linear",
...     )
shape: (11, 7)
┌────────────┬──────────┬──────────┬──────────┬──────┬────────────┬──────────┐
│ statistic  ┆ float    ┆ int      ┆ bool     ┆ str  ┆ date       ┆ time     │
│ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---  ┆ ---        ┆ ---      │
│ str        ┆ f64      ┆ f64      ┆ f64      ┆ str  ┆ str        ┆ str      │
╞════════════╪══════════╪══════════╪══════════╪══════╪════════════╪══════════╡
│ count      ┆ 3.0      ┆ 2.0      ┆ 3.0      ┆ 3    ┆ 3          ┆ 3        │
│ null_count ┆ 0.0      ┆ 1.0      ┆ 0.0      ┆ 0    ┆ 0          ┆ 0        │
│ mean       ┆ 2.266667 ┆ 45.0     ┆ 0.666667 ┆ null ┆ 2021-07-02 ┆ 16:07:10 │
│ std        ┆ 1.101514 ┆ 7.071068 ┆ null     ┆ null ┆ null       ┆ null     │
│ min        ┆ 1.0      ┆ 40.0     ┆ 0.0      ┆ xx   ┆ 2020-01-01 ┆ 10:20:30 │
│ 10%        ┆ 1.36     ┆ 41.0     ┆ null     ┆ null ┆ 2020-04-20 ┆ 11:13:34 │
│ 30%        ┆ 2.08     ┆ 43.0     ┆ null     ┆ null ┆ 2020-11-26 ┆ 12:59:42 │
│ 50%        ┆ 2.8      ┆ 45.0     ┆ null     ┆ null ┆ 2021-07-05 ┆ 14:45:50 │
│ 70%        ┆ 2.88     ┆ 47.0     ┆ null     ┆ null ┆ 2022-02-07 ┆ 18:09:34 │
│ 90%        ┆ 2.96     ┆ 49.0     ┆ null     ┆ null ┆ 2022-09-13 ┆ 21:33:18 │
│ max        ┆ 3.0      ┆ 50.0     ┆ 1.0      ┆ zz   ┆ 2022-12-31 ┆ 23:15:10 │
└────────────┴──────────┴──────────┴──────────┴──────┴────────────┴──────────┘
- classmethod deserialize(source: str | Path | IOBase) Self [source]
Read a logical plan from a file to construct a LazyFrame.
- Parameters:
- source
Path to a file or a file-like object (by file-like object, we refer to objects that have a read() method, such as a file handler (e.g. via builtin open function) or BytesIO).
Warning
This function uses pickle when the logical plan contains Python UDFs, and as such inherits the security implications. Deserializing can execute arbitrary code, so it should only be attempted on trusted data.
See also
Examples
>>> import io
>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum()
>>> json = lf.serialize()
>>> pl.LazyFrame.deserialize(io.StringIO(json)).collect()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 6   │
└─────┘
- drop(*columns: ColumnNameOrSelector | Iterable[ColumnNameOrSelector]) Self [source]
Remove columns from the DataFrame.
- Parameters:
- *columns
Names of the columns that should be removed from the dataframe. Accepts column selector input.
Examples
Drop a single column by passing the name of that column.
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.drop("ham").collect()
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 6.0 │
│ 2   ┆ 7.0 │
│ 3   ┆ 8.0 │
└─────┴─────┘
Drop multiple columns by passing a selector.
>>> import polars.selectors as cs
>>> lf.drop(cs.numeric()).collect()
shape: (3, 1)
┌─────┐
│ ham │
│ --- │
│ str │
╞═════╡
│ a   │
│ b   │
│ c   │
└─────┘
Use positional arguments to drop multiple columns.
>>> lf.drop("foo", "ham").collect()
shape: (3, 1)
┌─────┐
│ bar │
│ --- │
│ f64 │
╞═════╡
│ 6.0 │
│ 7.0 │
│ 8.0 │
└─────┘
- drop_nulls(
- subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None,
Drop all rows that contain null values.
The original order of the remaining rows is preserved.
- Parameters:
- subset
Column name(s) for which null values are considered. If set to None (default), use all columns.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, None, 8],
...         "ham": ["a", "b", None],
...     }
... )
The default behavior of this method is to drop rows where any single value of the row is null.
>>> lf.drop_nulls().collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘
This behaviour can be constrained to consider only a subset of columns, as defined by name or with a selector. For example, dropping rows if there is a null in any of the integer columns:
>>> import polars.selectors as cs
>>> lf.drop_nulls(subset=cs.integer()).collect()
shape: (2, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ ham  │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ i64 ┆ str  │
╞═════╪═════╪══════╡
│ 1   ┆ 6   ┆ a    │
│ 3   ┆ 8   ┆ null │
└─────┴─────┴──────┘
This method drops a row if any single value of the row is null.
Below are some example snippets that show how you could drop null values based on other conditions:
>>> lf = pl.LazyFrame(
...     {
...         "a": [None, None, None, None],
...         "b": [1, 2, None, 1],
...         "c": [1, None, None, 1],
...     }
... )
>>> lf.collect()
shape: (4, 3)
┌──────┬──────┬──────┐
│ a    ┆ b    ┆ c    │
│ ---  ┆ ---  ┆ ---  │
│ null ┆ i64  ┆ i64  │
╞══════╪══════╪══════╡
│ null ┆ 1    ┆ 1    │
│ null ┆ 2    ┆ null │
│ null ┆ null ┆ null │
│ null ┆ 1    ┆ 1    │
└──────┴──────┴──────┘
Drop a row only if all values are null:
>>> lf.filter(~pl.all_horizontal(pl.all().is_null())).collect()
shape: (3, 3)
┌──────┬─────┬──────┐
│ a    ┆ b   ┆ c    │
│ ---  ┆ --- ┆ ---  │
│ null ┆ i64 ┆ i64  │
╞══════╪═════╪══════╡
│ null ┆ 1   ┆ 1    │
│ null ┆ 2   ┆ null │
│ null ┆ 1   ┆ 1    │
└──────┴─────┴──────┘
- property dtypes: list[DataType][source]
Get the column data types.
- Returns:
- list of DataType
A list containing the data type of each column in order.
Warning
Determining the data types of a LazyFrame requires resolving its schema. Resolving the schema of a LazyFrame can be an expensive operation. Avoid accessing this property repeatedly if possible.
See also
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.dtypes
[Int64, Float64, String]
- explain(
- *,
- format: ExplainFormat = 'plain',
- optimized: bool = True,
- type_coercion: bool = True,
- predicate_pushdown: bool = True,
- projection_pushdown: bool = True,
- simplify_expression: bool = True,
- slice_pushdown: bool = True,
- comm_subplan_elim: bool = True,
- comm_subexpr_elim: bool = True,
- cluster_with_columns: bool = True,
- streaming: bool = False,
- tree_format: bool | None = None,
Create a string representation of the query plan.
Different optimizations can be turned on or off.
- Parameters:
- format : {'plain', 'tree'}
The format to use for displaying the logical plan.
- optimized
Return an optimized query plan. Defaults to True. If this is set to True the subsequent optimization flags control which optimizations run.
- type_coercion
Do type coercion optimization.
- predicate_pushdown
Do predicate pushdown optimization.
- projection_pushdown
Do projection pushdown optimization.
- simplify_expression
Run simplify expressions optimization.
- slice_pushdown
Slice pushdown optimization.
- comm_subplan_elim
Will try to cache branching subplans that occur on self-joins or unions.
- comm_subexpr_elim
Common subexpressions will be cached and reused.
- cluster_with_columns
Combine sequential independent calls to with_columns
- streaming
Run parts of the query in a streaming fashion (this is in an alpha state)
- tree_format
Format the output as a tree.
Deprecated since version 0.20.30: Use format="tree" instead.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a", maintain_order=True).agg(pl.all().sum()).sort(
...     "a"
... ).explain()
- explode( ) Self [source]
Explode the DataFrame to long format by exploding the given columns.
- Parameters:
- columns
Column names, expressions, or a selector defining them. The underlying columns being exploded must be of the List or Array data type.
- *more_columns
Additional names of columns to explode, specified as positional arguments.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "letters": ["a", "a", "b", "c"],
...         "numbers": [[1], [2, 3], [4, 5], [6, 7, 8]],
...     }
... )
>>> lf.explode("numbers").collect()
shape: (8, 2)
┌─────────┬─────────┐
│ letters ┆ numbers │
│ ---     ┆ ---     │
│ str     ┆ i64     │
╞═════════╪═════════╡
│ a       ┆ 1       │
│ a       ┆ 2       │
│ a       ┆ 3       │
│ b       ┆ 4       │
│ b       ┆ 5       │
│ c       ┆ 6       │
│ c       ┆ 7       │
│ c       ┆ 8       │
└─────────┴─────────┘
- fetch(
- n_rows: int = 500,
- *,
- type_coercion: bool = True,
- predicate_pushdown: bool = True,
- projection_pushdown: bool = True,
- simplify_expression: bool = True,
- no_optimization: bool = False,
- slice_pushdown: bool = True,
- comm_subplan_elim: bool = True,
- comm_subexpr_elim: bool = True,
- cluster_with_columns: bool = True,
- streaming: bool = False,
Collect a small number of rows for debugging purposes.
- Parameters:
- n_rows
Collect n_rows from the data sources.
- type_coercion
Run type coercion optimization.
- predicate_pushdown
Run predicate pushdown optimization.
- projection_pushdown
Run projection pushdown optimization.
- simplify_expression
Run simplify expressions optimization.
- no_optimization
Turn off optimizations.
- slice_pushdown
Slice pushdown optimization
- comm_subplan_elim
Will try to cache branching subplans that occur on self-joins or unions.
- comm_subexpr_elim
Common subexpressions will be cached and reused.
- cluster_with_columns
Combine sequential independent calls to with_columns
- streaming
Run parts of the query in a streaming fashion (this is in an alpha state)
- Returns:
- DataFrame
Warning
This is strictly a utility function that can help to debug queries using a smaller number of rows, and should not be used in production code.
Notes
This is similar to a collect() operation, but it overwrites the number of rows read by every scan operation. Be aware that fetch does not guarantee the final number of rows in the DataFrame. Filters, join operations and fewer rows being available in the scanned data will all influence the final number of rows (joins are especially susceptible to this, and may return no data at all if n_rows is too small as the join keys may not be present).
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a", maintain_order=True).agg(pl.all().sum()).fetch(2)
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 1   ┆ 6   │
│ b   ┆ 2   ┆ 5   │
└─────┴─────┴─────┘
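Why a row limit at the scan level can leave a join empty is easiest to see with a small model. The following plain-Python sketch assumes two hypothetical in-memory sources; scan and inner_join are illustrative helpers, not polars functions.

```python
def scan(source, n_rows=None):
    """Hypothetical scan: read at most n_rows rows from a source."""
    return source if n_rows is None else source[:n_rows]


def inner_join(left, right):
    """Join two lists of dicts on their 'key' field."""
    index = {row["key"]: row for row in right}
    return [{**row, **index[row["key"]]} for row in left if row["key"] in index]


left = [{"key": k, "x": k * 10} for k in range(1000)]
right = [{"key": k, "y": -k} for k in range(900, 1000)]

# Reading everything: keys 900..999 match.
full = inner_join(scan(left), scan(right))

# Limiting every scan to 500 rows: the left side only holds keys 0..499,
# so none of the join keys are present and the result is empty.
limited = inner_join(scan(left, 500), scan(right, 500))
```

The limited query returns no rows even though the full query returns one hundred, which is the situation the note warns about.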
- fill_nan(value: int | float | Expr | None) Self [source]
Fill floating point NaN values.
- Parameters:
- value
Value to fill the NaN values with.
Warning
Note that floating point NaNs (Not a Number) are not missing values. To replace missing values, use fill_null().
See also
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1.5, 2, float("nan"), 4],
...         "b": [0.5, 4, float("nan"), 13],
...     }
... )
>>> lf.fill_nan(99).collect()
shape: (4, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ f64  ┆ f64  │
╞══════╪══════╡
│ 1.5  ┆ 0.5  │
│ 2.0  ┆ 4.0  │
│ 99.0 ┆ 99.0 │
│ 4.0  ┆ 13.0 │
└──────┴──────┘
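The distinction between NaN and a missing value can be shown with plain Python floats. This is a sketch of the idea only: fill_nan here is a hypothetical list-based helper, and None stands in for a polars null.

```python
import math


def fill_nan(values, fill):
    """Replace float NaN entries; None (missing) entries are left untouched."""
    return [
        fill if isinstance(v, float) and math.isnan(v) else v
        for v in values
    ]


column = [1.5, float("nan"), None, 4.0]
filled = fill_nan(column, 99.0)
```

Only the NaN entry is replaced; the None entry survives and would need a separate null-filling step.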
- fill_null(
- value: Any | Expr | None = None,
- strategy: FillNullStrategy | None = None,
- limit: int | None = None,
- *,
- matches_supertype: bool = True,
Fill null values using the specified value or strategy.
- Parameters:
- value
Value used to fill null values.
- strategy{None, 'forward', 'backward', 'min', 'max', 'mean', 'zero', 'one'}
Strategy used to fill null values.
- limit
Number of consecutive null values to fill when using the 'forward' or 'backward' strategy.
- matches_supertype
Fill all matching supertypes of the fill value literal.
See also
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, None, 4],
...         "b": [0.5, 4, None, 13],
...     }
... )
>>> lf.fill_null(99).collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 99  ┆ 99.0 │
│ 4   ┆ 13.0 │
└─────┴──────┘
>>> lf.fill_null(strategy="forward").collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 2   ┆ 4.0  │
│ 4   ┆ 13.0 │
└─────┴──────┘
>>> lf.fill_null(strategy="max").collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 4   ┆ 13.0 │
│ 4   ┆ 13.0 │
└─────┴──────┘
>>> lf.fill_null(strategy="zero").collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 0   ┆ 0.0  │
│ 4   ┆ 13.0 │
└─────┴──────┘
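How the forward and backward strategies interact with limit can be sketched on a plain list. The helpers below are hypothetical and only model the documented behaviour (fill at most limit consecutive nulls); they are not the polars implementation.

```python
def fill_forward(values, limit=None):
    """Forward-fill None entries, filling at most `limit` consecutive ones."""
    out, last, run = [], None, 0
    for v in values:
        if v is None:
            run += 1
            if last is not None and (limit is None or run <= limit):
                out.append(last)
            else:
                out.append(None)
        else:
            last, run = v, 0
            out.append(v)
    return out


def fill_backward(values, limit=None):
    """Backward fill is a forward fill over the reversed sequence."""
    return fill_forward(values[::-1], limit)[::-1]


forward = fill_forward([1, 2, None, 4])
forward_limited = fill_forward([1, None, None, None, 5], limit=2)
backward_limited = fill_backward([1, None, None, None, 5], limit=1)
```

With limit=2 the third consecutive null stays null, and with a backward fill the value is carried from the next non-null entry instead of the previous one.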
- filter(
- *predicates: IntoExprColumn | Iterable[IntoExprColumn] | bool | list[bool] | np.ndarray[Any, Any],
- **constraints: Any,
Filter the rows in the LazyFrame based on a predicate expression.
The original order of the remaining rows is preserved.
Rows where the filter does not evaluate to True are discarded, including nulls.
- Parameters:
- predicates
Expression that evaluates to a boolean Series.
- constraints
Column filters; use name = value to filter columns by the supplied value. Each constraint will behave the same as pl.col(name).eq(value), and will be implicitly joined with the other filter conditions using &.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
Filter on one condition:
>>> lf.filter(pl.col("foo") > 1).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 2   ┆ 7   ┆ b   │
│ 3   ┆ 8   ┆ c   │
└─────┴─────┴─────┘
Filter on multiple conditions:
>>> lf.filter((pl.col("foo") < 3) & (pl.col("ham") == "a")).collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘
Provide multiple filters using *args syntax:
>>> lf.filter(
...     pl.col("foo") == 1,
...     pl.col("ham") == "a",
... ).collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘
Provide multiple filters using **kwargs syntax:
>>> lf.filter(foo=1, ham="a").collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘
Filter on an OR condition:
>>> lf.filter((pl.col("foo") == 1) | (pl.col("ham") == "c")).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
│ 3   ┆ 8   ┆ c   │
└─────┴─────┴─────┘
- first() Self [source]
Get the first row of the DataFrame.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> lf.first().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
- gather_every(n: int, offset: int = 0) Self [source]
Take every nth row in the LazyFrame and return as a new LazyFrame.
- Parameters:
- n
Gather every n-th row.
- offset
Starting index.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [5, 6, 7, 8],
...     }
... )
>>> lf.gather_every(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 5   │
│ 3   ┆ 7   │
└─────┴─────┘
>>> lf.gather_every(2, offset=1).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 6   │
│ 4   ┆ 8   │
└─────┴─────┘
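For readers who think in terms of Python sequences, the row selection above corresponds to an extended slice. This is only an analogy on a list of tuples; the gather_every function below is a hypothetical stand-in.

```python
def gather_every(rows, n, offset=0):
    """Take every n-th row, starting at index `offset`."""
    return rows[offset::n]


rows = [(1, 5), (2, 6), (3, 7), (4, 8)]
every_second = gather_every(rows, 2)
every_second_from_one = gather_every(rows, 2, offset=1)
```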
- group_by(
- *by: IntoExpr | Iterable[IntoExpr],
- maintain_order: bool = False,
- **named_by: IntoExpr,
Start a group by operation.
- Parameters:
- *by
Column(s) to group by. Accepts expression input. Strings are parsed as column names.
- maintain_order
Ensure that the order of the groups is consistent with the input data. This is slower than a default group by. Setting this to True blocks the possibility to run on the streaming engine.
- **named_by
Additional columns to group by, specified as keyword arguments. The columns will be renamed to the keyword used.
Examples
Group by one column and call agg to compute the grouped sum of another column.
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "c"],
...         "b": [1, 2, 1, 3, 3],
...         "c": [5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a").agg(pl.col("b").sum()).collect()
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 2   │
│ b   ┆ 5   │
│ c   ┆ 3   │
└─────┴─────┘
Set maintain_order=True to ensure the order of the groups is consistent with the input.
>>> lf.group_by("a", maintain_order=True).agg(pl.col("c")).collect()
shape: (3, 2)
┌─────┬───────────┐
│ a   ┆ c         │
│ --- ┆ ---       │
│ str ┆ list[i64] │
╞═════╪═══════════╡
│ a   ┆ [5, 3]    │
│ b   ┆ [4, 2]    │
│ c   ┆ [1]       │
└─────┴───────────┘
Group by multiple columns by passing a list of column names.
>>> lf.group_by(["a", "b"]).agg(pl.max("c")).collect()
shape: (4, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 1   ┆ 5   │
│ b   ┆ 2   ┆ 4   │
│ b   ┆ 3   ┆ 2   │
│ c   ┆ 3   ┆ 1   │
└─────┴─────┴─────┘
Or use positional arguments to group by multiple columns in the same way. Expressions are also accepted.
>>> lf.group_by("a", pl.col("b") // 2).agg(
...     pl.col("c").mean()
... ).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞═════╪═════╪═════╡
│ a   ┆ 0   ┆ 4.0 │
│ b   ┆ 1   ┆ 3.0 │
│ c   ┆ 1   ┆ 1.0 │
└─────┴─────┴─────┘
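What maintain_order=True promises, groups appearing in the order their keys are first seen, can be modelled with an insertion-ordered dict. The group_sum helper below is a hypothetical plain-Python sketch, not a description of how polars groups internally.

```python
def group_sum(keys, values):
    """Sum `values` per key, keeping groups in first-seen order."""
    totals = {}
    for key, value in zip(keys, values):
        totals[key] = totals.get(key, 0) + value
    return list(totals.items())


grouped = group_sum(["a", "b", "a", "b", "c"], [1, 2, 1, 3, 3])
```

Without maintain_order=True no such ordering should be relied upon in the collected result.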
- group_by_dynamic(
- index_column: IntoExpr,
- *,
- every: str | timedelta,
- period: str | timedelta | None = None,
- offset: str | timedelta | None = None,
- truncate: bool | None = None,
- include_boundaries: bool = False,
- closed: ClosedInterval = 'left',
- label: Label = 'left',
- group_by: IntoExpr | Iterable[IntoExpr] | None = None,
- start_by: StartBy = 'window',
- check_sorted: bool | None = None,
Group based on a time value (or index value of type Int32, Int64).
Time windows are calculated and rows are assigned to windows. Unlike a normal group by, a row can be a member of multiple groups. By default, the windows look like:
[start, start + period)
[start + every, start + every + period)
[start + 2*every, start + 2*every + period)
…
where start is determined by start_by, offset, every, and the earliest datapoint. See the start_by argument description for details.
Warning
The index column must be sorted in ascending order. If group_by is passed, then the index column must be sorted in ascending order within each group.
- Parameters:
- index_column
Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if group_by is specified, then it must be sorted in ascending order within each group).
In case of a dynamic group by on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.
- every
interval of the window
- period
length of the window, if None it will equal 'every'
- offset
offset of the window, does not take effect if start_by is 'datapoint'. Defaults to negative every.
- truncate
truncate the time value to the window lower bound
Deprecated since version 0.19.4: Use label instead.
- include_boundaries
Add the lower and upper bound of the window to the '_lower_boundary' and '_upper_boundary' columns. This will impact performance because it's harder to parallelize.
- closed{'left', 'right', 'both', 'none'}
Define which sides of the temporal interval are closed (inclusive).
- label{'left', 'right', 'datapoint'}
Define which label to use for the window:
'left': lower boundary of the window
'right': upper boundary of the window
'datapoint': the first value of the index column in the given window. If you don't need the label to be at one of the boundaries, choose this option for maximum performance
- group_by
Also group by this column/these columns
- start_by{'window', 'datapoint', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday'}
The strategy to determine the start of the first window by.
'window': Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.
'datapoint': Start from the first encountered data point.
a day of the week (only takes effect if every contains 'w'):
'monday': Start the window on the Monday before the first data point.
'tuesday': Start the window on the Tuesday before the first data point.
…
'sunday': Start the window on the Sunday before the first data point.
The resulting window is then shifted back until the earliest datapoint is in or in front of it.
- check_sorted
Check whether index_column is sorted (or, if group_by is given, check whether it's sorted within each group). When the group_by argument is given, polars cannot check sortedness by the metadata and has to do a full scan on the index column to verify data is sorted. This is expensive. If you are sure the data within the groups is sorted, you can set this to False. Doing so incorrectly will lead to incorrect output.
Deprecated since version 0.20.31: Sortedness is now verified in a quick manner, you can safely remove this argument.
- Returns:
- LazyGroupBy
Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if group_by columns are passed, it will only be sorted within each group).
See also
Notes
If you're coming from pandas, then
# polars
df.group_by_dynamic("ts", every="1d").agg(pl.col("value").sum())
is equivalent to
# pandas
df.set_index("ts").resample("D")["value"].sum().reset_index()
though note that, unlike pandas, polars doesn't add extra rows for empty windows. If you need index_column to be evenly spaced, then please combine with DataFrame.upsample().
The every, period and offset arguments are created with the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
In case of a group_by_dynamic on an integer column, the windows are defined by:
"1i" # length 1
"10i" # length 10
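To make the string language above concrete, the following sketch splits a duration string into (count, unit) pairs using only the unit names listed in these notes. parse_duration is a hypothetical helper written for illustration; it does not handle a leading sign and is not the parser polars uses.

```python
import re

# Two-letter units come first so that "ms" and "mo" are not read as "m".
_UNITS = r"ns|us|ms|mo|s|m|h|d|w|q|y|i"
_TOKEN = re.compile(rf"(\d+)({_UNITS})")


def parse_duration(text):
    """Split a duration string such as '3d12h4m25s' into (count, unit) pairs."""
    pos, parts = 0, []
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse {text!r} at position {pos}")
        parts.append((int(match.group(1)), match.group(2)))
        pos = match.end()
    return parts


combined = parse_duration("3d12h4m25s")
index_window = parse_duration("10i")
```

The combined string yields four pairs (days, hours, minutes, seconds), and "10i" yields a single index-count pair.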
Examples
>>> from datetime import datetime
>>> lf = pl.LazyFrame(
...     {
...         "time": pl.datetime_range(
...             start=datetime(2021, 12, 16),
...             end=datetime(2021, 12, 16, 3),
...             interval="30m",
...             eager=True,
...         ),
...         "n": range(7),
...     }
... )
>>> lf.collect()
shape: (7, 2)
┌─────────────────────┬─────┐
│ time                ┆ n   │
│ ---                 ┆ --- │
│ datetime[μs]        ┆ i64 │
╞═════════════════════╪═════╡
│ 2021-12-16 00:00:00 ┆ 0   │
│ 2021-12-16 00:30:00 ┆ 1   │
│ 2021-12-16 01:00:00 ┆ 2   │
│ 2021-12-16 01:30:00 ┆ 3   │
│ 2021-12-16 02:00:00 ┆ 4   │
│ 2021-12-16 02:30:00 ┆ 5   │
│ 2021-12-16 03:00:00 ┆ 6   │
└─────────────────────┴─────┘
Group by windows of 1 hour starting at 2021-12-16 00:00:00.
>>> lf.group_by_dynamic("time", every="1h", closed="right").agg(
...     pl.col("n")
... ).collect()
shape: (4, 2)
┌─────────────────────┬───────────┐
│ time                ┆ n         │
│ ---                 ┆ ---       │
│ datetime[μs]        ┆ list[i64] │
╞═════════════════════╪═══════════╡
│ 2021-12-15 23:00:00 ┆ [0]       │
│ 2021-12-16 00:00:00 ┆ [1, 2]    │
│ 2021-12-16 01:00:00 ┆ [3, 4]    │
│ 2021-12-16 02:00:00 ┆ [5, 6]    │
└─────────────────────┴───────────┘
The window boundaries can also be added to the aggregation result
>>> lf.group_by_dynamic(
...     "time", every="1h", include_boundaries=True, closed="right"
... ).agg(pl.col("n").mean()).collect()
shape: (4, 4)
┌─────────────────────┬─────────────────────┬─────────────────────┬─────┐
│ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ n   │
│ ---                 ┆ ---                 ┆ ---                 ┆ --- │
│ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ f64 │
╞═════════════════════╪═════════════════════╪═════════════════════╪═════╡
│ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 0.0 │
│ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 1.5 │
│ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 3.5 │
│ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 5.5 │
└─────────────────────┴─────────────────────┴─────────────────────┴─────┘
When closed='left', the window excludes the right end of interval: [lower_bound, upper_bound)
>>> lf.group_by_dynamic("time", every="1h", closed="left").agg(
...     pl.col("n")
... ).collect()
shape: (4, 2)
┌─────────────────────┬───────────┐
│ time                ┆ n         │
│ ---                 ┆ ---       │
│ datetime[μs]        ┆ list[i64] │
╞═════════════════════╪═══════════╡
│ 2021-12-16 00:00:00 ┆ [0, 1]    │
│ 2021-12-16 01:00:00 ┆ [2, 3]    │
│ 2021-12-16 02:00:00 ┆ [4, 5]    │
│ 2021-12-16 03:00:00 ┆ [6]       │
└─────────────────────┴───────────┘
When closed='both' the time values at the window boundaries belong to 2 groups.
>>> lf.group_by_dynamic("time", every="1h", closed="both").agg(
...     pl.col("n")
... ).collect()
shape: (5, 2)
┌─────────────────────┬───────────┐
│ time                ┆ n         │
│ ---                 ┆ ---       │
│ datetime[μs]        ┆ list[i64] │
╞═════════════════════╪═══════════╡
│ 2021-12-15 23:00:00 ┆ [0]       │
│ 2021-12-16 00:00:00 ┆ [0, 1, 2] │
│ 2021-12-16 01:00:00 ┆ [2, 3, 4] │
│ 2021-12-16 02:00:00 ┆ [4, 5, 6] │
│ 2021-12-16 03:00:00 ┆ [6]       │
└─────────────────────┴───────────┘
Dynamic group bys can also be combined with grouping on normal keys
>>> lf = lf.with_columns(groups=pl.Series(["a", "a", "a", "b", "b", "a", "a"]))
>>> lf.collect()
shape: (7, 3)
┌─────────────────────┬─────┬────────┐
│ time                ┆ n   ┆ groups │
│ ---                 ┆ --- ┆ ---    │
│ datetime[μs]        ┆ i64 ┆ str    │
╞═════════════════════╪═════╪════════╡
│ 2021-12-16 00:00:00 ┆ 0   ┆ a      │
│ 2021-12-16 00:30:00 ┆ 1   ┆ a      │
│ 2021-12-16 01:00:00 ┆ 2   ┆ a      │
│ 2021-12-16 01:30:00 ┆ 3   ┆ b      │
│ 2021-12-16 02:00:00 ┆ 4   ┆ b      │
│ 2021-12-16 02:30:00 ┆ 5   ┆ a      │
│ 2021-12-16 03:00:00 ┆ 6   ┆ a      │
└─────────────────────┴─────┴────────┘
>>> lf.group_by_dynamic(
...     "time",
...     every="1h",
...     closed="both",
...     group_by="groups",
...     include_boundaries=True,
... ).agg(pl.col("n")).collect()
shape: (7, 5)
┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────────┐
│ groups ┆ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ n         │
│ ---    ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---       │
│ str    ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ list[i64] │
╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪═══════════╡
│ a      ┆ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ [0]       │
│ a      ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ [0, 1, 2] │
│ a      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ [2]       │
│ a      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ [5, 6]    │
│ a      ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ [6]       │
│ b      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ [3, 4]    │
│ b      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ [4]       │
└────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────────┘
Dynamic group by on an index column
>>> lf = pl.LazyFrame(
...     {
...         "idx": pl.int_range(0, 6, eager=True),
...         "A": ["A", "A", "B", "B", "B", "C"],
...     }
... )
>>> lf.group_by_dynamic(
...     "idx",
...     every="2i",
...     period="3i",
...     include_boundaries=True,
...     closed="right",
... ).agg(pl.col("A").alias("A_agg_list")).collect()
shape: (4, 4)
┌─────────────────┬─────────────────┬─────┬─────────────────┐
│ _lower_boundary ┆ _upper_boundary ┆ idx ┆ A_agg_list      │
│ ---             ┆ ---             ┆ --- ┆ ---             │
│ i64             ┆ i64             ┆ i64 ┆ list[str]       │
╞═════════════════╪═════════════════╪═════╪═════════════════╡
│ -2              ┆ 1               ┆ -2  ┆ ["A", "A"]      │
│ 0               ┆ 3               ┆ 0   ┆ ["A", "B", "B"] │
│ 2               ┆ 5               ┆ 2   ┆ ["B", "B", "C"] │
│ 4               ┆ 7               ┆ 4   ┆ ["C"]           │
└─────────────────┴─────────────────┴─────┴─────────────────┘
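The window arithmetic behind these examples can be reproduced for an integer index in a few lines of plain Python. The sketch below models only the 'window' start strategy with the documented defaults (period equal to every, offset equal to negative every) and drops empty windows; dynamic_windows is a hypothetical helper, not the polars implementation.

```python
def dynamic_windows(index, every, period=None, offset=None, closed="left"):
    """Assign sorted integer index values to windows (model of start_by='window')."""
    period = every if period is None else period
    offset = -every if offset is None else offset
    # Truncate the earliest value to a multiple of `every`, then add the offset.
    lower = (index[0] // every) * every + offset
    groups = []
    while lower <= index[-1]:
        upper = lower + period
        members = [
            i
            for i in index
            if (lower <= i if closed in ("left", "both") else lower < i)
            and (i <= upper if closed in ("right", "both") else i < upper)
        ]
        if members:  # empty windows produce no output row
            groups.append((lower, upper, members))
        lower += every
    return groups


# Same parameters as the index-column example: every="2i", period="3i".
windows = dynamic_windows(list(range(6)), every=2, period=3, closed="right")

# Half-hour steps 0..6 with hourly windows (every=2 steps), closed on the left.
hourly = dynamic_windows(list(range(7)), every=2, closed="left")
```

Because period is larger than every in the first call, consecutive windows overlap and index values 1, 3 and 5 each land in two groups.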
- group_by_rolling(
- index_column: IntoExpr,
- *,
- period: str | timedelta,
- offset: str | timedelta | None = None,
- closed: ClosedInterval = 'right',
- by: IntoExpr | Iterable[IntoExpr] | None = None,
- check_sorted: bool | None = None,
Create rolling groups based on a time, Int32, or Int64 column.
Deprecated since version 0.19.9: This method has been renamed to LazyFrame.rolling().
- Parameters:
- index_column
Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if by is specified, then it must be sorted in ascending order within each group).
In case of a rolling group by on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.
- period
length of the window - must be non-negative
- offset
offset of the window. Default is -period
- closed{'right', 'left', 'both', 'none'}
Define which sides of the temporal interval are closed (inclusive).
- by
Also group by this column/these columns
- check_sorted
Check whether index_column is sorted (or, if by is given, check whether it's sorted within each group). When the by argument is given, polars cannot check sortedness by the metadata and has to do a full scan on the index column to verify data is sorted. This is expensive. If you are sure the data within the groups is sorted, you can set this to False. Doing so incorrectly will lead to incorrect output.
- Returns:
- LazyGroupBy
Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if by columns are passed, it will only be sorted within each by group).
- groupby(
- by: IntoExpr | Iterable[IntoExpr],
- *more_by: IntoExpr,
- maintain_order: bool = False,
Start a group by operation.
Deprecated since version 0.19.0: This method has been renamed to LazyFrame.group_by().
- Parameters:
- by
Column(s) to group by. Accepts expression input. Strings are parsed as column names.
- *more_by
Additional columns to group by, specified as positional arguments.
- maintain_order
Ensure that the order of the groups is consistent with the input data. This is slower than a default group by. Setting this to True blocks the possibility to run on the streaming engine.
- groupby_dynamic(
- index_column: IntoExpr,
- *,
- every: str | timedelta,
- period: str | timedelta | None = None,
- offset: str | timedelta | None = None,
- truncate: bool = True,
- include_boundaries: bool = False,
- closed: ClosedInterval = 'left',
- by: IntoExpr | Iterable[IntoExpr] | None = None,
- start_by: StartBy = 'window',
- check_sorted: bool | None = None,
Group based on a time value (or index value of type Int32, Int64).
Deprecated since version 0.19.0: This method has been renamed to LazyFrame.group_by_dynamic().
- Parameters:
- index_column
Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if by is specified, then it must be sorted in ascending order within each group).
In case of a dynamic group by on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.
- every
interval of the window
- period
length of the window, if None it will equal 'every'
- offset
offset of the window, does not take effect if start_by is 'datapoint'. Defaults to negative every.
- truncate
truncate the time value to the window lower bound
- include_boundaries
Add the lower and upper bound of the window to the '_lower_bound' and '_upper_bound' columns. This will impact performance because it's harder to parallelize.
- closed{'right', 'left', 'both', 'none'}
Define which sides of the temporal interval are closed (inclusive).
- by
Also group by this column/these columns
- start_by{'window', 'datapoint', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday'}
The strategy to determine the start of the first window by.
'window': Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.
'datapoint': Start from the first encountered data point.
a day of the week (only takes effect if every contains 'w'):
'monday': Start the window on the Monday before the first data point.
'tuesday': Start the window on the Tuesday before the first data point.
…
'sunday': Start the window on the Sunday before the first data point.
The resulting window is then shifted back until the earliest datapoint is in or in front of it.
- check_sorted
Check whether index_column is sorted (or, if by is given, check whether it's sorted within each group). When the by argument is given, polars cannot check sortedness by the metadata and has to do a full scan on the index column to verify data is sorted. This is expensive. If you are sure the data within the groups is sorted, you can set this to False. Doing so incorrectly will lead to incorrect output.
- Returns:
- LazyGroupBy
Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if by columns are passed, it will only be sorted within each by group).
- groupby_rolling(
- index_column: IntoExpr,
- *,
- period: str | timedelta,
- offset: str | timedelta | None = None,
- closed: ClosedInterval = 'right',
- by: IntoExpr | Iterable[IntoExpr] | None = None,
- check_sorted: bool | None = None,
Create rolling groups based on a time, Int32, or Int64 column.
Deprecated since version 0.19.0: This method has been renamed to LazyFrame.rolling().
- Parameters:
- index_column
Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if by is specified, then it must be sorted in ascending order within each group).
In case of a rolling group by on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.
- period
length of the window - must be non-negative
- offset
offset of the window. Default is -period
- closed{'right', 'left', 'both', 'none'}
Define which sides of the temporal interval are closed (inclusive).
- by
Also group by this column/these columns
- check_sorted
Check whether index_column is sorted (or, if by is given, check whether it's sorted within each group). When the by argument is given, polars cannot check sortedness by the metadata and has to do a full scan on the index column to verify data is sorted. This is expensive. If you are sure the data within the groups is sorted, you can set this to False. Doing so incorrectly will lead to incorrect output.
- Returns:
- LazyGroupBy
Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if by columns are passed, it will only be sorted within each by group).
- head(n: int = 5) Self [source]
Get the first n rows.
- Parameters:
- n
Number of rows to return.
Notes
Consider using the fetch() operation if you only want to test your query. The fetch() operation will load the first n rows at the scan level, whereas the head()/limit() are applied at the end.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [7, 8, 9, 10, 11, 12],
...     }
... )
>>> lf.head().collect()
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
│ 3   ┆ 9   │
│ 4   ┆ 10  │
│ 5   ┆ 11  │
└─────┴─────┘
>>> lf.head(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
└─────┴─────┘
- inspect(fmt: str = '{}') Self [source]
Inspect a node in the computation graph.
Print the value that this node in the computation graph evaluates to and pass on the value.
Examples
>>> lf = pl.LazyFrame({"foo": [1, 1, -2, 3]})
>>> (
...     lf.with_columns(pl.col("foo").cum_sum().alias("bar"))
...     .inspect()  # print the node before the filter
...     .filter(pl.col("bar") == pl.col("foo"))
... )
<LazyFrame at ...>
- interpolate() Self [source]
Interpolate intermediate values. The interpolation method is linear.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, None, 9, 10],
...         "bar": [6, 7, 9, None],
...         "baz": [1, None, None, 9],
...     }
... )
>>> lf.interpolate().collect()
shape: (4, 3)
┌──────┬──────┬──────────┐
│ foo  ┆ bar  ┆ baz      │
│ ---  ┆ ---  ┆ ---      │
│ f64  ┆ f64  ┆ f64      │
╞══════╪══════╪══════════╡
│ 1.0  ┆ 6.0  ┆ 1.0      │
│ 5.0  ┆ 7.0  ┆ 3.666667 │
│ 9.0  ┆ 9.0  ┆ 6.333333 │
│ 10.0 ┆ null ┆ 9.0      │
└──────┴──────┴──────────┘
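The values in the table above follow from ordinary linear interpolation between the nearest known neighbours. The sketch below reproduces them on plain lists; interpolate here is a hypothetical helper that treats row positions as evenly spaced and leaves leading or trailing gaps unfilled, which matches the example output but is not taken from the polars source.

```python
def interpolate(values):
    """Linearly fill None gaps that lie between two known values."""
    out = [None if v is None else float(v) for v in values]
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out


foo = interpolate([1, None, 9, 10])
bar = interpolate([6, 7, 9, None])
baz = interpolate([1, None, None, 9])
```

For baz the gap spans three steps from 1 to 9, so each step adds 8/3, giving roughly 3.666667 and 6.333333.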
- join(
- other: LazyFrame,
- on: str | Expr | Sequence[str | Expr] | None = None,
- how: JoinStrategy = 'inner',
- *,
- left_on: str | Expr | Sequence[str | Expr] | None = None,
- right_on: str | Expr | Sequence[str | Expr] | None = None,
- suffix: str = '_right',
- validate: JoinValidation = 'm:m',
- join_nulls: bool = False,
- coalesce: bool | None = None,
- allow_parallel: bool = True,
- force_parallel: bool = False,
Add a join operation to the Logical Plan.
- Parameters:
- other
Lazy DataFrame to join with.
- on
Join column of both DataFrames. If set, left_on and right_on should be None.
- how{'inner', 'left', 'full', 'semi', 'anti', 'cross'}
Join strategy.
- inner
Returns rows that have matching values in both tables
- left
Returns all rows from the left table, and the matched rows from the right table
- full
Returns all rows when there is a match in either left or right table
- cross
Returns the Cartesian product of rows from both tables
- semi
Filter rows that have a match in the right table.
- anti
Filter rows that do not have a match in the right table.
Note
A left join preserves the row order of the left DataFrame.
- left_on
Join column of the left DataFrame.
- right_on
Join column of the right DataFrame.
- suffix
Suffix to append to columns with a duplicate name.
- validate: {'m:m', 'm:1', '1:m', '1:1'}
Checks if join is of specified type.
- many_to_many
'm:m': default, does not result in checks
- one_to_one
'1:1': check if join keys are unique in both left and right datasets
- one_to_many
'1:m': check if join keys are unique in left dataset
- many_to_one
'm:1': check if join keys are unique in right dataset
Note
This is currently not supported by the streaming engine.
- join_nulls
Join on null values. By default null values will never produce matches.
- coalesce
Coalescing behavior (merging of join columns).
None: -> join specific.
True: -> Always coalesce join columns.
False: -> Never coalesce join columns.
- allow_parallel
Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.
- force_parallel
Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.
See also
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> other_lf = pl.LazyFrame(
...     {
...         "apple": ["x", "y", "z"],
...         "ham": ["a", "b", "d"],
...     }
... )
>>> lf.join(other_lf, on="ham").collect()
shape: (2, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     │
│ 2   ┆ 7.0 ┆ b   ┆ y     │
└─────┴─────┴─────┴───────┘
>>> lf.join(other_lf, on="ham", how="full").collect()
shape: (4, 5)
┌──────┬──────┬──────┬───────┬───────────┐
│ foo  ┆ bar  ┆ ham  ┆ apple ┆ ham_right │
│ ---  ┆ ---  ┆ ---  ┆ ---   ┆ ---       │
│ i64  ┆ f64  ┆ str  ┆ str   ┆ str       │
╞══════╪══════╪══════╪═══════╪═══════════╡
│ 1    ┆ 6.0  ┆ a    ┆ x     ┆ a         │
│ 2    ┆ 7.0  ┆ b    ┆ y     ┆ b         │
│ null ┆ null ┆ null ┆ z     ┆ d         │
│ 3    ┆ 8.0  ┆ c    ┆ null  ┆ null      │
└──────┴──────┴──────┴───────┴───────────┘
>>> lf.join(other_lf, on="ham", how="left", coalesce=True).collect()
shape: (3, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     │
│ 2   ┆ 7.0 ┆ b   ┆ y     │
│ 3   ┆ 8.0 ┆ c   ┆ null  │
└─────┴─────┴─────┴───────┘
>>> lf.join(other_lf, on="ham", how="semi").collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6.0 ┆ a   │
│ 2   ┆ 7.0 ┆ b   │
└─────┴─────┴─────┘
>>> lf.join(other_lf, on="ham", how="anti").collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8.0 ┆ c   │
└─────┴─────┴─────┘
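The difference between the inner, left, semi and anti strategies can be summarised with a small plain-Python model. The join function below is a hypothetical single-key sketch over lists of dicts; it follows the documented default that null (here None) keys never match, and it does not cover the full and cross strategies.

```python
def join(left, right, key, how="inner"):
    """Model inner, left, semi and anti joins on a single key column."""
    if how not in ("inner", "left", "semi", "anti"):
        raise ValueError(f"unsupported strategy: {how!r}")
    matches = {}
    for row in right:
        matches.setdefault(row[key], []).append(row)
    right_cols = [c for c in (right[0] if right else {}) if c != key]
    out = []
    for row in left:
        # Null keys never produce a match (join_nulls=False behaviour).
        found = [] if row[key] is None else matches.get(row[key], [])
        if how == "semi":
            if found:
                out.append(row)
        elif how == "anti":
            if not found:
                out.append(row)
        elif found:
            out.extend({**row, **other} for other in found)
        elif how == "left":
            out.append({**row, **{c: None for c in right_cols}})
    return out


lf = [{"foo": 1, "ham": "a"}, {"foo": 2, "ham": "b"}, {"foo": 3, "ham": "c"}]
other = [{"apple": "x", "ham": "a"}, {"apple": "y", "ham": "b"}, {"apple": "z", "ham": "d"}]

inner = join(lf, other, "ham")
left_joined = join(lf, other, "ham", how="left")
semi = join(lf, other, "ham", how="semi")
anti = join(lf, other, "ham", how="anti")
```

Semi and anti joins return only columns of the left input: they act as filters on the left rows rather than combining the two tables.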
- join_asof(
- other: LazyFrame,
- *,
- left_on: str | None | Expr = None,
- right_on: str | None | Expr = None,
- on: str | None | Expr = None,
- by_left: str | Sequence[str] | None = None,
- by_right: str | Sequence[str] | None = None,
- by: str | Sequence[str] | None = None,
- strategy: AsofJoinStrategy = 'backward',
- suffix: str = '_right',
- tolerance: str | int | float | timedelta | None = None,
- allow_parallel: bool = True,
- force_parallel: bool = False,
Perform an asof join.
This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the join_asof key.
For each row in the left DataFrame:
A "backward" search selects the last row in the right DataFrame whose "on" key is less than or equal to the left's key.
A "forward" search selects the first row in the right DataFrame whose "on" key is greater than or equal to the left's key.
A "nearest" search selects the last row in the right DataFrame whose value is nearest to the left's key. String keys are not currently supported for a nearest search.
The default is "backward".
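The three search strategies can be illustrated with a plain-Python sketch over a sorted list of right-hand keys. This is only an illustration of the semantics described above, not how polars implements the join; the helper name `asof_match` is hypothetical, and the tie-breaking rule for "nearest" (prefer the backward match on an exact tie) is an assumption.

```python
from bisect import bisect_left, bisect_right


def asof_match(right_keys, key, strategy="backward"):
    """Return the index into the sorted `right_keys` matched for `key`, or None."""
    if strategy == "backward":
        # Last right key that is <= key.
        i = bisect_right(right_keys, key) - 1
        return i if i >= 0 else None
    if strategy == "forward":
        # First right key that is >= key.
        i = bisect_left(right_keys, key)
        return i if i < len(right_keys) else None
    if strategy == "nearest":
        back = asof_match(right_keys, key, "backward")
        fwd = asof_match(right_keys, key, "forward")
        if back is None or fwd is None:
            return fwd if back is None else back
        # Assumption: on an exact tie the backward match wins.
        if key - right_keys[back] <= right_keys[fwd] - key:
            return back
        return fwd
    raise ValueError(f"unknown strategy: {strategy!r}")


right = [1, 5, 10]
print(asof_match(right, 6, "backward"))  # index of key 5
print(asof_match(right, 6, "forward"))   # index of key 10
print(asof_match(right, 9, "nearest"))   # index of key 10
```

A left key with no candidate on the searched side (for example a key smaller than every right key under "backward") yields no match, which corresponds to a null in the joined columns.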
- Parameters:
- other
Lazy DataFrame to join with.
- left_on
Join column of the left DataFrame.
- right_on
Join column of the right DataFrame.
- on
Join column of both DataFrames. If set, left_on and right_on should be None.
- by
Join on these columns before doing asof join.
- by_left
Join on these columns before doing asof join.
- by_right
Join on these columns before doing asof join.
- strategy{'backward', 'forward', 'nearest'}
Join strategy.
- suffix
Suffix to append to columns with a duplicate name.
- tolerance
Numeric tolerance. When set, a match is only made if the nearest key lies within this distance. If an asof join is done on columns of dtype "Date", "Datetime", "Duration" or "Time", use either a datetime.timedelta object or the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
- allow_parallel
Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.
- force_parallel
Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.
Examples
>>> from datetime import datetime
>>> gdp = pl.LazyFrame(
...     {
...         "date": [
...             datetime(2016, 1, 1),
...             datetime(2017, 1, 1),
...             datetime(2018, 1, 1),
...             datetime(2019, 1, 1),
...         ],  # note record date: Jan 1st (sorted!)
...         "gdp": [4164, 4411, 4566, 4696],
...     }
... ).set_sorted("date")
>>> population = pl.LazyFrame(
...     {
...         "date": [
...             datetime(2016, 5, 12),
...             datetime(2017, 5, 12),
...             datetime(2018, 5, 12),
...             datetime(2019, 5, 12),
...         ],  # note record date: May 12th (sorted!)
...         "population": [82.19, 82.66, 83.12, 83.52],
...     }
... ).set_sorted("date")
>>> population.join_asof(gdp, on="date", strategy="backward").collect()
shape: (4, 3)
┌─────────────────────┬────────────┬──────┐
│ date                ┆ population ┆ gdp  │
│ ---                 ┆ ---        ┆ ---  │
│ datetime[μs]        ┆ f64        ┆ i64  │
╞═════════════════════╪════════════╪══════╡
│ 2016-05-12 00:00:00 ┆ 82.19      ┆ 4164 │
│ 2017-05-12 00:00:00 ┆ 82.66      ┆ 4411 │
│ 2018-05-12 00:00:00 ┆ 83.12      ┆ 4566 │
│ 2019-05-12 00:00:00 ┆ 83.52      ┆ 4696 │
└─────────────────────┴────────────┴──────┘
- last() Self [source]
Get the last row of the DataFrame.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> lf.last().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 5   ┆ 6   │
└─────┴─────┘
- lazy() Self [source]
Return lazy representation, i.e. itself.
Useful for writing code that expects either a DataFrame or LazyFrame.
- Returns:
- LazyFrame
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [None, 2, 3, 4],
...         "b": [0.5, None, 2.5, 13],
...         "c": [True, True, False, None],
...     }
... )
>>> lf.lazy()
<LazyFrame at ...>
- limit(n: int = 5) Self [source]
Get the first n rows.
Alias for LazyFrame.head().
- Parameters:
- n
Number of rows to return.
Notes
Consider using the fetch() operation if you only want to test your query. The fetch() operation will load the first n rows at the scan level, whereas head()/limit() are applied at the end.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [7, 8, 9, 10, 11, 12],
...     }
... )
>>> lf.limit().collect()
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
│ 3   ┆ 9   │
│ 4   ┆ 10  │
│ 5   ┆ 11  │
└─────┴─────┘
>>> lf.limit(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
└─────┴─────┘
- map(
- function: Callable[[DataFrame], DataFrame],
- *,
- predicate_pushdown: bool = True,
- projection_pushdown: bool = True,
- slice_pushdown: bool = True,
- no_optimizations: bool = False,
- schema: None | SchemaDict = None,
- validate_output_schema: bool = True,
- streamable: bool = False,
Apply a custom function.
Deprecated since version 0.19.0: This method has been renamed to LazyFrame.map_batches().
- Parameters:
- function
Lambda/ function to apply.
- predicate_pushdown
Allow predicate pushdown optimization to pass this node.
- projection_pushdown
Allow projection pushdown optimization to pass this node.
- slice_pushdown
Allow slice pushdown optimization to pass this node.
- no_optimizations
Turn off all optimizations past this point.
- schema
Output schema of the function. If set to None, we assume that the schema will remain unchanged by the applied function.
- validate_output_schema
It is paramount that polars' schema is correct. This flag ensures that the output schema of this function is checked against the expected schema. Setting this to False skips the check, but may lead to hard-to-debug bugs.
- streamable
Whether the given function is eligible to run with the streaming engine. That means that the function must produce the same result whether it is executed in batches or on the full dataset.
- map_batches(
- function: Callable[[DataFrame], DataFrame],
- *,
- predicate_pushdown: bool = True,
- projection_pushdown: bool = True,
- slice_pushdown: bool = True,
- no_optimizations: bool = False,
- schema: None | SchemaDict = None,
- validate_output_schema: bool = True,
- streamable: bool = False,
Apply a custom function.
It is important that the function returns a Polars DataFrame.
- Parameters:
- function
Lambda/ function to apply.
- predicate_pushdown
Allow predicate pushdown optimization to pass this node.
- projection_pushdown
Allow projection pushdown optimization to pass this node.
- slice_pushdown
Allow slice pushdown optimization to pass this node.
- no_optimizations
Turn off all optimizations past this point.
- schema
Output schema of the function. If set to None, we assume that the schema will remain unchanged by the applied function.
- validate_output_schema
It is paramount that polars' schema is correct. This flag ensures that the output schema of this function is checked against the expected schema. Setting this to False skips the check, but may lead to hard-to-debug bugs.
- streamable
Whether the given function is eligible to run with the streaming engine. That means that the function must produce the same result whether it is executed in batches or on the full dataset.
Warning
The schema of a LazyFrame must always be correct. It is up to the caller of this function to ensure that this invariant is upheld.
It is important that the optimization flags are correct. If the custom function for instance does an aggregation of a column, predicate_pushdown should not be allowed, as this prunes rows and will influence your aggregation results.
Examples
>>> lf = (
...     pl.LazyFrame(
...         {
...             "a": pl.int_range(-100_000, 0, eager=True),
...             "b": pl.int_range(0, 100_000, eager=True),
...         }
...     )
...     .map_batches(lambda x: 2 * x, streamable=True)
...     .collect(streaming=True)
... )
shape: (100_000, 2)
┌─────────┬────────┐
│ a       ┆ b      │
│ ---     ┆ ---    │
│ i64     ┆ i64    │
╞═════════╪════════╡
│ -200000 ┆ 0      │
│ -199998 ┆ 2      │
│ -199996 ┆ 4      │
│ -199994 ┆ 6      │
│ …       ┆ …      │
│ -8      ┆ 199992 │
│ -6      ┆ 199994 │
│ -4      ┆ 199996 │
│ -2      ┆ 199998 │
└─────────┴────────┘
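The warning about predicate_pushdown and aggregating functions can be made concrete with a small plain-Python sketch. It does not use polars; rows are modelled as dicts, and the names `add_share` and `keep` are hypothetical stand-ins for a custom function and a later filter. It shows why moving a filter in front of an aggregating function changes the result.

```python
def add_share(rows):
    """Custom function that aggregates: each row gets its share of the column total."""
    total = sum(row["a"] for row in rows)
    return [{**row, "share": row["a"] / total} for row in rows]


def keep(row):
    """Predicate that a later filter step would apply."""
    return row["id"] >= 3


rows = [{"id": i, "a": 10 * i} for i in range(1, 5)]  # a = 10, 20, 30, 40

# Intended order: run the function on all rows, then filter.
after_udf = [row for row in add_share(rows) if keep(row)]

# What pushing the predicate past the function would compute instead.
pushed_down = add_share([row for row in rows if keep(row)])

print([round(row["share"], 3) for row in after_udf])    # shares of the full total
print([round(row["share"], 3) for row in pushed_down])  # shares of the pruned total
```

Because the total is computed over different sets of rows, the two lists differ, which is the situation in which predicate_pushdown should be set to False.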
- max() Self [source]
Aggregate the columns in the LazyFrame to their maximum value.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.max().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 4   ┆ 2   │
└─────┴─────┘
- mean() Self [source]
Aggregate the columns in the LazyFrame to their mean value.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.mean().collect()
shape: (1, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ f64 ┆ f64  │
╞═════╪══════╡
│ 2.5 ┆ 1.25 │
└─────┴──────┘
- median() Self [source]
Aggregate the columns in the LazyFrame to their median value.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.median().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 2.5 ┆ 1.0 │
└─────┴─────┘
- melt(
- id_vars: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None,
- value_vars: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None,
- variable_name: str | None = None,
- value_name: str | None = None,
- *,
- streamable: bool = True,
Unpivot a DataFrame from wide to long format.
Optionally leaves identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars) while all other columns, considered measured variables (value_vars), are "unpivoted" to the row axis leaving just two non-identifier columns, "variable" and "value".
- Parameters:
- id_vars
Column(s) or selector(s) to use as identifier variables.
- value_vars
Column(s) or selector(s) to use as value variables; if value_vars is empty, all columns that are not in id_vars will be used.
- variable_name
Name to give to the variable column. Defaults to "variable".
- value_name
Name to give to the value column. Defaults to "value".
- streamable
Allow this node to run in the streaming engine. If this runs in streaming, the output of the melt operation will not have a stable ordering.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": ["x", "y", "z"],
...         "b": [1, 3, 5],
...         "c": [2, 4, 6],
...     }
... )
>>> import polars.selectors as cs
>>> lf.melt(id_vars="a", value_vars=cs.numeric()).collect()
shape: (6, 3)
┌─────┬──────────┬───────┐
│ a   ┆ variable ┆ value │
│ --- ┆ ---      ┆ ---   │
│ str ┆ str      ┆ i64   │
╞═════╪══════════╪═══════╡
│ x   ┆ b        ┆ 1     │
│ y   ┆ b        ┆ 3     │
│ z   ┆ b        ┆ 5     │
│ x   ┆ c        ┆ 2     │
│ y   ┆ c        ┆ 4     │
│ z   ┆ c        ┆ 6     │
└─────┴──────────┴───────┘
- merge_sorted(other: LazyFrame, key: str) Self [source]
Take two sorted DataFrames and merge them by the sorted key.
The output of this operation will also be sorted. It is the caller's responsibility to ensure that the frames are sorted by that key; otherwise the output will not make sense.
The schemas of both LazyFrames must be equal.
- Parameters:
- other
Other DataFrame that must be merged
- key
Key that is sorted.
Examples
>>> df0 = pl.LazyFrame(
...     {"name": ["steve", "elise", "bob"], "age": [42, 44, 18]}
... ).sort("age")
>>> df0.collect()
shape: (3, 2)
┌───────┬─────┐
│ name  ┆ age │
│ ---   ┆ --- │
│ str   ┆ i64 │
╞═══════╪═════╡
│ bob   ┆ 18  │
│ steve ┆ 42  │
│ elise ┆ 44  │
└───────┴─────┘
>>> df1 = pl.LazyFrame(
...     {"name": ["anna", "megan", "steve", "thomas"], "age": [21, 33, 42, 20]}
... ).sort("age")
>>> df1.collect()
shape: (4, 2)
┌────────┬─────┐
│ name   ┆ age │
│ ---    ┆ --- │
│ str    ┆ i64 │
╞════════╪═════╡
│ thomas ┆ 20  │
│ anna   ┆ 21  │
│ megan  ┆ 33  │
│ steve  ┆ 42  │
└────────┴─────┘
>>> df0.merge_sorted(df1, key="age").collect()
shape: (7, 2)
┌────────┬─────┐
│ name   ┆ age │
│ ---    ┆ --- │
│ str    ┆ i64 │
╞════════╪═════╡
│ bob    ┆ 18  │
│ thomas ┆ 20  │
│ anna   ┆ 21  │
│ megan  ┆ 33  │
│ steve  ┆ 42  │
│ steve  ┆ 42  │
│ elise  ┆ 44  │
└────────┴─────┘
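For intuition, the same merge can be sketched in plain Python with the standard library's `heapq.merge`, which also assumes that both inputs are already sorted by the key. This is an analogy for the semantics only, not the polars implementation; the rows are modelled as `(name, age)` tuples taken from the example above.

```python
import heapq

# Both inputs must already be sorted by age, as in the example above.
left = [("bob", 18), ("steve", 42), ("elise", 44)]
right = [("thomas", 20), ("anna", 21), ("megan", 33), ("steve", 42)]

# Merge the two sorted sequences by the sort key (the age).
merged = list(heapq.merge(left, right, key=lambda row: row[1]))

for name, age in merged:
    print(name, age)
```

If either input is not sorted by the key, the merge still runs but the output is not sorted, which mirrors the caveat stated for merge_sorted.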
- min() Self [source]
Aggregate the columns in the LazyFrame to their minimum value.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.min().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 1   │
└─────┴─────┘
- null_count() Self [source]
Aggregate the columns in the LazyFrame as the sum of their null value count.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, None, 3],
...         "bar": [6, 7, None],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.null_count().collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 0   │
└─────┴─────┴─────┘
- pipe(
- function: Callable[Concatenate[LazyFrame, P], T],
- *args: P.args,
- **kwargs: P.kwargs,
Offers a structured way to apply a sequence of user-defined functions (UDFs).
- Parameters:
- function
Callable; will receive the frame as the first parameter, followed by any given args/kwargs.
- *args
Arguments to pass to the UDF.
- **kwargs
Keyword arguments to pass to the UDF.
Examples
>>> def cast_str_to_int(data, col_name):
...     return data.with_columns(pl.col(col_name).cast(pl.Int64))
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": ["10", "20", "30", "40"],
...     }
... )
>>> lf.pipe(cast_str_to_int, col_name="b").collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 10  │
│ 2   ┆ 20  │
│ 3   ┆ 30  │
│ 4   ┆ 40  │
└─────┴─────┘
>>> lf = pl.LazyFrame(
...     {
...         "b": [1, 2],
...         "a": [3, 4],
...     }
... )
>>> lf.collect()
shape: (2, 2)
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘
>>> lf.pipe(lambda tdf: tdf.select(sorted(tdf.columns))).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 1   │
│ 4   ┆ 2   │
└─────┴─────┘
- profile(
- *,
- type_coercion: bool = True,
- predicate_pushdown: bool = True,
- projection_pushdown: bool = True,
- simplify_expression: bool = True,
- no_optimization: bool = False,
- slice_pushdown: bool = True,
- comm_subplan_elim: bool = True,
- comm_subexpr_elim: bool = True,
- cluster_with_columns: bool = True,
- show_plot: bool = False,
- truncate_nodes: int = 0,
- figsize: tuple[int, int] = (18, 8),
- streaming: bool = False,
Profile a LazyFrame.
This will run the query and return a tuple containing the materialized DataFrame and a DataFrame that contains profiling information of each node that is executed.
The units of the timings are microseconds.
- Parameters:
- type_coercion
Do type coercion optimization.
- predicate_pushdown
Do predicate pushdown optimization.
- projection_pushdown
Do projection pushdown optimization.
- simplify_expression
Run simplify expressions optimization.
- no_optimization
Turn off (certain) optimizations.
- slice_pushdown
Slice pushdown optimization.
- comm_subplan_elim
Will try to cache branching subplans that occur on self-joins or unions.
- comm_subexpr_elim
Common subexpressions will be cached and reused.
- cluster_with_columns
Combine sequential independent calls to with_columns
- show_plot
Show a gantt chart of the profiling result
- truncate_nodes
Truncate the label lengths in the gantt chart to this number of characters.
- figsize
matplotlib figsize of the profiling plot
- streaming
Run parts of the query in a streaming fashion (this is in an alpha state)
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a", maintain_order=True).agg(pl.all().sum()).sort(
...     "a"
... ).profile()
(shape: (3, 3)
 ┌─────┬─────┬─────┐
 │ a   ┆ b   ┆ c   │
 │ --- ┆ --- ┆ --- │
 │ str ┆ i64 ┆ i64 │
 ╞═════╪═════╪═════╡
 │ a   ┆ 4   ┆ 10  │
 │ b   ┆ 11  ┆ 10  │
 │ c   ┆ 6   ┆ 1   │
 └─────┴─────┴─────┘,
 shape: (3, 3)
 ┌─────────────────────────┬───────┬──────┐
 │ node                    ┆ start ┆ end  │
 │ ---                     ┆ ---   ┆ ---  │
 │ str                     ┆ u64   ┆ u64  │
 ╞═════════════════════════╪═══════╪══════╡
 │ optimization            ┆ 0     ┆ 5    │
 │ group_by_partitioned(a) ┆ 5     ┆ 470  │
 │ sort(a)                 ┆ 475   ┆ 1964 │
 └─────────────────────────┴───────┴──────┘)
- quantile(
- quantile: float | Expr,
- interpolation: RollingInterpolationMethod = 'nearest',
Aggregate the columns in the LazyFrame to their quantile value.
- Parameters:
- quantile
Quantile between 0.0 and 1.0.
- interpolation{'nearest', 'higher', 'lower', 'midpoint', 'linear'}
Interpolation method.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.quantile(0.7).collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 3.0 ┆ 1.0 │
└─────┴─────┘
- rename(mapping: dict[str, str] | Callable[[str], str]) Self [source]
Rename column names.
- Parameters:
- mapping
Key value pairs that map from old name to new name, or a function that takes the old name as input and returns the new name.
Notes
If existing names are swapped (e.g. "A" points to "B" and "B" points to "A"), polars will block projection and predicate pushdowns at this node.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.rename({"foo": "apple"}).collect()
shape: (3, 3)
┌───────┬─────┬─────┐
│ apple ┆ bar ┆ ham │
│ ---   ┆ --- ┆ --- │
│ i64   ┆ i64 ┆ str │
╞═══════╪═════╪═════╡
│ 1     ┆ 6   ┆ a   │
│ 2     ┆ 7   ┆ b   │
│ 3     ┆ 8   ┆ c   │
└───────┴─────┴─────┘
>>> lf.rename(lambda column_name: "c" + column_name[1:]).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ coo ┆ car ┆ cam │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
│ 2   ┆ 7   ┆ b   │
│ 3   ┆ 8   ┆ c   │
└─────┴─────┴─────┘
- reverse() Self [source]
Reverse the DataFrame.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "key": ["a", "b", "c"],
...         "val": [1, 2, 3],
...     }
... )
>>> lf.reverse().collect()
shape: (3, 2)
┌─────┬─────┐
│ key ┆ val │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ c   ┆ 3   │
│ b   ┆ 2   │
│ a   ┆ 1   │
└─────┴─────┘
- rolling(
- index_column: IntoExpr,
- *,
- period: str | timedelta,
- offset: str | timedelta | None = None,
- closed: ClosedInterval = 'right',
- group_by: IntoExpr | Iterable[IntoExpr] | None = None,
- check_sorted: bool | None = None,
Create rolling groups based on a temporal or integer column.
Different from a group_by_dynamic, the windows are now determined by the individual values and are not of constant intervals. For constant intervals use LazyFrame.group_by_dynamic().
If you have a time series <t_0, t_1, ..., t_n>, then by default the windows created will be
(t_0 - period, t_0]
(t_1 - period, t_1]
…
(t_n - period, t_n]
whereas if you pass a non-default offset, then the windows will be
(t_0 + offset, t_0 + offset + period]
(t_1 + offset, t_1 + offset + period]
…
(t_n + offset, t_n + offset + period]
The period and offset arguments are created either from a timedelta, or by using the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
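To show how a combined string such as "3d12h4m25s" breaks down, here is a small plain-Python tokenizer. It is only a sketch of the string language listed above: the function name `split_duration` is hypothetical, it does not handle negative values, and it deliberately returns (count, unit) pairs rather than a fixed length of time, since the calendar units have no fixed duration.

```python
import re

# Units from the list above; longer spellings come first so that
# "ms" and "mo" are not read as "m".
_UNITS = ("ns", "us", "ms", "mo", "s", "m", "h", "d", "w", "q", "y", "i")
_UNIT_PATTERN = "|".join(_UNITS)
_TOKEN = re.compile(rf"(\d+)({_UNIT_PATTERN})")
_WHOLE = re.compile(rf"(?:\d+(?:{_UNIT_PATTERN}))+")


def split_duration(text):
    """Split a duration string into a list of (count, unit) pairs."""
    if not _WHOLE.fullmatch(text):
        raise ValueError(f"not a valid duration string: {text!r}")
    return [(int(count), unit) for count, unit in _TOKEN.findall(text)]


print(split_duration("3d12h4m25s"))
print(split_duration("1mo"))
```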
- Parameters:
- index_column
Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if group_by is specified, then it must be sorted in ascending order within each group).
In case of a rolling group by on indices, dtype needs to be one of {UInt32, UInt64, Int32, Int64}. Note that the first three get temporarily cast to Int64, so if performance matters use an Int64 column.
- period
Length of the window - must be non-negative.
- offset
Offset of the window. Default is -period.
- closed{'right', 'left', 'both', 'none'}
Define which sides of the temporal interval are closed (inclusive).
- group_by
Also group by this column/these columns
- check_sorted
Check whether index_column is sorted (or, if group_by is given, check whether it's sorted within each group). When the group_by argument is given, polars cannot check sortedness by the metadata and has to do a full scan on the index column to verify data is sorted. This is expensive. If you are sure the data within the groups is sorted, you can set this to False. Doing so incorrectly will lead to incorrect output.
Deprecated since version 0.20.31: Sortedness is now verified in a quick manner, you can safely remove this argument.
- Returns:
- LazyGroupBy
Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if group_by columns are passed, it will only be sorted within each group).
Examples
>>> dates = [
...     "2020-01-01 13:45:48",
...     "2020-01-01 16:42:13",
...     "2020-01-01 16:45:09",
...     "2020-01-02 18:12:48",
...     "2020-01-03 19:45:32",
...     "2020-01-08 23:16:43",
... ]
>>> df = pl.LazyFrame({"dt": dates, "a": [3, 7, 5, 9, 2, 1]}).with_columns(
...     pl.col("dt").str.strptime(pl.Datetime).set_sorted()
... )
>>> out = (
...     df.rolling(index_column="dt", period="2d")
...     .agg(
...         pl.sum("a").alias("sum_a"),
...         pl.min("a").alias("min_a"),
...         pl.max("a").alias("max_a"),
...     )
...     .collect()
... )
>>> out
shape: (6, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ dt                  ┆ sum_a ┆ min_a ┆ max_a │
│ ---                 ┆ ---   ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ i64   ┆ i64   ┆ i64   │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 13:45:48 ┆ 3     ┆ 3     ┆ 3     │
│ 2020-01-01 16:42:13 ┆ 10    ┆ 3     ┆ 7     │
│ 2020-01-01 16:45:09 ┆ 15    ┆ 3     ┆ 7     │
│ 2020-01-02 18:12:48 ┆ 24    ┆ 3     ┆ 9     │
│ 2020-01-03 19:45:32 ┆ 11    ┆ 2     ┆ 9     │
│ 2020-01-08 23:16:43 ┆ 1     ┆ 1     ┆ 1     │
└─────────────────────┴───────┴───────┴───────┘
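The sum_a column above can be reproduced by hand with a plain-Python sketch of the default windows (t - period, t], closed on the right. This is a naive quadratic illustration of the window definition, not how polars evaluates it; the helper name `rolling_sums` is hypothetical, and a fixed 48-hour `timedelta` stands in for "2d", which is only equivalent here because the timestamps carry no time zone.

```python
from datetime import datetime, timedelta


def rolling_sums(times, values, period):
    """For each t, sum the values whose timestamp lies in (t - period, t]."""
    sums = []
    for t in times:
        lower = t - period
        sums.append(sum(v for u, v in zip(times, values) if lower < u <= t))
    return sums


times = [
    datetime.fromisoformat(text)
    for text in [
        "2020-01-01 13:45:48",
        "2020-01-01 16:42:13",
        "2020-01-01 16:45:09",
        "2020-01-02 18:12:48",
        "2020-01-03 19:45:32",
        "2020-01-08 23:16:43",
    ]
]
values = [3, 7, 5, 9, 2, 1]

sums = rolling_sums(times, values, timedelta(days=2))
print(sums)  # [3, 10, 15, 24, 11, 1]
```

Note that each window is anchored on a row's own timestamp, so the window for 2020-01-03 19:45:32 no longer contains the three rows from 2020-01-01.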
- property schema: OrderedDict[str, DataType][source]
Get a mapping of column names to their data type.
- Returns:
- OrderedDict
An ordered mapping of column names to their data type.
Warning
Resolving the schema of a LazyFrame can be an expensive operation. Avoid accessing this property repeatedly if possible.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.schema
OrderedDict({'foo': Int64, 'bar': Float64, 'ham': String})
- select(*exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr) Self [source]
Select columns from this LazyFrame.
- Parameters:
- *exprs
Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- **named_exprs
Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.
Examples
Pass the name of a column to select that column.
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.select("foo").collect()
shape: (3, 1)
┌─────┐
│ foo │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘
Multiple columns can be selected by passing a list of column names.
>>> lf.select(["foo", "bar"]).collect()
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 6   │
│ 2   ┆ 7   │
│ 3   ┆ 8   │
└─────┴─────┘
Multiple columns can also be selected using positional arguments instead of a list. Expressions are also accepted.
>>> lf.select(pl.col("foo"), pl.col("bar") + 1).collect()
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
│ 3   ┆ 9   │
└─────┴─────┘
Use keyword arguments to easily name your expression inputs.
>>> lf.select(
...     threshold=pl.when(pl.col("foo") > 2).then(10).otherwise(0)
... ).collect()
shape: (3, 1)
┌───────────┐
│ threshold │
│ ---       │
│ i32       │
╞═══════════╡
│ 0         │
│ 0         │
│ 10        │
└───────────┘
Expressions with multiple outputs can be automatically instantiated as Structs by enabling the setting Config.set_auto_structify(True):
>>> with pl.Config(auto_structify=True):
...     lf.select(
...         is_odd=(pl.col(pl.INTEGER_DTYPES) % 2).name.suffix("_is_odd"),
...     ).collect()
shape: (3, 1)
┌───────────┐
│ is_odd    │
│ ---       │
│ struct[2] │
╞═══════════╡
│ {1,0}     │
│ {0,1}     │
│ {1,0}     │
└───────────┘
- select_seq(
- *exprs: IntoExpr | Iterable[IntoExpr],
- **named_exprs: IntoExpr,
Select columns from this LazyFrame.
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.
- Parameters:
- *exprs
Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- **named_exprs
Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.
- serialize(file: IOBase | str | Path | None = None) str | None [source]
Serialize the logical plan of this LazyFrame to a file or string in JSON format.
- Parameters:
- file
File path to which the result should be written. If set to None (default), the output is returned as a string instead.
Examples
Serialize the logical plan into a JSON string.
>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum()
>>> json = lf.serialize()
>>> json
'{"MapFunction":{"input":{"DataFrameScan":{"df":{"columns":[{"name":"a","datatype":"Int64","bit_settings":"","values":[1,2,3]}]},"schema":{"inner":{"a":"Int64"}},"output_schema":null,"projection":null,"selection":null}},"function":{"Stats":"Sum"}}}'
The logical plan can later be deserialized back into a LazyFrame.
>>> import io
>>> pl.LazyFrame.deserialize(io.StringIO(json)).collect()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 6   │
└─────┘
- set_sorted( ) Self [source]
Indicate that one or multiple columns are sorted.
- Parameters:
- column
Columns that are sorted
- more_columns
Additional columns that are sorted, specified as positional arguments.
- descending
Whether the columns are sorted in descending order.
- shift( ) Self [source]
Shift values by the given number of indices.
- Parameters:
- n
Number of indices to shift forward. If a negative value is passed, values are shifted in the opposite direction instead.
- fill_value
Fill the resulting null values with this value. Accepts expression input. Non-expression inputs are parsed as literals.
Notes
This method is similar to the LAG operation in SQL when the value for n is positive. With a negative value for n, it is similar to LEAD.
Examples
By default, values are shifted forward by one index.
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [5, 6, 7, 8],
...     }
... )
>>> lf.shift().collect()
shape: (4, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ null ┆ null │
│ 1    ┆ 5    │
│ 2    ┆ 6    │
│ 3    ┆ 7    │
└──────┴──────┘
Pass a negative value to shift in the opposite direction instead.
>>> lf.shift(-2).collect()
shape: (4, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 3    ┆ 7    │
│ 4    ┆ 8    │
│ null ┆ null │
│ null ┆ null │
└──────┴──────┘
Specify fill_value to fill the resulting null values.
>>> lf.shift(-2, fill_value=100).collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 7   │
│ 4   ┆ 8   │
│ 100 ┆ 100 │
│ 100 ┆ 100 │
└─────┴─────┘
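The shifting rule for a single column can be sketched in plain Python, with None standing in for null. This is an illustration of the semantics shown in the examples above, not the polars implementation; the function name `shift_values` is hypothetical.

```python
def shift_values(values, n=1, fill_value=None):
    """Shift a list forward by n places (backward if n is negative)."""
    size = len(values)
    if n == 0:
        return list(values)
    if abs(n) >= size:
        # Every original value is shifted out of the column.
        return [fill_value] * size
    padding = [fill_value] * abs(n)
    if n > 0:
        # Like SQL LAG: pad at the front, drop values from the end.
        return padding + list(values[: size - n])
    # Like SQL LEAD: drop values from the front, pad at the end.
    return list(values[-n:]) + padding


column = [1, 2, 3, 4]
print(shift_values(column))                      # [None, 1, 2, 3]
print(shift_values(column, -2))                  # [3, 4, None, None]
print(shift_values(column, -2, fill_value=100))  # [3, 4, 100, 100]
```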
- shift_and_fill(fill_value: Expr | int | str | float, *, n: int = 1) Self [source]
Shift values by the given number of places and fill the resulting null values.
Deprecated since version 0.19.12: Use shift() instead.
- Parameters:
- fill_value
fill None values with the result of this expression.
- n
Number of places to shift (may be negative).
- show_graph(
- *,
- optimized: bool = True,
- show: bool = True,
- output_path: str | Path | None = None,
- raw_output: bool = False,
- figsize: tuple[float, float] = (16.0, 12.0),
- type_coercion: bool = True,
- predicate_pushdown: bool = True,
- projection_pushdown: bool = True,
- simplify_expression: bool = True,
- slice_pushdown: bool = True,
- comm_subplan_elim: bool = True,
- comm_subexpr_elim: bool = True,
- cluster_with_columns: bool = True,
- streaming: bool = False,
Show a plot of the query plan.
Note that graphviz must be installed to render the visualization (if not already present, you can download it from https://graphviz.org/download).
- Parameters:
- optimized
Optimize the query plan.
- show
Show the figure.
- output_path
Write the figure to disk.
- raw_output
Return dot syntax. This cannot be combined with show and/or output_path.
- figsize
Passed to matplotlib if show == True.
- type_coercion
Do type coercion optimization.
- predicate_pushdown
Do predicate pushdown optimization.
- projection_pushdown
Do projection pushdown optimization.
- simplify_expression
Run simplify expressions optimization.
- slice_pushdown
Slice pushdown optimization.
- comm_subplan_elim
Will try to cache branching subplans that occur on self-joins or unions.
- comm_subexpr_elim
Common subexpressions will be cached and reused.
- cluster_with_columns
Combine sequential independent calls to with_columns
- streaming
Run parts of the query in a streaming fashion (this is in an alpha state)
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a", maintain_order=True).agg(pl.all().sum()).sort(
...     "a"
... ).show_graph()
- sink_csv(
- path: str | Path,
- *,
- include_bom: bool = False,
- include_header: bool = True,
- separator: str = ',',
- line_terminator: str = '\n',
- quote_char: str = '"',
- batch_size: int = 1024,
- datetime_format: str | None = None,
- date_format: str | None = None,
- time_format: str | None = None,
- float_precision: int | None = None,
- null_value: str | None = None,
- quote_style: CsvQuoteStyle | None = None,
- maintain_order: bool = True,
- type_coercion: bool = True,
- predicate_pushdown: bool = True,
- projection_pushdown: bool = True,
- simplify_expression: bool = True,
- slice_pushdown: bool = True,
- no_optimization: bool = False,
Evaluate the query in streaming mode and write to a CSV file.
Warning
Streaming mode is considered unstable. It may be changed at any point without it being considered a breaking change.
This allows streaming results that are larger than RAM to be written to disk.
- Parameters:
- path
File path to which the file should be written.
- include_bom
Whether to include UTF-8 BOM in the CSV output.
- include_header
Whether to include header in the CSV output.
- separator
Separate CSV fields with this symbol.
- line_terminator
String used to end each row.
- quote_char
Byte to use as quoting character.
- batch_size
Number of rows that will be processed per thread.
- datetime_format
A format string, with the specifiers defined by the chrono Rust crate. If no format is specified, the default fractional-second precision is inferred from the maximum timeunit found in the frame's Datetime cols (if any).
- date_format
A format string, with the specifiers defined by the chrono Rust crate.
- time_format
A format string, with the specifiers defined by the chrono Rust crate.
- float_precision
Number of decimal places to write, applied to both Float32 and Float64 datatypes.
- null_value
A string representing null values (defaulting to the empty string).
- quote_style{'necessary', 'always', 'non_numeric', 'never'}
Determines the quoting strategy used.
necessary (default): This puts quotes around fields only when necessary. They are necessary when fields contain a quote, delimiter or record terminator. Quotes are also necessary when writing an empty record (which is indistinguishable from a record with one empty field).
always: This puts quotes around every field.
never: This never puts quotes around fields, even if that results in invalid CSV data (e.g.: by not quoting strings containing the separator).
non_numeric: This puts quotes around all fields that are non-numeric. Namely, when writing a field that does not parse as a valid float or integer, then quotes will be used even if they aren't strictly necessary.
- maintain_order
Maintain the order in which data is processed. Setting this to False will be slightly faster.
- type_coercion
Do type coercion optimization.
- predicate_pushdown
Do predicate pushdown optimization.
- projection_pushdown
Do projection pushdown optimization.
- simplify_expression
Run simplify expressions optimization.
- slice_pushdown
Slice pushdown optimization.
- no_optimization
Turn off (certain) optimizations.
- Returns:
- DataFrame
Examples
>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")
>>> lf.sink_csv("out.csv")
- sink_ipc(
- path: str | Path,
- *,
- compression: str | None = 'zstd',
- maintain_order: bool = True,
- type_coercion: bool = True,
- predicate_pushdown: bool = True,
- projection_pushdown: bool = True,
- simplify_expression: bool = True,
- slice_pushdown: bool = True,
- no_optimization: bool = False,
Evaluate the query in streaming mode and write to an IPC file.
Warning
Streaming mode is considered unstable. It may be changed at any point without it being considered a breaking change.
This allows streaming results that are larger than RAM to be written to disk.
- Parameters:
- path
File path to which the file should be written.
- compression{'lz4', 'zstd'}
Choose 'zstd' for good compression performance. Choose 'lz4' for fast compression/decompression.
- maintain_order
Maintain the order in which data is processed. Setting this to False will be slightly faster.
- type_coercion
Do type coercion optimization.
- predicate_pushdown
Do predicate pushdown optimization.
- projection_pushdown
Do projection pushdown optimization.
- simplify_expression
Run simplify expressions optimization.
- slice_pushdown
Slice pushdown optimization.
- no_optimization
Turn off (certain) optimizations.
- Returns:
- DataFrame
Examples
>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")
>>> lf.sink_ipc("out.arrow")
- sink_ndjson(
- path: str | Path,
- *,
- maintain_order: bool = True,
- type_coercion: bool = True,
- predicate_pushdown: bool = True,
- projection_pushdown: bool = True,
- simplify_expression: bool = True,
- slice_pushdown: bool = True,
- no_optimization: bool = False,
Evaluate the query in streaming mode and write to an NDJSON file.
Warning
Streaming mode is considered unstable. It may be changed at any point without it being considered a breaking change.
This allows streaming results that are larger than RAM to be written to disk.
- Parameters:
- path
File path to which the file should be written.
- maintain_order
Maintain the order in which data is processed. Setting this to False will be slightly faster.
- type_coercion
Do type coercion optimization.
- predicate_pushdown
Do predicate pushdown optimization.
- projection_pushdown
Do projection pushdown optimization.
- simplify_expression
Run simplify expressions optimization.
- slice_pushdown
Slice pushdown optimization.
- no_optimization
Turn off (certain) optimizations.
- Returns:
- DataFrame
Examples
>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")
>>> lf.sink_ndjson("out.ndjson")
- sink_parquet(
- path: str | Path,
- *,
- compression: str = 'zstd',
- compression_level: int | None = None,
- statistics: bool = True,
- row_group_size: int | None = None,
- data_pagesize_limit: int | None = None,
- maintain_order: bool = True,
- type_coercion: bool = True,
- predicate_pushdown: bool = True,
- projection_pushdown: bool = True,
- simplify_expression: bool = True,
- slice_pushdown: bool = True,
- no_optimization: bool = False,
Evaluate the query in streaming mode and write to a Parquet file.
Warning
Streaming mode is considered unstable. It may be changed at any point without it being considered a breaking change.
This allows streaming results that are larger than RAM to be written to disk.
- Parameters:
- path
File path to which the file should be written.
- compression{'lz4', 'uncompressed', 'snappy', 'gzip', 'lzo', 'brotli', 'zstd'}
Choose 'zstd' for good compression performance. Choose 'lz4' for fast compression/decompression. Choose 'snappy' for more backwards compatibility guarantees when you deal with older parquet readers.
- compression_level
The level of compression to use. Higher compression means smaller files on disk.
'gzip': min-level: 0, max-level: 10.
'brotli': min-level: 0, max-level: 11.
'zstd': min-level: 1, max-level: 22.
- statistics
Write statistics to the parquet headers. This is the default behavior.
- row_group_size
Size of the row groups in number of rows. If None (default), the chunks of the DataFrame are used. Writing in smaller chunks may reduce memory pressure and improve writing speeds.
- data_pagesize_limit
Size limit of individual data pages. If not set, defaults to 1024 * 1024 bytes.
- maintain_order
Maintain the order in which data is processed. Setting this to False will be slightly faster.
- type_coercion
Do type coercion optimization.
- predicate_pushdown
Do predicate pushdown optimization.
- projection_pushdown
Do projection pushdown optimization.
- simplify_expression
Run simplify expressions optimization.
- slice_pushdown
Slice pushdown optimization.
- no_optimization
Turn off (certain) optimizations.
- Returns:
- DataFrame
Examples
>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")
>>> lf.sink_parquet("out.parquet")
- slice(offset: int, length: int | None = None) -> Self
Get a slice of this LazyFrame.
- Parameters:
- offset
Start index. Negative indexing is supported.
- length
Length of the slice. If set to None, all rows starting at the offset will be selected.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": ["x", "y", "z"],
...         "b": [1, 3, 5],
...         "c": [2, 4, 6],
...     }
... )
>>> lf.slice(1, 2).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ y   ┆ 3   ┆ 4   │
│ z   ┆ 5   ┆ 6   │
└─────┴─────┴─────┘
- sort(
- by: IntoExpr | Iterable[IntoExpr],
- *more_by: IntoExpr,
- descending: bool | Sequence[bool] = False,
- nulls_last: bool | Sequence[bool] = False,
- maintain_order: bool = False,
- multithreaded: bool = True,
Sort the LazyFrame by the given columns.
- Parameters:
- by
Column(s) to sort by. Accepts expression input. Strings are parsed as column names.
- *more_by
Additional columns to sort by, specified as positional arguments.
- descending
Sort in descending order. When sorting by multiple columns, can be specified per column by passing a sequence of booleans.
- nulls_last
Place null values last; can specify a single boolean applying to all columns or a sequence of booleans for per-column control.
- maintain_order
Whether the order should be maintained if elements are equal. Note that if True, streaming is not possible and performance might be worse since this requires a stable sort.
- multithreaded
Sort using multiple threads.
Examples
Pass a single column name to sort by that column.
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, None],
...         "b": [6.0, 5.0, 4.0],
...         "c": ["a", "c", "b"],
...     }
... )
>>> lf.sort("a").collect()
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ null ┆ 4.0 ┆ b   │
│ 1    ┆ 6.0 ┆ a   │
│ 2    ┆ 5.0 ┆ c   │
└──────┴─────┴─────┘
Sorting by expressions is also supported.
>>> lf.sort(pl.col("a") + pl.col("b") * 2, nulls_last=True).collect()
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ 2    ┆ 5.0 ┆ c   │
│ 1    ┆ 6.0 ┆ a   │
│ null ┆ 4.0 ┆ b   │
└──────┴─────┴─────┘
Sort by multiple columns by passing a list of columns.
>>> lf.sort(["c", "a"], descending=True).collect()
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ 2    ┆ 5.0 ┆ c   │
│ null ┆ 4.0 ┆ b   │
│ 1    ┆ 6.0 ┆ a   │
└──────┴─────┴─────┘
Or use positional arguments to sort by multiple columns in the same way.
>>> lf.sort("c", "a", descending=[False, True]).collect()
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ 1    ┆ 6.0 ┆ a   │
│ null ┆ 4.0 ┆ b   │
│ 2    ┆ 5.0 ┆ c   │
└──────┴─────┴─────┘
- sql(query: str, *, table_name: str = 'self') -> Self
Execute a SQL query against the LazyFrame.
New in version 0.20.23.
Warning
This functionality is considered unstable, although it is close to being considered stable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- query
SQL query to execute.
- table_name
Optionally provide an explicit name for the table that represents the calling frame (defaults to 'self').
Notes
The calling frame is automatically registered as a table in the SQL context under the name 'self'. If you want access to the DataFrames and LazyFrames found in the current globals, use the top-level pl.sql.
More control over registration and execution behaviour is available by using the SQLContext object.
Examples
>>> lf1 = pl.LazyFrame({"a": [1, 2, 3], "b": [6, 7, 8], "c": ["z", "y", "x"]})
>>> lf2 = pl.LazyFrame({"a": [3, 2, 1], "d": [125, -654, 888]})
Query the LazyFrame using SQL:
>>> lf1.sql("SELECT c, b FROM self WHERE a > 1").collect()
shape: (2, 2)
┌─────┬─────┐
│ c   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ y   ┆ 7   │
│ x   ┆ 8   │
└─────┴─────┘
Apply SQL transforms (aliasing 'self' to 'frame') then filter natively (you can freely mix SQL and native operations):
>>> lf1.sql(
...     query='''
...         SELECT
...             a,
...             (a % 2 == 0) AS a_is_even,
...             (b::float4 / 2) AS "b/2",
...             CONCAT_WS(':', c, c, c) AS c_c_c
...         FROM frame
...         ORDER BY a
...     ''',
...     table_name="frame",
... ).filter(~pl.col("c_c_c").str.starts_with("x")).collect()
shape: (2, 4)
┌─────┬───────────┬─────┬───────┐
│ a   ┆ a_is_even ┆ b/2 ┆ c_c_c │
│ --- ┆ ---       ┆ --- ┆ ---   │
│ i64 ┆ bool      ┆ f32 ┆ str   │
╞═════╪═══════════╪═════╪═══════╡
│ 1   ┆ false     ┆ 3.0 ┆ z:z:z │
│ 2   ┆ true      ┆ 3.5 ┆ y:y:y │
└─────┴───────────┴─────┴───────┘
- std(ddof: int = 1) -> Self
Aggregate the columns in the LazyFrame to their standard deviation value.
- Parameters:
- ddof
"Delta Degrees of Freedom": the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.std().collect()
shape: (1, 2)
┌──────────┬─────┐
│ a        ┆ b   │
│ ---      ┆ --- │
│ f64      ┆ f64 │
╞══════════╪═════╡
│ 1.290994 ┆ 0.5 │
└──────────┴─────┘
>>> lf.std(ddof=0).collect()
shape: (1, 2)
┌──────────┬──────────┐
│ a        ┆ b        │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
│ 1.118034 ┆ 0.433013 │
└──────────┴──────────┘
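To make the role of `ddof` concrete, the values for column `a` above can be reproduced by hand. The helper below is a hypothetical plain-Python function written for this illustration, not part of Polars. It applies the `N - ddof` divisor directly.

```python
import math


def std_with_ddof(values, ddof=1):
    """Standard deviation with divisor N - ddof (illustrative helper)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - ddof))


a = [1, 2, 3, 4]
std_default = std_with_ddof(a)  # divisor N - 1 = 3
std_population = std_with_ddof(a, ddof=0)  # divisor N = 4

print(round(std_default, 6), round(std_population, 6))
```

Rounded to six decimals these match the `1.290994` and `1.118034` shown for column `a`.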
- sum() -> Self
Aggregate the columns in the LazyFrame to their sum value.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.sum().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 10  ┆ 5   │
└─────┴─────┘
- tail(n: int = 5) -> Self
Get the last n rows.
- Parameters:
- n
Number of rows to return.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [7, 8, 9, 10, 11, 12],
...     }
... )
>>> lf.tail().collect()
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 8   │
│ 3   ┆ 9   │
│ 4   ┆ 10  │
│ 5   ┆ 11  │
│ 6   ┆ 12  │
└─────┴─────┘
>>> lf.tail(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 5   ┆ 11  │
│ 6   ┆ 12  │
└─────┴─────┘
- take_every(n: int, offset: int = 0) -> Self
Take every nth row in the LazyFrame and return as a new LazyFrame.
Deprecated since version 0.19.0: This method has been renamed to gather_every().
- Parameters:
- n
Gather every n-th row.
- offset
Starting index.
- top_k(
- k: int,
- *,
- by: IntoExpr | Iterable[IntoExpr],
- descending: bool | Sequence[bool] = False,
- nulls_last: bool | Sequence[bool] | None = None,
- maintain_order: bool | None = None,
- multithreaded: bool | None = None,
Return the k largest rows.
- Parameters:
- k
Number of rows to return.
- by
Column(s) used to determine the top rows. Accepts expression input. Strings are parsed as column names.
- descending
Consider the k smallest elements of the by column(s) (instead of the k largest). This can be specified per column by passing a sequence of booleans.
- nulls_last
Place null values last.
Deprecated since version 0.20.31: This parameter will be removed in the next breaking release. Null values will be considered lowest priority and will only be included if k is larger than the number of non-null elements.
- maintain_order
Whether the order should be maintained if elements are equal. Note that if True, streaming is not possible and performance might be worse since this requires a stable sort.
Deprecated since version 0.20.31: This parameter will be removed in the next breaking release. There will be no guarantees about the order of the output.
- multithreaded
Sort using multiple threads.
Deprecated since version 0.20.31: This parameter will be removed in the next breaking release. Polars itself will determine whether to use multithreading or not.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [2, 1, 1, 3, 2, 1],
...     }
... )
Get the rows which contain the 4 largest values in column b.
>>> lf.top_k(4, by="b").collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ b   ┆ 3   │
│ a   ┆ 2   │
│ b   ┆ 2   │
│ b   ┆ 1   │
└─────┴─────┘
Get the rows which contain the 4 largest values when sorting on column b and a.
>>> lf.top_k(4, by=["b", "a"]).collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ b   ┆ 3   │
│ b   ┆ 2   │
│ a   ┆ 2   │
│ c   ┆ 1   │
└─────┴─────┘
- unique(
- subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None,
- *,
- keep: UniqueKeepStrategy = 'any',
- maintain_order: bool = False,
Drop duplicate rows from this LazyFrame.
- Parameters:
- subset
Column name(s) or selector(s), to consider when identifying duplicate rows. If set to None (default), use all columns.
- keep{'first', 'last', 'any', 'none'}
Which of the duplicate rows to keep.
'any': Does not give any guarantee of which row is kept. This allows more optimizations.
'none': Don't keep duplicate rows.
'first': Keep first unique row.
'last': Keep last unique row.
- maintain_order
Keep the same order as the original DataFrame. This is more expensive to compute. Setting this to True prevents the query from running on the streaming engine.
- Returns:
- LazyFrame
LazyFrame with unique rows.
Warning
This method will fail if there is a column of type List in the DataFrame or subset.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3, 1],
...         "bar": ["a", "a", "a", "a"],
...         "ham": ["b", "b", "b", "b"],
...     }
... )
>>> lf.unique(maintain_order=True).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ a   ┆ b   │
│ 2   ┆ a   ┆ b   │
│ 3   ┆ a   ┆ b   │
└─────┴─────┴─────┘
>>> lf.unique(subset=["bar", "ham"], maintain_order=True).collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ a   ┆ b   │
└─────┴─────┴─────┘
>>> lf.unique(keep="last", maintain_order=True).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 2   ┆ a   ┆ b   │
│ 3   ┆ a   ┆ b   │
│ 1   ┆ a   ┆ b   │
└─────┴─────┴─────┘
- unnest(
- columns: ColumnNameOrSelector | Collection[ColumnNameOrSelector],
- *more_columns: ColumnNameOrSelector,
Decompose struct columns into separate columns for each of their fields.
The new columns will be inserted into the DataFrame at the location of the struct column.
- Parameters:
- columns
Name of the struct column(s) that should be unnested.
- *more_columns
Additional columns to unnest, specified as positional arguments.
Examples
>>> df = pl.LazyFrame(
...     {
...         "before": ["foo", "bar"],
...         "t_a": [1, 2],
...         "t_b": ["a", "b"],
...         "t_c": [True, None],
...         "t_d": [[1, 2], [3]],
...         "after": ["baz", "womp"],
...     }
... ).select("before", pl.struct(pl.col("^t_.$")).alias("t_struct"), "after")
>>> df.collect()
shape: (2, 3)
┌────────┬─────────────────────┬───────┐
│ before ┆ t_struct            ┆ after │
│ ---    ┆ ---                 ┆ ---   │
│ str    ┆ struct[4]           ┆ str   │
╞════════╪═════════════════════╪═══════╡
│ foo    ┆ {1,"a",true,[1, 2]} ┆ baz   │
│ bar    ┆ {2,"b",null,[3]}    ┆ womp  │
└────────┴─────────────────────┴───────┘
>>> df.unnest("t_struct").collect()
shape: (2, 6)
┌────────┬─────┬─────┬──────┬───────────┬───────┐
│ before ┆ t_a ┆ t_b ┆ t_c  ┆ t_d       ┆ after │
│ ---    ┆ --- ┆ --- ┆ ---  ┆ ---       ┆ ---   │
│ str    ┆ i64 ┆ str ┆ bool ┆ list[i64] ┆ str   │
╞════════╪═════╪═════╪══════╪═══════════╪═══════╡
│ foo    ┆ 1   ┆ a   ┆ true ┆ [1, 2]    ┆ baz   │
│ bar    ┆ 2   ┆ b   ┆ null ┆ [3]       ┆ womp  │
└────────┴─────┴─────┴──────┴───────────┴───────┘
- update(
- other: LazyFrame,
- on: str | Sequence[str] | None = None,
- how: Literal['left', 'inner', 'full'] = 'left',
- *,
- left_on: str | Sequence[str] | None = None,
- right_on: str | Sequence[str] | None = None,
- include_nulls: bool = False,
Update the values in this LazyFrame with the non-null values in other.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- other
LazyFrame that will be used to update the values
- on
Column names that will be joined on. If set to None (default), the implicit row index of each frame is used as a join key.
- how{'left', 'inner', 'full'}
'left' will keep all rows from the left table; rows may be duplicated if multiple rows in the right frame match the left row's key.
'inner' keeps only those rows where the key exists in both frames.
'full' will update existing rows where the key matches while also adding any new rows contained in the given frame.
- left_on
Join column(s) of the left DataFrame.
- right_on
Join column(s) of the right DataFrame.
- include_nulls
If True, null values from the right DataFrame will be used to update the left DataFrame.
Notes
This is syntactic sugar for a left/inner join, with an optional coalesce when include_nulls = False.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "A": [1, 2, 3, 4],
...         "B": [400, 500, 600, 700],
...     }
... )
>>> lf.collect()
shape: (4, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 400 │
│ 2   ┆ 500 │
│ 3   ┆ 600 │
│ 4   ┆ 700 │
└─────┴─────┘
>>> new_lf = pl.LazyFrame(
...     {
...         "B": [-66, None, -99],
...         "C": [5, 3, 1],
...     }
... )
Update lf values with the non-null values in new_lf, by row index:
>>> lf.update(new_lf).collect()
shape: (4, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ -66 │
│ 2   ┆ 500 │
│ 3   ┆ -99 │
│ 4   ┆ 700 │
└─────┴─────┘
Update lf values with the non-null values in new_lf, by row index, but only keeping those rows that are common to both frames:
>>> lf.update(new_lf, how="inner").collect()
shape: (3, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ -66 │
│ 2   ┆ 500 │
│ 3   ┆ -99 │
└─────┴─────┘
Update lf values with the non-null values in new_lf, using a full outer join strategy that defines explicit join columns in each frame:
>>> lf.update(new_lf, left_on=["A"], right_on=["C"], how="full").collect()
shape: (5, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ -99 │
│ 2   ┆ 500 │
│ 3   ┆ 600 │
│ 4   ┆ 700 │
│ 5   ┆ -66 │
└─────┴─────┘
Update lf values including null values in new_lf, using a full outer join strategy that defines explicit join columns in each frame:
>>> lf.update(
...     new_lf, left_on="A", right_on="C", how="full", include_nulls=True
... ).collect()
shape: (5, 2)
┌─────┬──────┐
│ A   ┆ B    │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 1   ┆ -99  │
│ 2   ┆ 500  │
│ 3   ┆ null │
│ 4   ┆ 700  │
│ 5   ┆ -66  │
└─────┴──────┘
- var(ddof: int = 1) -> Self
Aggregate the columns in the LazyFrame to their variance value.
- Parameters:
- ddof
"Delta Degrees of Freedom": the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.var().collect()
shape: (1, 2)
┌──────────┬──────┐
│ a        ┆ b    │
│ ---      ┆ ---  │
│ f64      ┆ f64  │
╞══════════╪══════╡
│ 1.666667 ┆ 0.25 │
└──────────┴──────┘
>>> lf.var(ddof=0).collect()
shape: (1, 2)
┌──────┬────────┐
│ a    ┆ b      │
│ ---  ┆ ---    │
│ f64  ┆ f64    │
╞══════╪════════╡
│ 1.25 ┆ 0.1875 │
└──────┴────────┘
- property width: int
Get the number of columns.
- Returns:
- int
Warning
Determining the width of a LazyFrame requires resolving its schema. Resolving the schema of a LazyFrame can be an expensive operation. Avoid accessing this property repeatedly if possible.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [4, 5, 6],
...     }
... )
>>> lf.width
2
- with_columns(
- *exprs: IntoExpr | Iterable[IntoExpr],
- **named_exprs: IntoExpr,
Add columns to this LazyFrame.
Added columns will replace existing columns with the same name.
- Parameters:
- *exprs
Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- **named_exprs
Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.
- Returns:
- LazyFrame
A new LazyFrame with the columns added.
Notes
Creating a new LazyFrame using this method does not create a new copy of existing data.
Examples
Pass an expression to add it as a new column.
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [0.5, 4, 10, 13],
...         "c": [True, True, False, True],
...     }
... )
>>> lf.with_columns((pl.col("a") ** 2).alias("a^2")).collect()
shape: (4, 4)
┌─────┬──────┬───────┬─────┐
│ a   ┆ b    ┆ c     ┆ a^2 │
│ --- ┆ ---  ┆ ---   ┆ --- │
│ i64 ┆ f64  ┆ bool  ┆ i64 │
╞═════╪══════╪═══════╪═════╡
│ 1   ┆ 0.5  ┆ true  ┆ 1   │
│ 2   ┆ 4.0  ┆ true  ┆ 4   │
│ 3   ┆ 10.0 ┆ false ┆ 9   │
│ 4   ┆ 13.0 ┆ true  ┆ 16  │
└─────┴──────┴───────┴─────┘
Added columns will replace existing columns with the same name.
>>> lf.with_columns(pl.col("a").cast(pl.Float64)).collect()
shape: (4, 3)
┌─────┬──────┬───────┐
│ a   ┆ b    ┆ c     │
│ --- ┆ ---  ┆ ---   │
│ f64 ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╡
│ 1.0 ┆ 0.5  ┆ true  │
│ 2.0 ┆ 4.0  ┆ true  │
│ 3.0 ┆ 10.0 ┆ false │
│ 4.0 ┆ 13.0 ┆ true  │
└─────┴──────┴───────┘
Multiple columns can be added by passing a list of expressions.
>>> lf.with_columns(
...     [
...         (pl.col("a") ** 2).alias("a^2"),
...         (pl.col("b") / 2).alias("b/2"),
...         (pl.col("c").not_()).alias("not c"),
...     ]
... ).collect()
shape: (4, 6)
┌─────┬──────┬───────┬─────┬──────┬───────┐
│ a   ┆ b    ┆ c     ┆ a^2 ┆ b/2  ┆ not c │
│ --- ┆ ---  ┆ ---   ┆ --- ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  ┆ i64 ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╪═════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 1   ┆ 0.25 ┆ false │
│ 2   ┆ 4.0  ┆ true  ┆ 4   ┆ 2.0  ┆ false │
│ 3   ┆ 10.0 ┆ false ┆ 9   ┆ 5.0  ┆ true  │
│ 4   ┆ 13.0 ┆ true  ┆ 16  ┆ 6.5  ┆ false │
└─────┴──────┴───────┴─────┴──────┴───────┘
Multiple columns also can be added using positional arguments instead of a list.
>>> lf.with_columns(
...     (pl.col("a") ** 2).alias("a^2"),
...     (pl.col("b") / 2).alias("b/2"),
...     (pl.col("c").not_()).alias("not c"),
... ).collect()
shape: (4, 6)
┌─────┬──────┬───────┬─────┬──────┬───────┐
│ a   ┆ b    ┆ c     ┆ a^2 ┆ b/2  ┆ not c │
│ --- ┆ ---  ┆ ---   ┆ --- ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  ┆ i64 ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╪═════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 1   ┆ 0.25 ┆ false │
│ 2   ┆ 4.0  ┆ true  ┆ 4   ┆ 2.0  ┆ false │
│ 3   ┆ 10.0 ┆ false ┆ 9   ┆ 5.0  ┆ true  │
│ 4   ┆ 13.0 ┆ true  ┆ 16  ┆ 6.5  ┆ false │
└─────┴──────┴───────┴─────┴──────┴───────┘
Use keyword arguments to easily name your expression inputs.
>>> lf.with_columns(
...     ab=pl.col("a") * pl.col("b"),
...     not_c=pl.col("c").not_(),
... ).collect()
shape: (4, 5)
┌─────┬──────┬───────┬──────┬───────┐
│ a   ┆ b    ┆ c     ┆ ab   ┆ not_c │
│ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 0.5  ┆ false │
│ 2   ┆ 4.0  ┆ true  ┆ 8.0  ┆ false │
│ 3   ┆ 10.0 ┆ false ┆ 30.0 ┆ true  │
│ 4   ┆ 13.0 ┆ true  ┆ 52.0 ┆ false │
└─────┴──────┴───────┴──────┴───────┘
Expressions with multiple outputs can be automatically instantiated as Structs by enabling the setting Config.set_auto_structify(True):
>>> with pl.Config(auto_structify=True):
...     lf.drop("c").with_columns(
...         diffs=pl.col(["a", "b"]).diff().name.suffix("_diff"),
...     ).collect()
shape: (4, 3)
┌─────┬──────┬─────────────┐
│ a   ┆ b    ┆ diffs       │
│ --- ┆ ---  ┆ ---         │
│ i64 ┆ f64  ┆ struct[2]   │
╞═════╪══════╪═════════════╡
│ 1   ┆ 0.5  ┆ {null,null} │
│ 2   ┆ 4.0  ┆ {1,3.5}     │
│ 3   ┆ 10.0 ┆ {1,6.0}     │
│ 4   ┆ 13.0 ┆ {1,3.0}     │
└─────┴──────┴─────────────┘
- with_columns_seq(
- *exprs: IntoExpr | Iterable[IntoExpr],
- **named_exprs: IntoExpr,
Add columns to this LazyFrame.
Added columns will replace existing columns with the same name.
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.
- Parameters:
- *exprs
Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- **named_exprs
Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.
- Returns:
- LazyFrame
A new LazyFrame with the columns added.
- with_context(other: Self | list[Self]) -> Self
Add an external context to the computation graph.
This allows expressions to also access columns from DataFrames that are not part of this one.
- Parameters:
- other
Lazy DataFrame to join with.
Examples
>>> lf = pl.LazyFrame({"a": [1, 2, 3], "b": ["a", "c", None]})
>>> lf_other = pl.LazyFrame({"c": ["foo", "ham"]})
>>> lf.with_context(lf_other).select(
...     pl.col("b") + pl.col("c").first()
... ).collect()
shape: (3, 1)
┌──────┐
│ b    │
│ ---  │
│ str  │
╞══════╡
│ afoo │
│ cfoo │
│ null │
└──────┘
Fill nulls with the median from another DataFrame:
>>> train_lf = pl.LazyFrame(
...     {"feature_0": [-1.0, 0, 1], "feature_1": [-1.0, 0, 1]}
... )
>>> test_lf = pl.LazyFrame(
...     {"feature_0": [-1.0, None, 1], "feature_1": [-1.0, 0, 1]}
... )
>>> test_lf.with_context(
...     train_lf.select(pl.all().name.suffix("_train"))
... ).select(
...     pl.col("feature_0").fill_null(pl.col("feature_0_train").median())
... ).collect()
shape: (3, 1)
┌───────────┐
│ feature_0 │
│ ---       │
│ f64       │
╞═══════════╡
│ -1.0      │
│ 0.0       │
│ 1.0       │
└───────────┘
- with_row_count(name: str = 'row_nr', offset: int = 0) -> Self
Add a column at index 0 that counts the rows.
Deprecated since version 0.20.4: Use with_row_index() instead. Note that the default column name has changed from 'row_nr' to 'index'.
- Parameters:
- name
Name of the column to add.
- offset
Start the row count at this offset.
Warning
This can have a negative effect on query performance. This may, for instance, block predicate pushdown optimization.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> lf.with_row_count().collect()
shape: (3, 3)
┌────────┬─────┬─────┐
│ row_nr ┆ a   ┆ b   │
│ ---    ┆ --- ┆ --- │
│ u32    ┆ i64 ┆ i64 │
╞════════╪═════╪═════╡
│ 0      ┆ 1   ┆ 2   │
│ 1      ┆ 3   ┆ 4   │
│ 2      ┆ 5   ┆ 6   │
└────────┴─────┴─────┘
- with_row_index(name: str = 'index', offset: int = 0) -> Self
Add a row index as the first column in the LazyFrame.
- Parameters:
- name
Name of the index column.
- offset
Start the index at this offset. Cannot be negative.
Warning
Using this function can have a negative effect on query performance. This may, for instance, block predicate pushdown optimization.
Notes
The resulting column does not have any special properties. It is a regular column of type UInt32 (or UInt64 in polars-u64-idx).
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> lf.with_row_index().collect()
shape: (3, 3)
┌───────┬─────┬─────┐
│ index ┆ a   ┆ b   │
│ ---   ┆ --- ┆ --- │
│ u32   ┆ i64 ┆ i64 │
╞═══════╪═════╪═════╡
│ 0     ┆ 1   ┆ 2   │
│ 1     ┆ 3   ┆ 4   │
│ 2     ┆ 5   ┆ 6   │
└───────┴─────┴─────┘
>>> lf.with_row_index("id", offset=1000).collect()
shape: (3, 3)
┌──────┬─────┬─────┐
│ id   ┆ a   ┆ b   │
│ ---  ┆ --- ┆ --- │
│ u32  ┆ i64 ┆ i64 │
╞══════╪═════╪═════╡
│ 1000 ┆ 1   ┆ 2   │
│ 1001 ┆ 3   ┆ 4   │
│ 1002 ┆ 5   ┆ 6   │
└──────┴─────┴─────┘
An index column can also be created using the expressions int_range() and len().
>>> lf.select(
...     pl.int_range(pl.len(), dtype=pl.UInt32).alias("index"),
...     pl.all(),
... ).collect()
shape: (3, 3)
┌───────┬─────┬─────┐
│ index ┆ a   ┆ b   │
│ ---   ┆ --- ┆ --- │
│ u32   ┆ i64 ┆ i64 │
╞═══════╪═════╪═════╡
│ 0     ┆ 1   ┆ 2   │
│ 1     ┆ 3   ┆ 4   │
│ 2     ┆ 5   ┆ 6   │
└───────┴─────┴─────┘