Expressions
This page gives an overview of all public Polars expressions.
- class polars.Expr[source]
Expressions that can be used in various contexts.
Methods:
Compute absolute values.
Method equivalent of addition operator expr + other.
Get the group indexes of the group by operation.
Rename the expression.
Return whether all values in the column are True.
Method equivalent of bitwise "and" operator expr & other & ....
Return whether any of the values in the column are True.
Append expressions.
Approximate count of unique values.
Compute the element-wise value for the inverse cosine.
Compute the element-wise value for the inverse hyperbolic cosine.
Compute the element-wise value for the inverse sine.
Compute the element-wise value for the inverse hyperbolic sine.
Compute the element-wise value for the inverse tangent.
Compute the element-wise value for the inverse hyperbolic tangent.
Get the index of the maximal value.
Get the index of the minimal value.
Get the index values that would sort this column.
Return indices where expression evaluates True.
Get index of first unique value.
Fill missing values with the next non-null value.
Perform an aggregation of bitwise ANDs.
Evaluate the number of set bits.
Evaluate the number of unset bits.
Evaluate the number of most-significant set bits before seeing an unset bit.
Evaluate the number of most-significant unset bits before seeing a set bit.
Perform an aggregation of bitwise ORs.
Evaluate the number of least-significant set bits before seeing an unset bit.
Evaluate the number of least-significant unset bits before seeing a set bit.
Perform an aggregation of bitwise XORs.
Return the k smallest elements.
Return the elements corresponding to the k smallest elements of the by column(s).
Cast between data types.
Compute the cube root of the elements.
Rounds up to the nearest integer value.
Set values outside the given boundaries to the boundary value.
Compute the element-wise value for the cosine.
Compute the element-wise value for the hyperbolic cosine.
Compute the element-wise value for the cotangent.
Return the number of non-null elements in the column.
Return the cumulative count of the non-null values in the column.
Get an array with the cumulative max computed at every element.
Get an array with the cumulative min computed at every element.
Get an array with the cumulative product computed at every element.
Get an array with the cumulative sum computed at every element.
Run an expression over a sliding window that increases 1 slot every iteration.
Bin continuous values into discrete categories.
Convert from radians to degrees.
Read a serialized expression from a file.
Calculate the first discrete difference between shifted items.
Compute the dot/inner product between two Expressions.
Drop all floating point NaN values.
Drop all null values.
Computes the entropy.
Method equivalent of equality operator expr == other.
Method equivalent of equality operator expr == other where None == None.
Compute exponentially-weighted moving average.
Compute time-based exponentially weighted moving average.
Compute exponentially-weighted moving standard deviation.
Compute exponentially-weighted moving variance.
Exclude columns from a multi-column expression.
Compute the exponential, element-wise.
Explode a list expression.
Extremely fast method for extending the Series with 'n' copies of a value.
Fill floating point NaN value with a fill value.
Fill null values using the specified value or strategy.
Filter the expression based on one or more predicate expressions.
Get the first value.
Flatten a list or string column.
Rounds down to the nearest integer value.
Method equivalent of integer division operator expr // other.
Fill missing values with the last non-null value.
Read an expression from a JSON encoded string to construct an Expression.
Take values by index.
Take every nth value in the Series and return as a new Series.
Method equivalent of "greater than or equal" operator expr >= other.
Return a single value by index.
Method equivalent of "greater than" operator expr > other.
Check whether the expression contains one or more null values.
Hash the elements in the selection.
Get the first n rows.
Bin values into buckets and count their occurrences.
Aggregate values into a list.
Get the index of the first occurrence of a value, or None if it's not found.
Print the value that this expression evaluates to and pass on the value.
Interpolate intermediate values.
Fill null values using interpolation based on another column.
Check if this expression is between the given lower and upper bounds.
Check if this expression is close, i.e. almost equal, to the other expression.
Return a boolean mask indicating duplicated values.
Return whether the column is empty.
Returns a boolean Series indicating which values are finite.
Return a boolean mask indicating the first occurrence of each distinct value.
Check if elements of this expression are present in the other Series.
Returns a boolean Series indicating which values are infinite.
Return a boolean mask indicating the last occurrence of each distinct value.
Returns a boolean Series indicating which values are NaN.
Returns a boolean Series indicating which values are not NaN.
Returns a boolean Series indicating which values are not null.
Returns a boolean Series indicating which values are null.
Get mask of unique values.
Get the single value.
Compute the kurtosis (Fisher or Pearson) of a dataset.
Get the last value.
Method equivalent of "less than or equal" operator expr <= other.
Return the number of elements in the column.
Get the first n rows (alias for Expr.head()).
Compute the logarithm to a given base.
Compute the base 10 logarithm of the input array, element-wise.
Compute the natural logarithm of each element plus one.
Calculate the lower bound.
Method equivalent of "less than" operator expr < other.
Apply a custom python function to a whole Series or sequence of Series.
Map a custom/user-defined function (UDF) to each element of a column.
Get maximum value.
Get maximum value, ordered by another expression.
Get mean value.
Get median value using linear interpolation.
Get minimum value.
Get minimum value, ordered by another expression.
Method equivalent of modulus operator expr % other.
Compute the most occurring value(s).
Method equivalent of multiplication operator expr * other.
Count unique values.
Get maximum value, but propagate/poison encountered NaN values.
Get minimum value, but propagate/poison encountered NaN values.
Method equivalent of inequality operator expr != other.
Method equivalent of inequality operator expr != other where None == None.
Method equivalent of unary minus operator -expr.
Method equivalent of bitwise "not" operator ~expr.
Count null values.
Method equivalent of bitwise "or" operator expr | other | ....
Compute expressions over the given groups.
Computes percentage change between values.
Get a boolean mask of the local maximum peaks.
Get a boolean mask of the local minimum peaks.
Offers a structured way to apply a sequence of user-defined functions (UDFs).
Method equivalent of exponentiation operator expr ** exponent.
Compute the product of an expression.
Bin continuous values into discrete categories based on their quantiles.
Get quantile value.
Convert from degrees to radians.
Assign ranks to data, dealing with ties appropriately.
Create a single chunk of memory for this Series.
register_plugin: Register a plugin function.
Reinterpret the underlying bits as a signed/unsigned integer or float.
Repeat the elements in this Series as specified in the given expression.
Replace the given values by different values of the same data type.
Replace all values by different values.
Reshape this Expr to a flat column or an Array column.
Reverse the selection.
Compress the column data using run-length encoding.
Get a distinct integer ID for each run of identical values.
Create rolling groups based on a temporal or integer column.
Compute a rolling kurtosis.
Compute a custom rolling window function.
Apply a rolling max (moving max) over the values in this array.
Apply a rolling max based on another column.
Apply a rolling mean (moving mean) over the values in this array.
Apply a rolling mean based on another column.
Compute a rolling median.
Compute a rolling median based on another column.
Apply a rolling min (moving min) over the values in this array.
Apply a rolling min based on another column.
Compute a rolling quantile.
Compute a rolling quantile based on another column.
Compute a rolling rank.
Compute a rolling rank based on another column.
Compute a rolling skew.
Compute a rolling standard deviation.
Compute a rolling standard deviation based on another column.
Apply a rolling sum (moving sum) over the values in this array.
Apply a rolling sum based on another column.
Compute a rolling variance.
Compute a rolling variance based on another column.
Round underlying floating point data by decimals digits.
Round to a number of significant figures.
Sample from this expression.
Find indices where elements should be inserted to maintain order.
Flags the expression as 'sorted'.
Shift values by the given number of indices.
Shrink numeric columns to the minimal required datatype.
Shuffle the contents of this expression.
Compute the element-wise sign function on numeric types.
Compute the element-wise value for the sine.
Compute the element-wise value for the hyperbolic sine.
Compute the sample skewness of a data set.
Get a slice of this expression.
Sort this column.
Sort this column by the ordering of other columns.
Compute the square root of the elements.
Get standard deviation.
Method equivalent of subtraction operator expr - other.
Get sum value.
Get the last n rows.
Compute the element-wise value for the tangent.
Compute the element-wise value for the hyperbolic tangent.
Cast to physical representation of the logical dtype.
Return the k largest elements.
Return the elements corresponding to the k largest elements of the by column(s).
Method equivalent of float division operator expr / other.
Truncate numeric data toward zero to decimals number of decimal places.
Get unique values of this expression.
Return a count of the unique values in the order of appearance.
Calculate the upper bound.
Count the occurrence of unique values.
Get variance.
Filter a single column.
Method equivalent of bitwise exclusive-or operator expr ^ other.
- abs() Expr[source]
Compute absolute values.
Same as abs(expr).
Examples
>>> df = pl.DataFrame(
...     {
...         "A": [-1.0, 0.0, 1.0, 2.0],
...     }
... )
>>> df.select(pl.col("A").abs())
shape: (4, 1)
┌─────┐
│ A   │
│ --- │
│ f64 │
╞═════╡
│ 1.0 │
│ 0.0 │
│ 1.0 │
│ 2.0 │
└─────┘
- add(other: Any) Expr[source]
Method equivalent of addition operator expr + other.
- Parameters:
- other
numeric or string value; accepts expression input.
Examples
>>> df = pl.DataFrame({"x": [1, 2, 3, 4, 5]})
>>> df.with_columns(
...     pl.col("x").add(2).alias("x+int"),
...     pl.col("x").add(pl.col("x").cum_prod()).alias("x+expr"),
... )
shape: (5, 3)
┌─────┬───────┬────────┐
│ x   ┆ x+int ┆ x+expr │
│ --- ┆ ---   ┆ ---    │
│ i64 ┆ i64   ┆ i64    │
╞═════╪═══════╪════════╡
│ 1   ┆ 3     ┆ 2      │
│ 2   ┆ 4     ┆ 4      │
│ 3   ┆ 5     ┆ 9      │
│ 4   ┆ 6     ┆ 28     │
│ 5   ┆ 7     ┆ 125    │
└─────┴───────┴────────┘
>>> df = pl.DataFrame(
...     {"x": ["a", "d", "g"], "y": ["b", "e", "h"], "z": ["c", "f", "i"]}
... )
>>> df.with_columns(pl.col("x").add(pl.col("y")).add(pl.col("z")).alias("xyz"))
shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ x   ┆ y   ┆ z   ┆ xyz │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞═════╪═════╪═════╪═════╡
│ a   ┆ b   ┆ c   ┆ abc │
│ d   ┆ e   ┆ f   ┆ def │
│ g   ┆ h   ┆ i   ┆ ghi │
└─────┴─────┴─────┴─────┘
- agg_groups() Expr[source]
Get the group indexes of the group by operation.
Deprecated since version 1.35: use df.with_row_index().group_by(...).agg(pl.col('index')) instead. This method will be removed in Polars 2.0.
Should be used in aggregation context only.
Examples
>>> import warnings
>>> warnings.filterwarnings("ignore", category=DeprecationWarning)
>>> df = pl.DataFrame(
...     {
...         "group": [
...             "one",
...             "one",
...             "one",
...             "two",
...             "two",
...             "two",
...         ],
...         "value": [94, 95, 96, 97, 97, 99],
...     }
... )
>>> df.group_by("group", maintain_order=True).agg(pl.col("value").agg_groups())
shape: (2, 2)
┌───────┬───────────┐
│ group ┆ value     │
│ ---   ┆ ---       │
│ str   ┆ list[u32] │
╞═══════╪═══════════╡
│ one   ┆ [0, 1, 2] │
│ two   ┆ [3, 4, 5] │
└───────┴───────────┘
New recommended approach:
>>> (
...     df.with_row_index()
...     .group_by("group", maintain_order=True)
...     .agg(pl.col("index"))
... )
shape: (2, 2)
┌───────┬───────────┐
│ group ┆ index     │
│ ---   ┆ ---       │
│ str   ┆ list[u32] │
╞═══════╪═══════════╡
│ one   ┆ [0, 1, 2] │
│ two   ┆ [3, 4, 5] │
└───────┴───────────┘
- alias(name: str_) Expr[source]
Rename the expression.
- Parameters:
- name
The new name.
Examples
Rename an expression to avoid overwriting an existing column.
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3],
...         "b": ["x", "y", "z"],
...     }
... )
>>> df.with_columns(
...     pl.col("a") + 10,
...     pl.col("b").str.to_uppercase().alias("c"),
... )
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 11  ┆ x   ┆ X   │
│ 12  ┆ y   ┆ Y   │
│ 13  ┆ z   ┆ Z   │
└─────┴─────┴─────┘
Overwrite the default name of literal columns to prevent errors due to duplicate column names.
>>> df.with_columns(
...     pl.lit(True).alias("c"),
...     pl.lit(4.0).alias("d"),
... )
shape: (3, 4)
┌─────┬─────┬──────┬─────┐
│ a   ┆ b   ┆ c    ┆ d   │
│ --- ┆ --- ┆ ---  ┆ --- │
│ i64 ┆ str ┆ bool ┆ f64 │
╞═════╪═════╪══════╪═════╡
│ 1   ┆ x   ┆ true ┆ 4.0 │
│ 2   ┆ y   ┆ true ┆ 4.0 │
│ 3   ┆ z   ┆ true ┆ 4.0 │
└─────┴─────┴──────┴─────┘
- all(*, ignore_nulls: bool = True) Expr[source]
Return whether all values in the column are True.
Only works on columns of data type Boolean.
Note
This method is not to be confused with the function polars.all(), which can be used to select all columns.
- Parameters:
- ignore_nulls
If set to True (default), null values are ignored. If there are no non-null values, the output is True.
If set to False, Kleene logic is used to deal with nulls: if the column contains any null values and no False values, the output is null.
- Returns:
- Expr
Expression of data type Boolean.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [True, True],
...         "b": [False, True],
...         "c": [None, True],
...     }
... )
>>> df.select(pl.col("*").all())
shape: (1, 3)
┌──────┬───────┬──────┐
│ a    ┆ b     ┆ c    │
│ ---  ┆ ---   ┆ ---  │
│ bool ┆ bool  ┆ bool │
╞══════╪═══════╪══════╡
│ true ┆ false ┆ true │
└──────┴───────┴──────┘
Enable Kleene logic by setting ignore_nulls=False.
>>> df.select(pl.col("*").all(ignore_nulls=False))
shape: (1, 3)
┌──────┬───────┬──────┐
│ a    ┆ b     ┆ c    │
│ ---  ┆ ---   ┆ ---  │
│ bool ┆ bool  ┆ bool │
╞══════╪═══════╪══════╡
│ true ┆ false ┆ null │
└──────┴───────┴──────┘
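The null handling described for ignore_nulls can be mirrored in plain Python. The sketch below only models the documented semantics and is not how Polars implements them; `kleene_all` is a hypothetical helper name, and `None` stands in for null.

```python
def kleene_all(values, *, ignore_nulls=True):
    """Plain-Python model of the documented `all` semantics (None plays null)."""
    non_null = [v for v in values if v is not None]
    if ignore_nulls:
        # Nulls are skipped; an empty or all-null input yields True.
        return all(non_null)
    if not all(non_null):
        # A single False decides the result, whatever the nulls might be.
        return False
    # No False values: the answer is unknown (null) if any null is present.
    return None if len(non_null) < len(values) else True


print(kleene_all([None, True]))                      # True
print(kleene_all([None, True], ignore_nulls=False))  # None
```

The two printed values correspond to column c in the two examples above.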
- and_(*others: Any) Expr[source]
Method equivalent of bitwise "and" operator expr & other & ....
This has the effect of combining logical boolean expressions, but operates bitwise on integers.
- Parameters:
- *others
One or more integer or boolean expressions to evaluate/combine.
Examples
>>> df = pl.DataFrame(
...     data={
...         "x": [5, 6, 7, 4, 8],
...         "y": [1.5, 2.5, 1.0, 4.0, -5.75],
...         "z": [-9, 2, -1, 4, 8],
...     }
... )
Combine logical "and" conditions:
>>> df.select(
...     (pl.col("x") >= pl.col("z"))
...     .and_(
...         pl.col("y") >= pl.col("z"),
...         pl.col("y") == pl.col("y"),
...         pl.col("z") <= pl.col("x"),
...         pl.col("y") != pl.col("x"),
...     )
...     .alias("all")
... )
shape: (5, 1)
┌───────┐
│ all   │
│ ---   │
│ bool  │
╞═══════╡
│ true  │
│ true  │
│ true  │
│ false │
│ false │
└───────┘
Bitwise "and" operation on integer columns:
>>> df.select("x", "z", x_and_z=pl.col("x").and_(pl.col("z")))
shape: (5, 3)
┌─────┬─────┬─────────┐
│ x   ┆ z   ┆ x_and_z │
│ --- ┆ --- ┆ ---     │
│ i64 ┆ i64 ┆ i64     │
╞═════╪═════╪═════════╡
│ 5   ┆ -9  ┆ 5       │
│ 6   ┆ 2   ┆ 2       │
│ 7   ┆ -1  ┆ 7       │
│ 4   ┆ 4   ┆ 4       │
│ 8   ┆ 8   ┆ 8       │
└─────┴─────┴─────────┘
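The integer result above may look surprising for negative inputs. As a rough plain-Python illustration (not Polars code), Python's `&` on integers follows the same two's-complement rule, so the x_and_z column can be reproduced directly; the variable names below are chosen for this sketch only.

```python
x = [5, 6, 7, 4, 8]
z = [-9, 2, -1, 4, 8]

# Element-wise bitwise AND; negative ints act as two's-complement bit patterns,
# e.g. -9 is ...11110111, so 5 & -9 keeps every bit of 5.
x_and_z = [a & b for a, b in zip(x, z)]
print(x_and_z)  # [5, 2, 7, 4, 8]

# On booleans the same operator acts as a logical "and".
flags = [p & q for p, q in zip([True, True, False], [True, False, False])]
print(flags)  # [True, False, False]
```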
- any(*, ignore_nulls: bool = True) Expr[source]
Return whether any of the values in the column are True.
Only works on columns of data type Boolean.
- Parameters:
- ignore_nulls
If set to True (default), null values are ignored. If there are no non-null values, the output is False.
If set to False, Kleene logic is used to deal with nulls: if the column contains any null values and no True values, the output is null.
- Returns:
- Expr
Expression of data type Boolean.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [True, False],
...         "b": [False, False],
...         "c": [None, False],
...     }
... )
>>> df.select(pl.col("*").any())
shape: (1, 3)
┌──────┬───────┬───────┐
│ a    ┆ b     ┆ c     │
│ ---  ┆ ---   ┆ ---   │
│ bool ┆ bool  ┆ bool  │
╞══════╪═══════╪═══════╡
│ true ┆ false ┆ false │
└──────┴───────┴───────┘
Enable Kleene logic by setting ignore_nulls=False.
>>> df.select(pl.col("*").any(ignore_nulls=False))
shape: (1, 3)
┌──────┬───────┬──────┐
│ a    ┆ b     ┆ c    │
│ ---  ┆ ---   ┆ ---  │
│ bool ┆ bool  ┆ bool │
╞══════╪═══════╪══════╡
│ true ┆ false ┆ null │
└──────┴───────┴──────┘
- append(other: IntoExpr, *, upcast: bool = True) Expr[source]
Append expressions.
This is done by adding the chunks of other to this Series.
- Parameters:
- other
Expression to append.
- upcast
Cast both Series to the same supertype.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [8, 9, 10],
...         "b": [None, 4, 4],
...     }
... )
>>> df.select(pl.all().head(1).append(pl.all().tail(1)))
shape: (2, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 8   ┆ null │
│ 10  ┆ 4    │
└─────┴──────┘
- approx_n_unique() Expr[source]
Approximate count of unique values.
This is done using the HyperLogLog++ algorithm for cardinality estimation.
Examples
>>> df = pl.DataFrame({"n": [1, 1, 2]})
>>> df.select(pl.col("n").approx_n_unique())
shape: (1, 1)
┌─────┐
│ n   │
│ --- │
│ u32 │
╞═════╡
│ 2   │
└─────┘
>>> df = pl.DataFrame({"n": range(1000)})
>>> df.select(
...     exact=pl.col("n").n_unique(),
...     approx=pl.col("n").approx_n_unique(),
... )
shape: (1, 2)
┌───────┬────────┐
│ exact ┆ approx │
│ ---   ┆ ---    │
│ u32   ┆ u32    │
╞═══════╪════════╡
│ 1000  ┆ 1005   │
└───────┴────────┘
- arccos() Expr[source]
Compute the element-wise value for the inverse cosine.
- Returns:
- Expr
Expression of data type Float64.
Examples
>>> df = pl.DataFrame({"a": [0.0]})
>>> df.select(pl.col("a").arccos())
shape: (1, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ 1.570796 │
└──────────┘
- arccosh() Expr[source]
Compute the element-wise value for the inverse hyperbolic cosine.
- Returns:
- Expr
Expression of data type Float64.
Examples
>>> df = pl.DataFrame({"a": [1.0]})
>>> df.select(pl.col("a").arccosh())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 0.0 │
└─────┘
- arcsin() Expr[source]
Compute the element-wise value for the inverse sine.
- Returns:
- Expr
Expression of data type Float64.
Examples
>>> df = pl.DataFrame({"a": [1.0]})
>>> df.select(pl.col("a").arcsin())
shape: (1, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ 1.570796 │
└──────────┘
- arcsinh() Expr[source]
Compute the element-wise value for the inverse hyperbolic sine.
- Returns:
- Expr
Expression of data type Float64.
Examples
>>> df = pl.DataFrame({"a": [1.0]})
>>> df.select(pl.col("a").arcsinh())
shape: (1, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ 0.881374 │
└──────────┘
- arctan() Expr[source]
Compute the element-wise value for the inverse tangent.
- Returns:
- Expr
Expression of data type Float64.
Examples
>>> df = pl.DataFrame({"a": [1.0]})
>>> df.select(pl.col("a").arctan())
shape: (1, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ 0.785398 │
└──────────┘
- arctanh() Expr[source]
Compute the element-wise value for the inverse hyperbolic tangent.
- Returns:
- Expr
Expression of data type Float64.
Examples
>>> df = pl.DataFrame({"a": [1.0]})
>>> df.select(pl.col("a").arctanh())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ inf │
└─────┘
- arg_max() Expr[source]
Get the index of the maximal value.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [20, 10, 30],
...     }
... )
>>> df.select(pl.col("a").arg_max())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ u32 │
╞═════╡
│ 2   │
└─────┘
- arg_min() Expr[source]
Get the index of the minimal value.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [20, 10, 30],
...     }
... )
>>> df.select(pl.col("a").arg_min())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ u32 │
╞═════╡
│ 1   │
└─────┘
- arg_sort( ) Expr[source]
Get the index values that would sort this column.
- Parameters:
- descending
Sort in descending order.
- nulls_last
Place null values last instead of first.
- Returns:
- Expr
Expression of data type UInt32.
See also
Expr.gather: Take values by index.
Expr.rank: Get the rank of each row.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [20, 10, 30],
...         "b": [1, 2, 3],
...     }
... )
>>> df.select(pl.col("a").arg_sort())
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ u32 │
╞═════╡
│ 1   │
│ 0   │
│ 2   │
└─────┘
Use gather to apply the arg sort to other columns.
>>> df.select(pl.col("b").gather(pl.col("a").arg_sort()))
shape: (3, 1)
┌─────┐
│ b   │
│ --- │
│ i64 │
╞═════╡
│ 2   │
│ 1   │
│ 3   │
└─────┘
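The arg-sort-then-gather pattern can be pictured with plain Python lists. This is only a conceptual sketch of what the two expressions compute, not Polars' implementation; the variable names are illustrative.

```python
a = [20, 10, 30]
b = [1, 2, 3]

# Indices that would sort `a` in ascending order (sorted() is stable).
order = sorted(range(len(a)), key=a.__getitem__)
print(order)  # [1, 0, 2]

# "Gather" the values of `b` at those indices, i.e. reorder b by a.
b_by_a = [b[i] for i in order]
print(b_by_a)  # [2, 1, 3]
```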
- arg_true() Expr[source]
Return indices where expression evaluates True.
Warning
Modifies number of rows returned, so will fail in combination with other expressions. Use as only expression in select / with_columns.
See also
Series.arg_true: Return indices where Series is True
polars.arg_where
Examples
>>> df = pl.DataFrame({"a": [1, 1, 2, 1]})
>>> df.select((pl.col("a") == 1).arg_true())
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ u32 │
╞═════╡
│ 0   │
│ 1   │
│ 3   │
└─────┘
- arg_unique() Expr[source]
Get index of first unique value.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [8, 9, 10],
...         "b": [None, 4, 4],
...     }
... )
>>> df.select(pl.col("a").arg_unique())
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ u32 │
╞═════╡
│ 0   │
│ 1   │
│ 2   │
└─────┘
>>> df.select(pl.col("b").arg_unique())
shape: (2, 1)
┌─────┐
│ b   │
│ --- │
│ u32 │
╞═════╡
│ 0   │
│ 1   │
└─────┘
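Read together, the two results above suggest that arg_unique returns the index of the first occurrence of each distinct value, with null counted as a value. A minimal plain-Python sketch under that reading (the function below is a stand-in for illustration, not the Polars implementation):

```python
def arg_unique(values):
    """Return the index of the first occurrence of each distinct value."""
    seen = set()
    indices = []
    for i, v in enumerate(values):
        if v not in seen:  # None is treated like any other value here
            seen.add(v)
            indices.append(i)
    return indices


print(arg_unique([8, 9, 10]))    # [0, 1, 2]
print(arg_unique([None, 4, 4]))  # [0, 1]
```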
- backward_fill(limit: int | None = None) Expr[source]
Fill missing values with the next non-null value.
This is an alias of .fill_null(strategy="backward").
- Parameters:
- limit
The number of consecutive null values to backward fill.
- bitwise_and() Expr[source]
Perform an aggregation of bitwise ANDs.
Examples
>>> df = pl.DataFrame({"n": [-1, 0, 1]})
>>> df.select(pl.col("n").bitwise_and())
shape: (1, 1)
┌─────┐
│ n   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
└─────┘
>>> df = pl.DataFrame(
...     {"grouper": ["a", "a", "a", "b", "b"], "n": [-1, 0, 1, -1, 1]}
... )
>>> df.group_by("grouper", maintain_order=True).agg(pl.col("n").bitwise_and())
shape: (2, 2)
┌─────────┬─────┐
│ grouper ┆ n   │
│ ---     ┆ --- │
│ str     ┆ i64 │
╞═════════╪═════╡
│ a       ┆ 0   │
│ b       ┆ 1   │
└─────────┴─────┘
- bitwise_count_ones() Expr[source]
Evaluate the number of set bits.
- bitwise_count_zeros() Expr[source]
Evaluate the number of unset bits.
- bitwise_leading_ones() Expr[source]
Evaluate the number of most-significant set bits before seeing an unset bit.
- bitwise_leading_zeros() Expr[source]
Evaluate the number of most-significant unset bits before seeing a set bit.
- bitwise_or() Expr[source]
Perform an aggregation of bitwise ORs.
Examples
>>> df = pl.DataFrame({"n": [-1, 0, 1]})
>>> df.select(pl.col("n").bitwise_or())
shape: (1, 1)
┌─────┐
│ n   │
│ --- │
│ i64 │
╞═════╡
│ -1  │
└─────┘
>>> df = pl.DataFrame(
...     {"grouper": ["a", "a", "a", "b", "b"], "n": [-1, 0, 1, -1, 1]}
... )
>>> df.group_by("grouper", maintain_order=True).agg(pl.col("n").bitwise_or())
shape: (2, 2)
┌─────────┬─────┐
│ grouper ┆ n   │
│ ---     ┆ --- │
│ str     ┆ i64 │
╞═════════╪═════╡
│ a       ┆ -1  │
│ b       ┆ -1  │
└─────────┴─────┘
- bitwise_trailing_ones() Expr[source]
Evaluate the number of least-significant set bits before seeing an unset bit.
- bitwise_trailing_zeros() Expr[source]
Evaluate the number of least-significant unset bits before seeing a set bit.
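The bit-counting methods (count, leading, trailing) have no examples in this section, so here is a rough plain-Python model of what each description says. It assumes a fixed 8-bit width purely for readability; the real width would follow the column's integer dtype, and the function names are hypothetical stand-ins, not Polars calls.

```python
def count_ones(x, width=8):
    mask = (1 << width) - 1          # keep only the low `width` bits
    return bin(x & mask).count("1")

def count_zeros(x, width=8):
    return width - count_ones(x, width)

def leading_zeros(x, width=8):
    mask = (1 << width) - 1
    return width - (x & mask).bit_length()

def trailing_zeros(x, width=8):
    x &= (1 << width) - 1
    if x == 0:
        return width                  # every bit is unset
    return (x & -x).bit_length() - 1  # position of the lowest set bit

def leading_ones(x, width=8):
    return leading_zeros(~x, width)   # ones of x are zeros of ~x

def trailing_ones(x, width=8):
    return trailing_zeros(~x, width)


x = 0b00010100
print(count_ones(x), count_zeros(x), leading_zeros(x), trailing_zeros(x))  # 2 6 3 2
y = 0b11100111
print(leading_ones(y), trailing_ones(y))  # 3 3
```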
- bitwise_xor() Expr[source]
Perform an aggregation of bitwise XORs.
Examples
>>> df = pl.DataFrame({"n": [-1, 0, 1]})
>>> df.select(pl.col("n").bitwise_xor())
shape: (1, 1)
┌─────┐
│ n   │
│ --- │
│ i64 │
╞═════╡
│ -2  │
└─────┘
>>> df = pl.DataFrame(
...     {"grouper": ["a", "a", "a", "b", "b"], "n": [-1, 0, 1, -1, 1]}
... )
>>> df.group_by("grouper", maintain_order=True).agg(pl.col("n").bitwise_xor())
shape: (2, 2)
┌─────────┬─────┐
│ grouper ┆ n   │
│ ---     ┆ --- │
│ str     ┆ i64 │
╞═════════╪═════╡
│ a       ┆ -2  │
│ b       ┆ -2  │
└─────────┴─────┘
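The three bitwise aggregations (AND, OR, XOR) each fold a column down to a single value. A plain-Python sketch of that fold, using `functools.reduce` as a stand-in for the aggregation (this mirrors the results shown in the examples, not the Polars internals):

```python
from functools import reduce
from operator import and_, or_, xor

n = [-1, 0, 1]

# Fold the whole column with each bitwise operator.
print(reduce(and_, n))  # 0
print(reduce(or_, n))   # -1
print(reduce(xor, n))   # -2

# The grouped variant applies the same fold per group.
groups = {"a": [-1, 0, 1], "b": [-1, 1]}
per_group_and = {key: reduce(and_, vals) for key, vals in groups.items()}
print(per_group_and)  # {'a': 0, 'b': 1}
```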
- bottom_k(k: int | IntoExprColumn = 5) Expr[source]
Return the k smallest elements.
Non-null elements are always preferred over null elements. The output is not guaranteed to be in any particular order; call sort() after this function if you wish the output to be sorted.
This has time complexity:
\[O(n)\]
- Parameters:
- k
Number of elements to return.
Examples
>>> df = pl.DataFrame(
...     {
...         "value": [1, 98, 2, 3, 99, 4],
...     }
... )
>>> df.select(
...     pl.col("value").top_k().alias("top_k"),
...     pl.col("value").bottom_k().alias("bottom_k"),
... )
shape: (5, 2)
┌───────┬──────────┐
│ top_k ┆ bottom_k │
│ ---   ┆ ---      │
│ i64   ┆ i64      │
╞═══════╪══════════╡
│ 4     ┆ 1        │
│ 98    ┆ 98       │
│ 2     ┆ 2        │
│ 3     ┆ 3        │
│ 99    ┆ 4        │
└───────┴──────────┘
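To see which elements are selected independent of output order, the same selection can be sketched with the standard-library `heapq` module. This is an illustration only: `heapq.nsmallest` runs in O(n log k) and returns sorted output, whereas the method documented here states O(n) and makes no ordering guarantee.

```python
import heapq

values = [1, 98, 2, 3, 99, 4]

# The 5 smallest and 5 largest values; heapq returns them already sorted.
bottom = heapq.nsmallest(5, values)
top = heapq.nlargest(5, values)
print(bottom)  # [1, 2, 3, 4, 98]
print(top)     # [99, 98, 4, 3, 2]
```

Sorting the top_k and bottom_k columns from the example above gives the same sets of values.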
- bottom_k_by(by: IntoExpr | Iterable[IntoExpr], k: int | IntoExprColumn = 5, *, reverse: bool | Sequence[bool] = False) Expr[source]
Return the elements corresponding to the k smallest elements of the by column(s).
Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort() after this function if you wish the output to be sorted.
This has time complexity:
\[O(n \log{n})\]
Changed in version 1.0.0: The descending parameter was renamed reverse.
- Parameters:
- by
Column(s) used to determine the smallest elements. Accepts expression input. Strings are parsed as column names.
- k
Number of elements to return.
- reverse
Consider the k largest elements of the by column(s) (instead of the k smallest). This can be specified per column by passing a sequence of booleans.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [6, 5, 4, 3, 2, 1],
...         "c": ["Apple", "Orange", "Apple", "Apple", "Banana", "Banana"],
...     }
... )
>>> df
shape: (6, 3)
┌─────┬─────┬────────┐
│ a   ┆ b   ┆ c      │
│ --- ┆ --- ┆ ---    │
│ i64 ┆ i64 ┆ str    │
╞═════╪═════╪════════╡
│ 1   ┆ 6   ┆ Apple  │
│ 2   ┆ 5   ┆ Orange │
│ 3   ┆ 4   ┆ Apple  │
│ 4   ┆ 3   ┆ Apple  │
│ 5   ┆ 2   ┆ Banana │
│ 6   ┆ 1   ┆ Banana │
└─────┴─────┴────────┘
Get the bottom 2 rows by column a or b.
>>> df.select(
...     pl.all().bottom_k_by("a", 2).name.suffix("_btm_by_a"),
...     pl.all().bottom_k_by("b", 2).name.suffix("_btm_by_b"),
... )
shape: (2, 6)
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│ a_btm_by_a ┆ b_btm_by_a ┆ c_btm_by_a ┆ a_btm_by_b ┆ b_btm_by_b ┆ c_btm_by_b │
│ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
│ i64        ┆ i64        ┆ str        ┆ i64        ┆ i64        ┆ str        │
╞════════════╪════════════╪════════════╪════════════╪════════════╪════════════╡
│ 1          ┆ 6          ┆ Apple      ┆ 6          ┆ 1          ┆ Banana     │
│ 2          ┆ 5          ┆ Orange     ┆ 5          ┆ 2          ┆ Banana     │
└────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
Get the bottom 2 rows by multiple columns with given order.
>>> df.select(
...     pl.all()
...     .bottom_k_by(["c", "a"], 2, reverse=[False, True])
...     .name.suffix("_by_ca"),
...     pl.all()
...     .bottom_k_by(["c", "b"], 2, reverse=[False, True])
...     .name.suffix("_by_cb"),
... )
shape: (2, 6)
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ a_by_ca ┆ b_by_ca ┆ c_by_ca ┆ a_by_cb ┆ b_by_cb ┆ c_by_cb │
│ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ i64     ┆ i64     ┆ str     ┆ i64     ┆ i64     ┆ str     │
╞═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ 4       ┆ 3       ┆ Apple   ┆ 1       ┆ 6       ┆ Apple   │
│ 3       ┆ 4       ┆ Apple   ┆ 3       ┆ 4       ┆ Apple   │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
Get the bottom 2 rows by column a in each group.
>>> (
...     df.group_by("c", maintain_order=True)
...     .agg(pl.all().bottom_k_by("a", 2))
...     .explode(pl.all().exclude("c"))
... )
shape: (5, 3)
┌────────┬─────┬─────┐
│ c      ┆ a   ┆ b   │
│ ---    ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 │
╞════════╪═════╪═════╡
│ Apple  ┆ 1   ┆ 6   │
│ Apple  ┆ 3   ┆ 4   │
│ Orange ┆ 2   ┆ 5   │
│ Banana ┆ 5   ┆ 2   │
│ Banana ┆ 6   ┆ 1   │
└────────┴─────┴─────┘
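The per-column reverse flags in the multi-column example can be hard to picture. The plain-Python sketch below reproduces the `["c", "a"]` case with `reverse=[False, True]` by ranking rows on a composite key: ascending on c, with a negated so that larger values of a rank first. It is a conceptual model with an illustrative helper name, not the Polars algorithm, and negation only works because a is numeric.

```python
import heapq

rows = [
    {"a": 1, "b": 6, "c": "Apple"},
    {"a": 2, "b": 5, "c": "Orange"},
    {"a": 3, "b": 4, "c": "Apple"},
    {"a": 4, "b": 3, "c": "Apple"},
    {"a": 5, "b": 2, "c": "Banana"},
    {"a": 6, "b": 1, "c": "Banana"},
]

def bottom_k_by_c_then_a_reversed(rows, k):
    # reverse=[False, True]: smallest "c" first, then largest "a" first.
    return heapq.nsmallest(k, rows, key=lambda r: (r["c"], -r["a"]))

picked = bottom_k_by_c_then_a_reversed(rows, 2)
print([(r["a"], r["b"], r["c"]) for r in picked])
# [(4, 3, 'Apple'), (3, 4, 'Apple')]
```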
- cast(dtype: PolarsDataType | DataTypeExpr | type[Any], *, strict: bool = True, wrap_numerical: bool = False) Expr[source]
Cast between data types.
- Parameters:
- dtype
DataType to cast to.
- strict
Raise if cast is invalid on rows after predicates are pushed down. If False, invalid casts will produce null values.
- wrap_numerical
If True, numeric casts wrap overflowing values instead of marking the cast as invalid.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3],
...         "b": ["4", "5", "6"],
...     }
... )
>>> df.with_columns(
...     pl.col("a").cast(pl.Float64),
...     pl.col("b").cast(pl.Int32),
... )
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ i32 │
╞═════╪═════╡
│ 1.0 ┆ 4   │
│ 2.0 ┆ 5   │
│ 3.0 ┆ 6   │
└─────┴─────┘
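The strict and wrap_numerical parameters describe two different reactions to a value that does not fit the target type. The plain-Python sketch below models both under stated assumptions: `cast_to_int` and `wrap_int8` are hypothetical helpers, the sketch raises Python's `ValueError` rather than whatever error Polars raises, and an 8-bit signed target is assumed for the wrapping case.

```python
def cast_to_int(values, *, strict=True):
    """Model of strict vs. non-strict casting of strings to integers."""
    out = []
    for v in values:
        if v is None:
            out.append(None)      # nulls stay null in either mode
            continue
        try:
            out.append(int(v))
        except ValueError:
            if strict:
                raise             # strict: an invalid cast is an error
            out.append(None)      # non-strict: an invalid cast becomes null
    return out

def wrap_int8(x):
    """Wrap an integer into the signed 8-bit range [-128, 127]."""
    return (x + 128) % 256 - 128


print(cast_to_int(["4", "5", "six"], strict=False))  # [4, 5, None]
print(wrap_int8(300))                                # 44
```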
- cbrt() Expr[source]
Compute the cube root of the elements.
Examples
>>> df = pl.DataFrame({"values": [1.0, 2.0, 4.0]})
>>> df.select(pl.col("values").cbrt())
shape: (3, 1)
┌──────────┐
│ values   │
│ ---      │
│ f64      │
╞══════════╡
│ 1.0      │
│ 1.259921 │
│ 1.587401 │
└──────────┘
- ceil() Expr[source]
Rounds up to the nearest integer value.
Only works on floating point Series.
Examples
>>> df = pl.DataFrame({"a": [0.3, 0.5, 1.0, 1.1]})
>>> df.select(pl.col("a").ceil())
shape: (4, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 1.0 │
│ 1.0 │
│ 1.0 │
│ 2.0 │
└─────┘
- clip(lower_bound: NumericLiteral | TemporalLiteral | IntoExprColumn | None = None, upper_bound: NumericLiteral | TemporalLiteral | IntoExprColumn | None = None) Expr[source]
Set values outside the given boundaries to the boundary value.
- Parameters:
- lower_bound
Lower bound. Accepts expression input. Non-expression inputs are parsed as literals. Strings are parsed as column names.
- upper_bound
Upper bound. Accepts expression input. Non-expression inputs are parsed as literals. Strings are parsed as column names.
Notes
This method only works for numeric and temporal columns. To clip other data types, consider writing a when-then-otherwise expression. See when().
Examples
Specifying both a lower and upper bound:
>>> df = pl.DataFrame({"a": [-50, 5, 50, None]})
>>> df.with_columns(clip=pl.col("a").clip(1, 10))
shape: (4, 2)
┌──────┬──────┐
│ a    ┆ clip │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ -50  ┆ 1    │
│ 5    ┆ 5    │
│ 50   ┆ 10   │
│ null ┆ null │
└──────┴──────┘
Specifying only a single bound:
>>> df.with_columns(clip=pl.col("a").clip(upper_bound=10))
shape: (4, 2)
┌──────┬──────┐
│ a    ┆ clip │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ -50  ┆ -50  │
│ 5    ┆ 5    │
│ 50   ┆ 10   │
│ null ┆ null │
└──────┴──────┘
Using columns as bounds:
>>> df = pl.DataFrame(
...     {"a": [-50, 5, 50, None], "low": [10, 1, 0, 0], "up": [20, 4, 3, 2]}
... )
>>> df.with_columns(clip=pl.col("a").clip("low", "up"))
shape: (4, 4)
┌──────┬─────┬─────┬──────┐
│ a    ┆ low ┆ up  ┆ clip │
│ ---  ┆ --- ┆ --- ┆ ---  │
│ i64  ┆ i64 ┆ i64 ┆ i64  │
╞══════╪═════╪═════╪══════╡
│ -50  ┆ 10  ┆ 20  ┆ 10   │
│ 5    ┆ 1   ┆ 4   ┆ 4    │
│ 50   ┆ 0   ┆ 3   ┆ 3    │
│ null ┆ 0   ┆ 2   ┆ null │
└──────┴─────┴─────┴──────┘
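The clipping rule, including the optional bounds and the null pass-through seen in the examples, can be written out for a single value in plain Python. This is a sketch of the semantics only; `clip_value` is a hypothetical helper and `None` stands in for null.

```python
def clip_value(value, lower=None, upper=None):
    """Clamp one value to [lower, upper]; either bound may be omitted."""
    if value is None:
        return None               # nulls pass through unchanged
    if lower is not None and value < lower:
        return lower
    if upper is not None and value > upper:
        return upper
    return value


a = [-50, 5, 50, None]
print([clip_value(v, 1, 10) for v in a])       # [1, 5, 10, None]
print([clip_value(v, upper=10) for v in a])    # [-50, 5, 10, None]

# Row-wise bounds, as when columns are passed as lower_bound / upper_bound.
low, up = [10, 1, 0, 0], [20, 4, 3, 2]
print([clip_value(v, lo, hi) for v, lo, hi in zip(a, low, up)])  # [10, 4, 3, None]
```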
- cos() Expr[source]
Compute the element-wise value for the cosine.
- Returns:
- Expr
Expression of data type Float64.
Examples
>>> df = pl.DataFrame({"a": [0.0]})
>>> df.select(pl.col("a").cos())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 1.0 │
└─────┘
- cosh() Expr[source]
Compute the element-wise value for the hyperbolic cosine.
- Returns:
- Expr
Expression of data type Float64.
Examples
>>> df = pl.DataFrame({"a": [1.0]})
>>> df.select(pl.col("a").cosh())
shape: (1, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ 1.543081 │
└──────────┘
- cot() Expr[source]
Compute the element-wise value for the cotangent.
- Returns:
- Expr
Expression of data type Float64.
Examples
>>> df = pl.DataFrame({"a": [1.0]})
>>> df.select(pl.col("a").cot().round(2))
shape: (1, 1)
┌──────┐
│ a    │
│ ---  │
│ f64  │
╞══════╡
│ 0.64 │
└──────┘
- count() Expr[source]
Return the number of non-null elements in the column.
- Returns:
- Expr
Expression of data type
UInt32.
See also
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [None, 4, 4]})
>>> df.select(pl.all().count())
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 3   ┆ 2   │
└─────┴─────┘
- cum_count(*, reverse: bool = False) Expr[source]
Return the cumulative count of the non-null values in the column.
- Parameters:
- reverse
Reverse the operation.
Examples
>>> df = pl.DataFrame({"a": ["x", "k", None, "d"]})
>>> df.with_columns(
...     pl.col("a").cum_count().alias("cum_count"),
...     pl.col("a").cum_count(reverse=True).alias("cum_count_reverse"),
... )
shape: (4, 3)
┌──────┬───────────┬───────────────────┐
│ a    ┆ cum_count ┆ cum_count_reverse │
│ ---  ┆ ---       ┆ ---               │
│ str  ┆ u32       ┆ u32               │
╞══════╪═══════════╪═══════════════════╡
│ x    ┆ 1         ┆ 3                 │
│ k    ┆ 2         ┆ 2                 │
│ null ┆ 2         ┆ 1                 │
│ d    ┆ 3         ┆ 1                 │
└──────┴───────────┴───────────────────┘
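The counting rule in the example above can be sketched in plain Python. This is only an illustration of the semantics (nulls are modelled as `None`), not how Polars implements it; the helper name `cum_count` is a hypothetical stand-in.

```python
from typing import List, Optional, Sequence


def cum_count(values: Sequence[Optional[object]], reverse: bool = False) -> List[int]:
    """Running count of non-null entries; None stands in for a Polars null."""
    items = list(values)[::-1] if reverse else list(values)
    out: List[int] = []
    seen = 0
    for v in items:
        if v is not None:
            seen += 1  # only non-null values advance the counter
        out.append(seen)
    # With reverse=True the count runs from the end, so flip the result back.
    return out[::-1] if reverse else out


forward = cum_count(["x", "k", None, "d"])
backward = cum_count(["x", "k", None, "d"], reverse=True)
```

A null row repeats the previous count in the direction of travel, which is why the third row reads 2 going forward and 1 going backward.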
- cum_max(*, reverse: bool = False) Expr[source]
Get an array with the cumulative max computed at every element.
- Parameters:
- reverse
Reverse the operation.
Examples
>>> df = pl.DataFrame({"a": [1, 3, 2]})
>>> df.with_columns(
...     pl.col("a").cum_max().alias("cum_max"),
...     pl.col("a").cum_max(reverse=True).alias("cum_max_reverse"),
... )
shape: (3, 3)
┌─────┬─────────┬─────────────────┐
│ a   ┆ cum_max ┆ cum_max_reverse │
│ --- ┆ ---     ┆ ---             │
│ i64 ┆ i64     ┆ i64             │
╞═════╪═════════╪═════════════════╡
│ 1   ┆ 1       ┆ 3               │
│ 3   ┆ 3       ┆ 3               │
│ 2   ┆ 3       ┆ 2               │
└─────┴─────────┴─────────────────┘
Null values are excluded, but can also be filled by calling fill_null(strategy="forward").
>>> df = pl.DataFrame({"values": [None, 10, None, 8, 9, None, 16, None]})
>>> df.with_columns(
...     pl.col("values").cum_max().alias("cum_max"),
...     pl.col("values")
...     .cum_max()
...     .fill_null(strategy="forward")
...     .alias("cum_max_all_filled"),
... )
shape: (8, 3)
┌────────┬─────────┬────────────────────┐
│ values ┆ cum_max ┆ cum_max_all_filled │
│ ---    ┆ ---     ┆ ---                │
│ i64    ┆ i64     ┆ i64                │
╞════════╪═════════╪════════════════════╡
│ null   ┆ null    ┆ null               │
│ 10     ┆ 10      ┆ 10                 │
│ null   ┆ null    ┆ 10                 │
│ 8      ┆ 10      ┆ 10                 │
│ 9      ┆ 10      ┆ 10                 │
│ null   ┆ null    ┆ 10                 │
│ 16     ┆ 16      ┆ 16                 │
│ null   ┆ null    ┆ 16                 │
└────────┴─────────┴────────────────────┘
- cum_min(*, reverse: bool = False) Expr[source]
Get an array with the cumulative min computed at every element.
- Parameters:
- reverse
Reverse the operation.
Examples
>>> df = pl.DataFrame({"a": [3, 1, 2]})
>>> df.with_columns(
...     pl.col("a").cum_min().alias("cum_min"),
...     pl.col("a").cum_min(reverse=True).alias("cum_min_reverse"),
... )
shape: (3, 3)
┌─────┬─────────┬─────────────────┐
│ a   ┆ cum_min ┆ cum_min_reverse │
│ --- ┆ ---     ┆ ---             │
│ i64 ┆ i64     ┆ i64             │
╞═════╪═════════╪═════════════════╡
│ 3   ┆ 3       ┆ 1               │
│ 1   ┆ 1       ┆ 1               │
│ 2   ┆ 1       ┆ 2               │
└─────┴─────────┴─────────────────┘
- cum_prod(*, reverse: bool = False) Expr[source]
Get an array with the cumulative product computed at every element.
- Parameters:
- reverse
Reverse the operation.
Notes
Dtypes in {Int8, UInt8, Int16, UInt16} are cast to Int64 before multiplying to prevent overflow issues.
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3, 4]})
>>> df.with_columns(
...     pl.col("a").cum_prod().alias("cum_prod"),
...     pl.col("a").cum_prod(reverse=True).alias("cum_prod_reverse"),
... )
shape: (4, 3)
┌─────┬──────────┬──────────────────┐
│ a   ┆ cum_prod ┆ cum_prod_reverse │
│ --- ┆ ---      ┆ ---              │
│ i64 ┆ i64      ┆ i64              │
╞═════╪══════════╪══════════════════╡
│ 1   ┆ 1        ┆ 24               │
│ 2   ┆ 2        ┆ 24               │
│ 3   ┆ 6        ┆ 12               │
│ 4   ┆ 24       ┆ 4                │
└─────┴──────────┴──────────────────┘
- cum_sum(*, reverse: bool = False) Expr[source]
Get an array with the cumulative sum computed at every element.
- Parameters:
- reverse
Reverse the operation.
Notes
Dtypes in {Int8, UInt8, Int16, UInt16} are cast to Int64 before summing to prevent overflow issues.
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3, 4]})
>>> df.with_columns(
...     pl.col("a").cum_sum().alias("cum_sum"),
...     pl.col("a").cum_sum(reverse=True).alias("cum_sum_reverse"),
... )
shape: (4, 3)
┌─────┬─────────┬─────────────────┐
│ a   ┆ cum_sum ┆ cum_sum_reverse │
│ --- ┆ ---     ┆ ---             │
│ i64 ┆ i64     ┆ i64             │
╞═════╪═════════╪═════════════════╡
│ 1   ┆ 1       ┆ 10              │
│ 2   ┆ 3       ┆ 9               │
│ 3   ┆ 6       ┆ 7               │
│ 4   ┆ 10      ┆ 4               │
└─────┴─────────┴─────────────────┘
Null values are excluded, but can also be filled by calling fill_null(strategy="forward").
>>> df = pl.DataFrame({"values": [None, 10, None, 8, 9, None, 16, None]})
>>> df.with_columns(
...     pl.col("values").cum_sum().alias("value_cum_sum"),
...     pl.col("values")
...     .cum_sum()
...     .fill_null(strategy="forward")
...     .alias("value_cum_sum_all_filled"),
... )
shape: (8, 3)
┌────────┬───────────────┬──────────────────────────┐
│ values ┆ value_cum_sum ┆ value_cum_sum_all_filled │
│ ---    ┆ ---           ┆ ---                      │
│ i64    ┆ i64           ┆ i64                      │
╞════════╪═══════════════╪══════════════════════════╡
│ null   ┆ null          ┆ null                     │
│ 10     ┆ 10            ┆ 10                       │
│ null   ┆ null          ┆ 10                       │
│ 8      ┆ 18            ┆ 18                       │
│ 9      ┆ 27            ┆ 27                       │
│ null   ┆ null          ┆ 27                       │
│ 16     ┆ 43            ┆ 43                       │
│ null   ┆ null          ┆ 43                       │
└────────┴───────────────┴──────────────────────────┘
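The interaction between a null-skipping cumulative sum and a forward fill can be sketched in plain Python. The helper names `cum_sum` and `forward_fill` are hypothetical, `None` stands in for a Polars null, and the code only mirrors the behaviour shown in the example above.

```python
from typing import List, Optional, Sequence


def cum_sum(values: Sequence[Optional[int]]) -> List[Optional[int]]:
    """Running sum that leaves nulls in place and skips them in the total."""
    out: List[Optional[int]] = []
    total = 0
    for v in values:
        if v is None:
            out.append(None)  # the null position stays null
        else:
            total += v
            out.append(total)
    return out


def forward_fill(values: Sequence[Optional[int]]) -> List[Optional[int]]:
    """Replace each null with the most recent non-null value, if any."""
    out: List[Optional[int]] = []
    last: Optional[int] = None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


data = [None, 10, None, 8, 9, None, 16, None]
summed = cum_sum(data)
filled = forward_fill(summed)
```

The leading null survives the forward fill because there is no earlier value to carry forward.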
- cumulative_eval(
- expr: Expr,
- *,
- min_samples: int = 1,
Run an expression over a sliding window that increases 1 slot every iteration.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- expr
Expression to evaluate
- min_samples
Number of valid values there should be in the window before the expression is evaluated. valid values =
length - null_count
Warning
This can be really slow as it can have O(n^2) complexity. Don't use this for operations that visit all elements.
Examples
>>> df = pl.DataFrame({"values": [1, 2, 3, 4, 5]})
>>> df.select(
...     [
...         pl.col("values").cumulative_eval(
...             pl.element().first() - pl.element().last() ** 2
...         )
...     ]
... )
shape: (5, 1)
┌────────┐
│ values │
│ ---    │
│ i64    │
╞════════╡
│ 0      │
│ -3     │
│ -8     │
│ -15    │
│ -24    │
└────────┘
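The growing-window evaluation and its quadratic cost can be sketched in plain Python. The function name `cumulative_eval` and its callback-based signature are illustrative assumptions, not the Polars API; the window at step `k` is simply the first `k` elements.

```python
from typing import Callable, List, Optional, Sequence


def cumulative_eval(
    values: Sequence[Optional[int]],
    func: Callable[[Sequence[Optional[int]]], int],
    min_samples: int = 1,
) -> List[Optional[int]]:
    """Evaluate func on values[:1], values[:2], ... (an expanding window)."""
    out: List[Optional[int]] = []
    for end in range(1, len(values) + 1):
        window = values[:end]
        # valid values = length - null_count
        valid = sum(v is not None for v in window)
        out.append(func(window) if valid >= min_samples else None)
    return out


# first() - last() ** 2 only touches two elements per window, so it is cheap.
result = cumulative_eval([1, 2, 3, 4, 5], lambda w: w[0] - w[-1] ** 2)

# sum() visits every element of every window: n * (n + 1) / 2 visits in total.
running_sum = cumulative_eval([1, 2, 3, 4, 5], sum)
```

The second call is the kind of operation the warning refers to; a dedicated cumulative function computes the same running sum in a single pass.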
- cut(
- breaks: Sequence[float],
- *,
- labels: Sequence[str_] | None = None,
- left_closed: bool = False,
- include_breaks: bool = False,
Bin continuous values into discrete categories.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- breaks
List of unique cut points.
- labels
Names of the categories. The number of labels must be equal to the number of cut points plus one.
- left_closed
Set the intervals to be left-closed instead of right-closed.
- include_breaks
Include a column with the right endpoint of the bin each observation falls in. This will change the data type of the output from an Enum to a Struct.
- Returns:
- Expr
Expression of data type Enum if include_breaks is set to False (default), otherwise an expression of data type Struct.
See also
Examples
Divide a column into three categories.
>>> df = pl.DataFrame({"foo": [-2, -1, 0, 1, 2]})
>>> df.with_columns(
...     pl.col("foo").cut([-1, 1], labels=["a", "b", "c"]).alias("cut")
... )
shape: (5, 2)
┌─────┬──────┐
│ foo ┆ cut  │
│ --- ┆ ---  │
│ i64 ┆ enum │
╞═════╪══════╡
│ -2  ┆ a    │
│ -1  ┆ a    │
│ 0   ┆ b    │
│ 1   ┆ b    │
│ 2   ┆ c    │
└─────┴──────┘
Add both the category and the breakpoint.
>>> df.with_columns(
...     pl.col("foo").cut([-1, 1], include_breaks=True).alias("cut")
... ).unnest("cut")
shape: (5, 3)
┌─────┬────────────┬────────────┐
│ foo ┆ breakpoint ┆ category   │
│ --- ┆ ---        ┆ ---        │
│ i64 ┆ f64        ┆ enum       │
╞═════╪════════════╪════════════╡
│ -2  ┆ -1.0       ┆ (-inf, -1] │
│ -1  ┆ -1.0       ┆ (-inf, -1] │
│ 0   ┆ 1.0        ┆ (-1, 1]    │
│ 1   ┆ 1.0        ┆ (-1, 1]    │
│ 2   ┆ inf        ┆ (1, inf]   │
└─────┴────────────┴────────────┘
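The difference between right-closed and left-closed bins can be sketched with the standard library's `bisect` module. The helper `cut` below is a hypothetical stand-in that only models the label lookup, under the assumption that `breaks` is sorted and unique as the parameter description requires.

```python
from bisect import bisect_left, bisect_right
from typing import List, Optional, Sequence


def cut(
    values: Sequence[Optional[float]],
    breaks: Sequence[float],
    labels: Sequence[str],
    left_closed: bool = False,
) -> List[Optional[str]]:
    """Assign each value the label of the interval it falls in."""
    if len(labels) != len(breaks) + 1:
        raise ValueError("need exactly len(breaks) + 1 labels")
    # Right-closed bins (a, b]: a value equal to a break belongs to the lower bin.
    # Left-closed bins [a, b): a value equal to a break belongs to the upper bin.
    find = bisect_right if left_closed else bisect_left
    return [None if v is None else labels[find(breaks, v)] for v in values]


right_closed = cut([-2, -1, 0, 1, 2], [-1, 1], ["a", "b", "c"])
left_closed = cut([-2, -1, 0, 1, 2], [-1, 1], ["a", "b", "c"], left_closed=True)
```

Only the values that sit exactly on a break (-1 and 1 here) change bins when `left_closed` is toggled.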
- degrees() Expr[source]
Convert from radians to degrees.
- Returns:
- Expr
Expression of data type
Float64.
Examples
>>> import math
>>> df = pl.DataFrame({"a": [x * math.pi for x in range(-4, 5)]})
>>> df.select(pl.col("a").degrees())
shape: (9, 1)
┌────────┐
│ a      │
│ ---    │
│ f64    │
╞════════╡
│ -720.0 │
│ -540.0 │
│ -360.0 │
│ -180.0 │
│ 0.0    │
│ 180.0  │
│ 360.0  │
│ 540.0  │
│ 720.0  │
└────────┘
- classmethod deserialize(
- source: str_ | Path | IOBase | bytes,
- *,
- format: SerializationFormat = 'binary',
Read a serialized expression from a file.
- Parameters:
- source
Path to a file or a file-like object (by file-like object, we refer to objects that have a read() method, such as a file handler (e.g. via builtin open function) or BytesIO).
- format
The format with which the Expr was serialized. Options:
"binary": Deserialize from binary format (bytes). This is the default.
"json": Deserialize from JSON format (string).
Warning
This function uses pickle if the logical plan contains Python UDFs, and as such inherits the security implications. Deserializing can execute arbitrary code, so it should only be attempted on trusted data.
See also
Notes
Serialization is not stable across Polars versions: an expression serialized in one Polars version may not be deserializable in another Polars version.
Examples
>>> import io
>>> expr = pl.col("foo").sum().over("bar")
>>> bytes = expr.meta.serialize()
>>> pl.Expr.deserialize(io.BytesIO(bytes))
<Expr ['col("foo").sum().over([col("ba…'] at ...>
- diff(n: int | IntoExpr = 1, null_behavior: NullBehavior = 'ignore') Expr[source]
Calculate the first discrete difference between shifted items.
- Parameters:
- n
Number of slots to shift.
- null_behavior{'ignore', 'drop'}
How to handle null values.
Examples
>>> df = pl.DataFrame({"int": [20, 10, 30, 25, 35]})
>>> df.with_columns(change=pl.col("int").diff())
shape: (5, 2)
┌─────┬────────┐
│ int ┆ change │
│ --- ┆ ---    │
│ i64 ┆ i64    │
╞═════╪════════╡
│ 20  ┆ null   │
│ 10  ┆ -10    │
│ 30  ┆ 20     │
│ 25  ┆ -5     │
│ 35  ┆ 10     │
└─────┴────────┘
>>> df.with_columns(change=pl.col("int").diff(n=2))
shape: (5, 2)
┌─────┬────────┐
│ int ┆ change │
│ --- ┆ ---    │
│ i64 ┆ i64    │
╞═════╪════════╡
│ 20  ┆ null   │
│ 10  ┆ null   │
│ 30  ┆ 10     │
│ 25  ┆ 15     │
│ 35  ┆ 5      │
└─────┴────────┘
>>> df.select(pl.col("int").diff(n=2, null_behavior="drop").alias("diff"))
shape: (3, 1)
┌──────┐
│ diff │
│ ---  │
│ i64  │
╞══════╡
│ 10   │
│ 15   │
│ 5    │
└──────┘
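The shift-and-subtract rule behind these outputs can be sketched in plain Python. The helper `diff` is hypothetical and assumes a non-negative `n`; `None` stands in for a Polars null.

```python
from typing import List, Optional, Sequence


def diff(
    values: Sequence[Optional[int]], n: int = 1, null_behavior: str = "ignore"
) -> List[Optional[int]]:
    """values[i] - values[i - n], with null where no shifted partner exists."""
    out: List[Optional[int]] = []
    for i, v in enumerate(values):
        if i < n or v is None or values[i - n] is None:
            out.append(None)  # no value n slots back, or a null operand
        else:
            out.append(v - values[i - n])
    # "drop" removes the n leading slots that can never have a partner.
    return out[n:] if null_behavior == "drop" else out


data = [20, 10, 30, 25, 35]
first = diff(data)
second = diff(data, n=2)
dropped = diff(data, n=2, null_behavior="drop")
```

With `null_behavior="drop"` the result is shorter than the input, which is why the last documented example uses `select` rather than `with_columns`.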
- dot(other: Expr | str_) Expr[source]
Compute the dot/inner product between two Expressions.
- Parameters:
- other
Expression to compute dot product with.
Examples
>>> df = pl.DataFrame({"a": [1, 3, 5], "b": [2, 4, 6]})
>>> df.select(pl.col("a").dot(pl.col("b")))
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 44  │
└─────┘
- drop_nans() Expr[source]
Drop all floating point NaN values.
The original order of the remaining elements is preserved.
See also
Notes
A NaN value is not the same as a null value. To drop null values, use drop_nulls().
Examples
>>> df = pl.DataFrame({"a": [1.0, None, 3.0, float("nan")]})
>>> df.select(pl.col("a").drop_nans())
shape: (3, 1)
┌──────┐
│ a    │
│ ---  │
│ f64  │
╞══════╡
│ 1.0  │
│ null │
│ 3.0  │
└──────┘
- drop_nulls() Expr[source]
Drop all null values.
The original order of the remaining elements is preserved.
See also
Notes
A null value is not the same as a NaN value. To drop NaN values, use drop_nans().
Examples
>>> df = pl.DataFrame({"a": [1.0, None, 3.0, float("nan")]})
>>> df.select(pl.col("a").drop_nulls())
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 1.0 │
│ 3.0 │
│ NaN │
└─────┘
- entropy( ) Expr[source]
Computes the entropy.
Uses the formula -sum(pk * log(pk)) where pk are discrete probabilities.
- Parameters:
- base
Given base, defaults to e
- normalize
Normalize pk if it doesn't sum to 1.
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3]})
>>> df.select(pl.col("a").entropy(base=2))
shape: (1, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ 1.459148 │
└──────────┘
>>> df.select(pl.col("a").entropy(base=2, normalize=False))
shape: (1, 1)
┌───────────┐
│ a         │
│ ---       │
│ f64       │
╞═══════════╡
│ -6.754888 │
└───────────┘
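As a worked check of the formula -sum(pk * log(pk)), the following plain-Python sketch reproduces both documented values. The helper `entropy` is a hypothetical stand-in, and treating zero entries as contributing nothing is an assumption of this sketch.

```python
import math
from typing import Sequence


def entropy(values: Sequence[float], base: float = math.e, normalize: bool = True) -> float:
    """-sum(pk * log(pk)), optionally rescaling the inputs to sum to 1 first."""
    total = sum(values)
    pk = [v / total for v in values] if normalize else list(values)
    # Zero entries are skipped (assumption: they contribute nothing to the sum).
    return -sum(p * math.log(p, base) for p in pk if p > 0)


# Normalized: pk = [1/6, 1/3, 1/2], a proper probability distribution.
normalized = entropy([1, 2, 3], base=2)

# Not normalized: the raw values 1, 2, 3 are plugged into the formula as-is.
raw = entropy([1, 2, 3], base=2, normalize=False)
```

The negative result in the unnormalized case is expected: values greater than 1 have a positive logarithm, so they are not meaningful as probabilities.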
- eq(other: Any) Expr[source]
Method equivalent of equality operator expr == other.
- Parameters:
- other
A literal or expression value to compare with.
Examples
>>> df = pl.DataFrame(
...     data={
...         "x": [1.0, 2.0, float("nan"), 4.0],
...         "y": [2.0, 2.0, float("nan"), 4.0],
...     }
... )
>>> df.with_columns(
...     pl.col("x").eq(pl.col("y")).alias("x == y"),
... )
shape: (4, 3)
┌─────┬─────┬────────┐
│ x   ┆ y   ┆ x == y │
│ --- ┆ --- ┆ ---    │
│ f64 ┆ f64 ┆ bool   │
╞═════╪═════╪════════╡
│ 1.0 ┆ 2.0 ┆ false  │
│ 2.0 ┆ 2.0 ┆ true   │
│ NaN ┆ NaN ┆ true   │
│ 4.0 ┆ 4.0 ┆ true   │
└─────┴─────┴────────┘
- eq_missing(other: Any) Expr[source]
Method equivalent of equality operator expr == other where None == None.
This differs from default eq where null values are propagated.
- Parameters:
- other
A literal or expression value to compare with.
Examples
>>> df = pl.DataFrame(
...     data={
...         "x": [1.0, 2.0, float("nan"), 4.0, None, None],
...         "y": [2.0, 2.0, float("nan"), 4.0, 5.0, None],
...     }
... )
>>> df.with_columns(
...     pl.col("x").eq(pl.col("y")).alias("x eq y"),
...     pl.col("x").eq_missing(pl.col("y")).alias("x eq_missing y"),
... )
shape: (6, 4)
┌──────┬──────┬────────┬────────────────┐
│ x    ┆ y    ┆ x eq y ┆ x eq_missing y │
│ ---  ┆ ---  ┆ ---    ┆ ---            │
│ f64  ┆ f64  ┆ bool   ┆ bool           │
╞══════╪══════╪════════╪════════════════╡
│ 1.0  ┆ 2.0  ┆ false  ┆ false          │
│ 2.0  ┆ 2.0  ┆ true   ┆ true           │
│ NaN  ┆ NaN  ┆ true   ┆ true           │
│ 4.0  ┆ 4.0  ┆ true   ┆ true           │
│ null ┆ 5.0  ┆ null   ┆ false          │
│ null ┆ null ┆ null   ┆ true           │
└──────┴──────┴────────┴────────────────┘
- ewm_mean(
- *,
- com: float | None = None,
- span: float | None = None,
- half_life: float | None = None,
- alpha: float | None = None,
- adjust: bool = True,
- min_samples: int = 1,
- ignore_nulls: bool = False,
Compute exponentially-weighted moving average.
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- com
Specify decay in terms of center of mass, \(\gamma\), with
\[\alpha = \frac{1}{1 + \gamma} \; \forall \; \gamma \geq 0\]
- span
Specify decay in terms of span, \(\theta\), with
\[\alpha = \frac{2}{\theta + 1} \; \forall \; \theta \geq 1\]
- half_life
Specify decay in terms of half-life, \(\tau\), with
\[\alpha = 1 - \exp \left\{ \frac{ -\ln(2) }{ \tau } \right\} \; \forall \; \tau > 0\]
- alpha
Specify smoothing factor alpha directly, \(0 < \alpha \leq 1\).
- adjust
Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings.
When adjust=True (the default) the EW function is calculated using weights \(w_i = (1 - \alpha)^i\).
When adjust=False the EW function is calculated recursively by
\[\begin{split}y_0 &= x_0 \\ y_t &= (1 - \alpha)y_{t - 1} + \alpha x_t\end{split}\]
- min_samples
Minimum number of observations in window required to have a value (otherwise result is null).
- ignore_nulls
Ignore missing values when calculating weights.
When ignore_nulls=False (default), weights are based on absolute positions. For example, the weights of \(x_0\) and \(x_2\) used in calculating the final weighted average of [\(x_0\), None, \(x_2\)] are \((1-\alpha)^2\) and \(1\) if adjust=True, and \((1-\alpha)^2\) and \(\alpha\) if adjust=False.
When ignore_nulls=True, weights are based on relative positions. For example, the weights of \(x_0\) and \(x_2\) used in calculating the final weighted average of [\(x_0\), None, \(x_2\)] are \(1-\alpha\) and \(1\) if adjust=True, and \(1-\alpha\) and \(\alpha\) if adjust=False.
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3]})
>>> df.select(pl.col("a").ewm_mean(com=1, ignore_nulls=False))
shape: (3, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ 1.0      │
│ 1.666667 │
│ 2.428571 │
└──────────┘
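The two weighting schemes described under adjust can be checked by hand with a short plain-Python sketch. The helper `ewm_mean` is a hypothetical stand-in that assumes no nulls in the input and takes the decay only as `com`, so that alpha = 1 / (1 + com).

```python
from typing import List, Sequence


def ewm_mean(values: Sequence[float], com: float, adjust: bool = True) -> List[float]:
    """Exponentially weighted mean for null-free input."""
    alpha = 1.0 / (1.0 + com)
    out: List[float] = []
    if adjust:
        # Weighted average with weights w_i = (1 - alpha)^i, newest value first.
        # Both sums are updated incrementally instead of being recomputed.
        num = den = 0.0
        for x in values:
            num = x + (1 - alpha) * num
            den = 1 + (1 - alpha) * den
            out.append(num / den)
    else:
        # Recursive form: y_0 = x_0, y_t = (1 - alpha) * y_{t-1} + alpha * x_t.
        y = None
        for x in values:
            y = float(x) if y is None else (1 - alpha) * y + alpha * x
            out.append(y)
    return out


adjusted = ewm_mean([1, 2, 3], com=1)
recursive = ewm_mean([1, 2, 3], com=1, adjust=False)
```

With com=1 (alpha = 0.5) the second adjusted value is (2 + 0.5 * 1) / (1 + 0.5) = 1.666667, matching the documented output.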
- ewm_mean_by(by: str_ | IntoExpr, *, half_life: str_ | timedelta) Expr[source]
Compute time-based exponentially weighted moving average.
Given observations \(x_0, x_1, \ldots, x_{n-1}\) at times \(t_0, t_1, \ldots, t_{n-1}\), the EWMA is calculated as
\[\begin{aligned}y_0 &= x_0 \\ \alpha_i &= 1 - \exp \left\{ \frac{ -\ln(2)(t_i-t_{i-1}) } { \tau } \right\} \\ y_i &= \alpha_i x_i + (1 - \alpha_i) y_{i-1}; \quad i > 0\end{aligned}\]
where \(\tau\) is the half_life.
- Parameters:
- by
Times to calculate average by. Should be Datetime, Date, UInt64, UInt32, Int64, or Int32 data type.
- half_life
Unit over which observation decays to half its value.
Can be created either from a timedelta, or by using the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 day)
1w (1 week)
1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
Note that half_life is treated as a constant duration. Calendar durations such as months (or even days in the time-zone-aware case) are not supported; please express your duration in an approximately equivalent number of hours (e.g. "730h" instead of "1mo").
- Returns:
- Expr
Float16 if input is Float16, Float32 if input is Float32, otherwise Float64.
Examples
>>> from datetime import date, timedelta
>>> df = pl.DataFrame(
...     {
...         "values": [0, 1, 2, None, 4],
...         "times": [
...             date(2020, 1, 1),
...             date(2020, 1, 3),
...             date(2020, 1, 10),
...             date(2020, 1, 15),
...             date(2020, 1, 17),
...         ],
...     }
... ).sort("times")
>>> df.with_columns(
...     result=pl.col("values").ewm_mean_by("times", half_life="4d"),
... )
shape: (5, 3)
┌────────┬────────────┬──────────┐
│ values ┆ times      ┆ result   │
│ ---    ┆ ---        ┆ ---      │
│ i64    ┆ date       ┆ f64      │
╞════════╪════════════╪══════════╡
│ 0      ┆ 2020-01-01 ┆ 0.0      │
│ 1      ┆ 2020-01-03 ┆ 0.292893 │
│ 2      ┆ 2020-01-10 ┆ 1.492474 │
│ null   ┆ 2020-01-15 ┆ null     │
│ 4      ┆ 2020-01-17 ┆ 3.254508 │
└────────┴────────────┴──────────┘
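The recurrence for the time-based average can be traced in plain Python. The helper `ewm_mean_by` below is a hypothetical stand-in. Its handling of nulls (emit null, then measure the next time gap from the last non-null observation) is an assumption inferred from the example output above, not something the parameter description states.

```python
import math
from datetime import date, timedelta
from typing import List, Optional, Sequence


def ewm_mean_by(
    values: Sequence[Optional[float]],
    times: Sequence[date],
    half_life: timedelta,
) -> List[Optional[float]]:
    """y_0 = x_0; y_i = alpha_i * x_i + (1 - alpha_i) * y_{i-1}."""
    out: List[Optional[float]] = []
    prev_y: Optional[float] = None
    prev_t: Optional[date] = None
    for x, t in zip(values, times):
        if x is None:
            out.append(None)  # assumption: null rows are skipped entirely
            continue
        if prev_y is None:
            y = float(x)
        else:
            # alpha_i = 1 - exp(-ln(2) * (t_i - t_{i-1}) / tau)
            gap = (t - prev_t) / half_life  # timedelta / timedelta -> float
            alpha = 1 - math.exp(-math.log(2) * gap)
            y = alpha * x + (1 - alpha) * prev_y
        out.append(y)
        prev_y, prev_t = y, t
    return out


times = [date(2020, 1, d) for d in (1, 3, 10, 15, 17)]
result = ewm_mean_by([0, 1, 2, None, 4], times, half_life=timedelta(days=4))
```

A larger gap between observations yields a larger alpha, so the new value dominates: the 7-day gap before 2020-01-10 gives alpha of about 0.70, compared with about 0.29 for the 2-day gap before 2020-01-03.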
- ewm_std(
- *,
- com: float | None = None,
- span: float | None = None,
- half_life: float | None = None,
- alpha: float | None = None,
- adjust: bool = True,
- bias: bool = False,
- min_samples: int = 1,
- ignore_nulls: bool = False,
Compute exponentially-weighted moving standard deviation.
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- com
Specify decay in terms of center of mass, \(\gamma\), with
\[\alpha = \frac{1}{1 + \gamma} \; \forall \; \gamma \geq 0\]
- span
Specify decay in terms of span, \(\theta\), with
\[\alpha = \frac{2}{\theta + 1} \; \forall \; \theta \geq 1\]
- half_life
Specify decay in terms of half-life, \(\lambda\), with
\[\alpha = 1 - \exp \left\{ \frac{ -\ln(2) }{ \lambda } \right\} \; \forall \; \lambda > 0\]
- alpha
Specify smoothing factor alpha directly, \(0 < \alpha \leq 1\).
- adjust
Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings.
When adjust=True (the default) the EW function is calculated using weights \(w_i = (1 - \alpha)^i\).
When adjust=False the EW function is calculated recursively by
\[\begin{split}y_0 &= x_0 \\ y_t &= (1 - \alpha)y_{t - 1} + \alpha x_t\end{split}\]
- bias
When bias=False, apply a correction to make the estimate statistically unbiased.
- min_samples
Minimum number of observations in window required to have a value (otherwise result is null).
- ignore_nulls
Ignore missing values when calculating weights.
When ignore_nulls=False (default), weights are based on absolute positions. For example, the weights of \(x_0\) and \(x_2\) used in calculating the final weighted average of [\(x_0\), None, \(x_2\)] are \((1-\alpha)^2\) and \(1\) if adjust=True, and \((1-\alpha)^2\) and \(\alpha\) if adjust=False.
When ignore_nulls=True, weights are based on relative positions. For example, the weights of \(x_0\) and \(x_2\) used in calculating the final weighted average of [\(x_0\), None, \(x_2\)] are \(1-\alpha\) and \(1\) if adjust=True, and \(1-\alpha\) and \(\alpha\) if adjust=False.
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3]})
>>> df.select(pl.col("a").ewm_std(com=1, ignore_nulls=False))
shape: (3, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ 0.0      │
│ 0.707107 │
│ 0.963624 │
└──────────┘
- ewm_var(
- *,
- com: float | None = None,
- span: float | None = None,
- half_life: float | None = None,
- alpha: float | None = None,
- adjust: bool = True,
- bias: bool = False,
- min_samples: int = 1,
- ignore_nulls: bool = False,
Compute exponentially-weighted moving variance.
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- com
Specify decay in terms of center of mass, \(\gamma\), with
\[\alpha = \frac{1}{1 + \gamma} \; \forall \; \gamma \geq 0\]
- span
Specify decay in terms of span, \(\theta\), with
\[\alpha = \frac{2}{\theta + 1} \; \forall \; \theta \geq 1\]
- half_life
Specify decay in terms of half-life, \(\lambda\), with
\[\alpha = 1 - \exp \left\{ \frac{ -\ln(2) }{ \lambda } \right\} \; \forall \; \lambda > 0\]
- alpha
Specify smoothing factor alpha directly, \(0 < \alpha \leq 1\).
- adjust
Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings.
When adjust=True (the default) the EW function is calculated using weights \(w_i = (1 - \alpha)^i\).
When adjust=False the EW function is calculated recursively by
\[\begin{split}y_0 &= x_0 \\ y_t &= (1 - \alpha)y_{t - 1} + \alpha x_t\end{split}\]
- bias
When bias=False, apply a correction to make the estimate statistically unbiased.
- min_samples
Minimum number of observations in window required to have a value (otherwise result is null).
- ignore_nulls
Ignore missing values when calculating weights.
When ignore_nulls=False (default), weights are based on absolute positions. For example, the weights of \(x_0\) and \(x_2\) used in calculating the final weighted average of [\(x_0\), None, \(x_2\)] are \((1-\alpha)^2\) and \(1\) if adjust=True, and \((1-\alpha)^2\) and \(\alpha\) if adjust=False.
When ignore_nulls=True, weights are based on relative positions. For example, the weights of \(x_0\) and \(x_2\) used in calculating the final weighted average of [\(x_0\), None, \(x_2\)] are \(1-\alpha\) and \(1\) if adjust=True, and \(1-\alpha\) and \(\alpha\) if adjust=False.
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3]})
>>> df.select(pl.col("a").ewm_var(com=1, ignore_nulls=False))
shape: (3, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ 0.0      │
│ 0.5      │
│ 0.928571 │
└──────────┘
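The documented variance values can be reproduced with a plain-Python sketch of the adjust=True case. Two points are assumptions rather than statements from this page: the debiasing factor used here is the common reliability-weights correction, (sum w)^2 / ((sum w)^2 - sum w^2), and a single observation is reported as 0.0 to match the first row of the example. The helper `ewm_var` is hypothetical and assumes null-free input.

```python
import math
from typing import List, Sequence


def ewm_var(values: Sequence[float], com: float, bias: bool = False) -> List[float]:
    """Exponentially weighted variance with weights w_i = (1 - alpha)^i."""
    alpha = 1.0 / (1.0 + com)
    out: List[float] = []
    for t in range(len(values)):
        # The newest observation gets weight 1, older ones decay geometrically.
        weights = [(1 - alpha) ** (t - i) for i in range(t + 1)]
        window = values[: t + 1]
        wsum = sum(weights)
        mean = sum(w * x for w, x in zip(weights, window)) / wsum
        var = sum(w * (x - mean) ** 2 for w, x in zip(weights, window)) / wsum
        if not bias:
            # Assumed correction; undefined for one observation, reported as 0.0.
            denom = wsum ** 2 - sum(w * w for w in weights)
            var = var * wsum ** 2 / denom if denom > 0 else 0.0
        out.append(var)
    return out


variance = ewm_var([1, 2, 3], com=1)
std = [math.sqrt(v) for v in variance]
```

Taking the square root of each variance gives 0.0, 0.707107 and 0.963624, the values shown in the ewm_std example.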
- exclude(
- columns: str_ | PolarsDataType | Collection[str_ | PolarsDataType],
- *more_columns: str_ | PolarsDataType,
Exclude columns from a multi-column expression.
Only works after a wildcard or regex column selection, and you cannot provide both string column names and dtypes (you may prefer to use selectors instead).
- Parameters:
- columns
The name or datatype of the column(s) to exclude. Accepts regular expression input. Regular expressions should start with ^ and end with $.
- *more_columns
Additional names or datatypes of columns to exclude, specified as positional arguments.
Examples
>>> df = pl.DataFrame(
...     {
...         "aa": [1, 2, 3],
...         "ba": ["a", "b", None],
...         "cc": [None, 2.5, 1.5],
...     }
... )
>>> df
shape: (3, 3)
┌─────┬──────┬──────┐
│ aa  ┆ ba   ┆ cc   │
│ --- ┆ ---  ┆ ---  │
│ i64 ┆ str  ┆ f64  │
╞═════╪══════╪══════╡
│ 1   ┆ a    ┆ null │
│ 2   ┆ b    ┆ 2.5  │
│ 3   ┆ null ┆ 1.5  │
└─────┴──────┴──────┘
Exclude by column name(s):
>>> df.select(pl.all().exclude("ba"))
shape: (3, 2)
┌─────┬──────┐
│ aa  ┆ cc   │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ null │
│ 2   ┆ 2.5  │
│ 3   ┆ 1.5  │
└─────┴──────┘
Exclude by regex, e.g. removing all columns whose names end with the letter 'a':
>>> df.select(pl.all().exclude("^.*a$"))
shape: (3, 1)
┌──────┐
│ cc   │
│ ---  │
│ f64  │
╞══════╡
│ null │
│ 2.5  │
│ 1.5  │
└──────┘
Exclude by dtype(s), e.g. removing all columns of type Int64 or Float64:
>>> df.select(pl.all().exclude([pl.Int64, pl.Float64]))
shape: (3, 1)
┌──────┐
│ ba   │
│ ---  │
│ str  │
╞══════╡
│ a    │
│ b    │
│ null │
└──────┘
- exp() Expr[source]
Compute the exponential, element-wise.
Examples
>>> df = pl.DataFrame({"values": [1.0, 2.0, 4.0]})
>>> df.select(pl.col("values").exp())
shape: (3, 1)
┌──────────┐
│ values   │
│ ---      │
│ f64      │
╞══════════╡
│ 2.718282 │
│ 7.389056 │
│ 54.59815 │
└──────────┘
- explode( ) Expr[source]
Explode a list expression.
This means that every item is expanded to a new row.
- Parameters:
- empty_as_null
Explode an empty list/array into a null.
- keep_nulls
Explode a null list/array into a null.
- Returns:
- Expr
Expression with the data type of the list elements.
See also
Expr.list.explode: Explode a list column.
Examples
>>> df = pl.DataFrame(
...     {
...         "group": ["a", "b"],
...         "values": [
...             [1, 2],
...             [3, 4],
...         ],
...     }
... )
>>> df.select(pl.col("values").explode())
shape: (4, 1)
┌────────┐
│ values │
│ ---    │
│ i64    │
╞════════╡
│ 1      │
│ 2      │
│ 3      │
│ 4      │
└────────┘
- extend_constant(value: IntoExpr, n: int | IntoExprColumn) Expr[source]
Extremely fast method for extending the Series with 'n' copies of a value.
- Parameters:
- value
A constant literal value or a unit expression with which to extend the expression result Series; can pass None to extend with nulls.
- n
The number of additional values that will be added.
Examples
>>> df = pl.DataFrame({"values": [1, 2, 3]})
>>> df.select((pl.col("values") - 1).extend_constant(99, n=2))
shape: (5, 1)
┌────────┐
│ values │
│ ---    │
│ i64    │
╞════════╡
│ 0      │
│ 1      │
│ 2      │
│ 99     │
│ 99     │
└────────┘
- fill_nan( ) Expr[source]
Fill floating point NaN value with a fill value.
- Parameters:
- value
Value used to fill NaN values.
See also
Notes
A NaN value is not the same as a null value. To fill null values, use fill_null().
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1.0, None, float("nan")],
...         "b": [4.0, float("nan"), 6],
...     }
... )
>>> df.with_columns(pl.col("b").fill_nan(0))
shape: (3, 2)
┌──────┬─────┐
│ a    ┆ b   │
│ ---  ┆ --- │
│ f64  ┆ f64 │
╞══════╪═════╡
│ 1.0  ┆ 4.0 │
│ null ┆ 0.0 │
│ NaN  ┆ 6.0 │
└──────┴─────┘
- fill_null(
- value: Any | Expr | None = None,
- strategy: FillNullStrategy | None = None,
- limit: int | None = None,
Fill null values using the specified value or strategy.
To interpolate over null values see interpolate. See the examples below to fill nulls with an expression.
- Parameters:
- value
Value used to fill null values.
- strategy{None, 'forward', 'backward', 'min', 'max', 'mean', 'zero', 'one'}
Strategy used to fill null values.
- limit
Number of consecutive null values to fill when using the 'forward' or 'backward' strategy.
See also
Notes
A null value is not the same as a NaN value. To fill NaN values, use fill_nan().
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, None],
...         "b": [4, None, 6],
...     }
... )
>>> df.with_columns(pl.col("b").fill_null(strategy="zero"))
shape: (3, 2)
┌──────┬─────┐
│ a    ┆ b   │
│ ---  ┆ --- │
│ i64  ┆ i64 │
╞══════╪═════╡
│ 1    ┆ 4   │
│ 2    ┆ 0   │
│ null ┆ 6   │
└──────┴─────┘
>>> df.with_columns(pl.col("b").fill_null(99))
shape: (3, 2)
┌──────┬─────┐
│ a    ┆ b   │
│ ---  ┆ --- │
│ i64  ┆ i64 │
╞══════╪═════╡
│ 1    ┆ 4   │
│ 2    ┆ 99  │
│ null ┆ 6   │
└──────┴─────┘
>>> df.with_columns(pl.col("b").fill_null(strategy="forward"))
shape: (3, 2)
┌──────┬─────┐
│ a    ┆ b   │
│ ---  ┆ --- │
│ i64  ┆ i64 │
╞══════╪═════╡
│ 1    ┆ 4   │
│ 2    ┆ 4   │
│ null ┆ 6   │
└──────┴─────┘
>>> df.with_columns(pl.col("b").fill_null(pl.col("b").median()))
shape: (3, 2)
┌──────┬─────┐
│ a    ┆ b   │
│ ---  ┆ --- │
│ i64  ┆ f64 │
╞══════╪═════╡
│ 1    ┆ 4.0 │
│ 2    ┆ 5.0 │
│ null ┆ 6.0 │
└──────┴─────┘
>>> df.with_columns(pl.all().fill_null(pl.all().median()))
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 1.0 ┆ 4.0 │
│ 2.0 ┆ 5.0 │
│ 1.5 ┆ 6.0 │
└─────┴─────┘
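None of the examples above exercises limit, so here is a plain-Python sketch of how a forward fill with a cap on consecutive nulls behaves, following the parameter description. The helper `fill_null_forward` is a hypothetical stand-in and `None` models a Polars null.

```python
from typing import List, Optional, Sequence


def fill_null_forward(
    values: Sequence[Optional[int]], limit: Optional[int] = None
) -> List[Optional[int]]:
    """Forward fill, filling at most `limit` consecutive nulls per gap."""
    out: List[Optional[int]] = []
    last: Optional[int] = None
    run = 0  # length of the current run of nulls
    for v in values:
        if v is not None:
            last, run = v, 0
            out.append(v)
            continue
        run += 1
        can_fill = last is not None and (limit is None or run <= limit)
        out.append(last if can_fill else None)
    return out


unlimited = fill_null_forward([4, None, 6])
capped = fill_null_forward([1, None, None, None, 5], limit=2)
```

In the capped case the third null in the gap is left untouched because only two consecutive nulls may be filled.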
- filter(
- *predicates: IntoExprColumn | Iterable[IntoExprColumn],
- **constraints: Any,
Filter the expression based on one or more predicate expressions.
The original order of the remaining elements is preserved.
Elements where the filter does not evaluate to True are discarded, including nulls.
Mostly useful in an aggregation context. If you want to filter on a DataFrame level, use LazyFrame.filter.
- Parameters:
- predicates
Expression(s) that evaluate to a boolean Series.
- constraints
Column filters; use name = value to filter columns by the supplied value. Each constraint will behave the same as pl.col(name).eq(value), and be implicitly joined with the other filter conditions using &.
Examples
>>> df = pl.DataFrame(
...     {
...         "group_col": ["g1", "g1", "g2"],
...         "b": [1, 2, 3],
...     }
... )
>>> df.group_by("group_col").agg(
...     lt=pl.col("b").filter(pl.col("b") < 2).sum(),
...     gte=pl.col("b").filter(pl.col("b") >= 2).sum(),
... ).sort("group_col")
shape: (2, 3)
┌───────────┬─────┬─────┐
│ group_col ┆ lt  ┆ gte │
│ ---       ┆ --- ┆ --- │
│ str       ┆ i64 ┆ i64 │
╞═══════════╪═════╪═════╡
│ g1        ┆ 1   ┆ 2   │
│ g2        ┆ 0   ┆ 3   │
└───────────┴─────┴─────┘
Filter expressions can also take constraints as keyword arguments.
>>> df = pl.DataFrame(
...     {
...         "key": ["a", "a", "a", "a", "b", "b", "b", "b", "b"],
...         "n": [1, 2, 2, 3, 1, 3, 3, 2, 3],
...     },
... )
>>> df.group_by("key").agg(
...     n_1=pl.col("n").filter(n=1).sum(),
...     n_2=pl.col("n").filter(n=2).sum(),
...     n_3=pl.col("n").filter(n=3).sum(),
... ).sort(by="key")
shape: (2, 4)
┌─────┬─────┬─────┬─────┐
│ key ┆ n_1 ┆ n_2 ┆ n_3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a   ┆ 1   ┆ 4   ┆ 3   │
│ b   ┆ 1   ┆ 2   ┆ 9   │
└─────┴─────┴─────┴─────┘
- first(*, ignore_nulls: bool = False) Expr[source]
Get the first value.
- Parameters:
- ignore_nulls
Ignore null values (default False). If set to True, the first non-null value is returned, otherwise None is returned if no non-null value exists.
Examples
>>> df = pl.DataFrame({"a": [None, 1, 2]})
>>> df.select(pl.col("a").first())
shape: (1, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ null │
└──────┘
>>> df.select(pl.col("a").first(ignore_nulls=True))
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
└─────┘
- flatten() Expr[source]
Flatten a list or string column.
Alias for Expr.list.explode().
Deprecated since version 1.38: Expr.flatten() is deprecated and will be removed in version 2.0. Use Expr.list.explode(keep_nulls=False, empty_as_null=False) instead, which provides the behavior you likely expect.
Examples
>>> df = pl.DataFrame(
...     {
...         "group": ["a", "b", "b"],
...         "values": [[1, 2], [2, 3], [4]],
...     }
... )
>>> df.group_by("group").agg(pl.col("values").flatten())
shape: (2, 2)
┌───────┬───────────┐
│ group ┆ values    │
│ ---   ┆ ---       │
│ str   ┆ list[i64] │
╞═══════╪═══════════╡
│ a     ┆ [1, 2]    │
│ b     ┆ [2, 3, 4] │
└───────┴───────────┘
- floor() Expr[source]
Rounds down to the nearest integer value.
Only works on floating point Series.
Examples
>>> df = pl.DataFrame({"a": [0.3, 0.5, 1.0, 1.1]})
>>> df.select(pl.col("a").floor())
shape: (4, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 0.0 │
│ 0.0 │
│ 1.0 │
│ 1.0 │
└─────┘
- floordiv(other: Any) Expr[source]
Method equivalent of integer division operator expr // other.
- Parameters:
- other
Numeric literal or expression value.
See also
Examples
>>> df = pl.DataFrame({"x": [1, 2, 3, 4, 5]})
>>> df.with_columns(
...     pl.col("x").truediv(2).alias("x/2"),
...     pl.col("x").floordiv(2).alias("x//2"),
... )
shape: (5, 3)
┌─────┬─────┬──────┐
│ x   ┆ x/2 ┆ x//2 │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ f64 ┆ i64  │
╞═════╪═════╪══════╡
│ 1   ┆ 0.5 ┆ 0    │
│ 2   ┆ 1.0 ┆ 1    │
│ 3   ┆ 1.5 ┆ 1    │
│ 4   ┆ 2.0 ┆ 2    │
│ 5   ┆ 2.5 ┆ 2    │
└─────┴─────┴──────┘
Note that Polars' floordiv is subtly different from Python's floor division. For example, consider 6.0 floor-divided by 0.1. Python gives:
>>> 6.0 // 0.1
59.0
because 0.1 is not represented internally as that exact value, but a slightly larger value. So the result of the division is slightly less than 60, meaning the flooring operation returns 59.0.
Polars instead first does the floating-point division, resulting in a floating-point value of 60.0, and then performs the flooring operation using floor:
>>> df = pl.DataFrame({"x": [6.0, 6.03]})
>>> df.with_columns(
...     pl.col("x").truediv(0.1).alias("x/0.1"),
... ).with_columns(
...     pl.col("x/0.1").floor().alias("x/0.1 floor"),
... )
shape: (2, 3)
┌──────┬───────┬─────────────┐
│ x    ┆ x/0.1 ┆ x/0.1 floor │
│ ---  ┆ ---   ┆ ---         │
│ f64  ┆ f64   ┆ f64         │
╞══════╪═══════╪═════════════╡
│ 6.0  ┆ 60.0  ┆ 60.0        │
│ 6.03 ┆ 60.3  ┆ 60.0        │
└──────┴───────┴─────────────┘
yielding the more intuitive result 60.0. The row with x = 6.03 is included to demonstrate the effect of the flooring operation.
floordiv combines those two steps to give the same result with one expression:
>>> df.with_columns(
...     pl.col("x").floordiv(0.1).alias("x//0.1"),
... )
shape: (2, 2)
┌──────┬────────┐
│ x    ┆ x//0.1 │
│ ---  ┆ ---    │
│ f64  ┆ f64    │
╞══════╪════════╡
│ 6.0  ┆ 60.0   │
│ 6.03 ┆ 60.0   │
└──────┴────────┘
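The arithmetic behind this difference can be reproduced in plain Python. The following is a minimal sketch using only the standard library; it illustrates the two orders of operation described above and is not Polars' internal implementation. The helper name divide_then_floor is hypothetical.

```python
import math
from decimal import Decimal

# The double closest to 0.1 is slightly larger than the exact decimal 0.1.
assert Decimal(0.1) > Decimal("0.1")

# Python's floor division works on the exact operand values,
# so the quotient falls just below 60 and is floored to 59.
python_result = 6.0 // 0.1


def divide_then_floor(x: float, divisor: float) -> float:
    """Divide first (rounding to the nearest double), then floor."""
    return float(math.floor(x / divisor))


two_step_result = divide_then_floor(6.0, 0.1)
print(python_result, two_step_result)
```

The two-step version rounds the quotient to a representable double before flooring, which is why it lands on 60.0.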
- forward_fill(limit: int | None = None) Expr[source]
Fill missing values with the last non-null value.
This is an alias of .fill_null(strategy="forward").
- Parameters:
- limit
The number of consecutive null values to forward fill.
See also
- classmethod from_json(value: str_) Expr[source]
Read an expression from a JSON encoded string to construct an Expression.
Deprecated since version 0.20.11: This method has been renamed to deserialize(). Note that the new method operates on file-like inputs rather than strings. Enclose your input in io.StringIO to keep the same behavior.
- Parameters:
- value
JSON encoded string value
- gather(indices: int | Sequence[int] | IntoExpr | Series | np.ndarray[Any, Any], *, null_on_oob: bool = False)
Take values by index.
- Parameters:
- indices
An expression that leads to a UInt32 dtyped Series.
- null_on_oob
Behavior if an index is out of bounds:
True -> set the result to null
False -> raise an error
- Returns:
- Expr
Expression of the same data type.
See also
Expr.get: Take a single value
Examples
>>> df = pl.DataFrame(
...     {
...         "group": [
...             "one",
...             "one",
...             "one",
...             "two",
...             "two",
...             "two",
...         ],
...         "value": [1, 98, 2, 3, 99, 4],
...     }
... )
>>> df.group_by("group", maintain_order=True).agg(
...     pl.col("value").gather([2, 1])
... )
shape: (2, 2)
┌───────┬───────────┐
│ group ┆ value     │
│ ---   ┆ ---       │
│ str   ┆ list[i64] │
╞═══════╪═══════════╡
│ one   ┆ [2, 98]   │
│ two   ┆ [4, 99]   │
└───────┴───────────┘
Use null_on_oob=True to return null for out-of-bounds indices.
>>> df = pl.DataFrame({"a": [1, 2, 3]})
>>> df.select(pl.col("a").gather([0, 1, 10], null_on_oob=True))
shape: (3, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ null │
└──────┘
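The two out-of-bounds behaviors can be pictured with a plain-Python sketch. This is an illustration only, not Polars' implementation: gather_sketch is a hypothetical helper, None stands in for a Polars null, and only non-negative indices are handled.

```python
from typing import Optional, Sequence


def gather_sketch(
    values: Sequence[int],
    indices: Sequence[int],
    null_on_oob: bool = False,
) -> list[Optional[int]]:
    """Pick values by position; None stands in for a Polars null."""
    result: list[Optional[int]] = []
    for i in indices:
        if 0 <= i < len(values):
            result.append(values[i])
        elif null_on_oob:
            # Out of bounds, but the caller asked for a null instead of an error.
            result.append(None)
        else:
            raise IndexError(f"index {i} is out of bounds for length {len(values)}")
    return result


print(gather_sketch([1, 2, 3], [0, 1, 10], null_on_oob=True))
```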
- gather_every(n: int, offset: int = 0) Expr[source]
Take every nth value in the Series and return as a new Series.
- Parameters:
- n
Gather every n-th row.
- offset
Starting index.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3, 4, 5, 6, 7, 8, 9]})
>>> df.select(pl.col("foo").gather_every(3))
shape: (3, 1)
┌─────┐
│ foo │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 4   │
│ 7   │
└─────┘
>>> df.select(pl.col("foo").gather_every(3, offset=1))
shape: (3, 1)
┌─────┐
│ foo │
│ --- │
│ i64 │
╞═════╡
│ 2   │
│ 5   │
│ 8   │
└─────┘
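For intuition, the selection shown above matches Python's extended slicing on a list. The sketch below is an analogy under that assumption, with a hypothetical helper name, and says nothing about how Polars implements the operation.

```python
def gather_every_sketch(values: list, n: int, offset: int = 0) -> list:
    """Take every n-th item, starting at position `offset`."""
    return values[offset::n]


foo = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(gather_every_sketch(foo, 3))
print(gather_every_sketch(foo, 3, offset=1))
```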
- ge(other: Any) Expr[source]
Method equivalent of "greater than or equal" operator expr >= other.
- Parameters:
- other
A literal or expression value to compare with.
Examples
>>> df = pl.DataFrame(
...     data={
...         "x": [5.0, 4.0, float("nan"), 2.0],
...         "y": [5.0, 3.0, float("nan"), 1.0],
...     }
... )
>>> df.with_columns(
...     pl.col("x").ge(pl.col("y")).alias("x >= y"),
... )
shape: (4, 3)
┌─────┬─────┬────────┐
│ x   ┆ y   ┆ x >= y │
│ --- ┆ --- ┆ ---    │
│ f64 ┆ f64 ┆ bool   │
╞═════╪═════╪════════╡
│ 5.0 ┆ 5.0 ┆ true   │
│ 4.0 ┆ 3.0 ┆ true   │
│ NaN ┆ NaN ┆ true   │
│ 2.0 ┆ 1.0 ┆ true   │
└─────┴─────┴────────┘
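In the output above, the NaN row compares as true. Plain Python floats behave differently: every ordered comparison involving NaN is false. The sketch below contrasts the two. ge_nan_equal is a hypothetical helper that only reproduces the rows of this example by treating two NaNs as equal; it is not Polars' implementation and makes no claim about comparisons between NaN and ordinary numbers.

```python
import math


def ge_nan_equal(x: float, y: float) -> bool:
    """x >= y, except that two NaNs are treated as equal to each other."""
    if math.isnan(x) and math.isnan(y):
        return True
    return x >= y


xs = [5.0, 4.0, float("nan"), 2.0]
ys = [5.0, 3.0, float("nan"), 1.0]

python_ge = [x >= y for x, y in zip(xs, ys)]
table_ge = [ge_nan_equal(x, y) for x, y in zip(xs, ys)]
print(python_ge)
print(table_ge)
```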
- get( ) Expr[source]
Return a single value by index.
- Parameters:
- index
An expression that evaluates to an integer. Negative indexing is supported.
- null_on_oob
Behavior if an index is out of bounds:
True -> set the result to null
False -> raise an error
- Returns:
- Expr
Expression of the same data type.
Examples
>>> df = pl.DataFrame( ... { ... "group": [ ... "one", ... "one", ... "one", ... "two", ... "two", ... "two", ... ], ... "value": [1, 98, 2, 3, 99, 4], ... } ... ) >>> df.group_by("group", maintain_order=True).agg(pl.col("value").get(1)) shape: (2, 2) βββββββββ¬ββββββββ β group β value β β --- β --- β β str β i64 β βββββββββͺββββββββ‘ β one β 98 β β two β 99 β βββββββββ΄ββββββββ
- gt(other: Any) Expr[source]
Method equivalent of "greater than" operator expr > other.
- Parameters:
- other
A literal or expression value to compare with.
Examples
>>> df = pl.DataFrame(
...     data={
...         "x": [5.0, 4.0, float("nan"), 2.0],
...         "y": [5.0, 3.0, float("nan"), 1.0],
...     }
... )
>>> df.with_columns(
...     pl.col("x").gt(pl.col("y")).alias("x > y"),
... )
shape: (4, 3)
┌─────┬─────┬───────┐
│ x   ┆ y   ┆ x > y │
│ --- ┆ --- ┆ ---   │
│ f64 ┆ f64 ┆ bool  │
╞═════╪═════╪═══════╡
│ 5.0 ┆ 5.0 ┆ false │
│ 4.0 ┆ 3.0 ┆ true  │
│ NaN ┆ NaN ┆ false │
│ 2.0 ┆ 1.0 ┆ true  │
└─────┴─────┴───────┘
- has_nulls() Expr[source]
Check whether the expression contains one or more null values.
Examples
>>> df = pl.DataFrame( ... { ... "a": [None, 1, None], ... "b": [10, None, 300], ... "c": [350, 650, 850], ... } ... ) >>> df.select(pl.all().has_nulls()) shape: (1, 3) ββββββββ¬βββββββ¬ββββββββ β a β b β c β β --- β --- β --- β β bool β bool β bool β ββββββββͺβββββββͺββββββββ‘ β true β true β false β ββββββββ΄βββββββ΄ββββββββ
- hash( ) Expr[source]
Hash the elements in the selection.
The hash value is of type UInt64.
- Parameters:
- seed
Random seed parameter. Defaults to 0.
- seed_1
Random seed parameter. Defaults to seed if not set.
- seed_2
Random seed parameter. Defaults to seed if not set.
- seed_3
Random seed parameter. Defaults to seed if not set.
Notes
This implementation of hash does not guarantee stable results across different Polars versions. Its stability is only guaranteed within a single version.
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, None], ... "b": ["x", None, "z"], ... } ... ) >>> df.with_columns(pl.all().hash(10, 20, 30, 40)) shape: (3, 2) ββββββββββββββββββββββββ¬βββββββββββββββββββββββ β a β b β β --- β --- β β u64 β u64 β ββββββββββββββββββββββββͺβββββββββββββββββββββββ‘ β 9774092659964970114 β 13614470193936745724 β β 1101441246220388612 β 11638928888656214026 β β 11638928888656214026 β 13382926553367784577 β ββββββββββββββββββββββββ΄βββββββββββββββββββββββ
- head(n: int | Expr = 10) Expr[source]
Get the first n rows.
- Parameters:
- n
Number of rows to return.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3, 4, 5, 6, 7]}) >>> df.select(pl.col("foo").head(3)) shape: (3, 1) βββββββ β foo β β --- β β i64 β βββββββ‘ β 1 β β 2 β β 3 β βββββββ
- hist(bins: IntoExpr | None = None, *, bin_count: int | None = None, include_category: bool = False, include_breakpoint: bool = False)
Bin values into buckets and count their occurrences.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- bins
Bin edges. If None given, we determine the edges based on the data.
- bin_count
If bins is not provided, bin_count uniform bins are created that fully encompass the data.
- include_breakpoint
Include a column that indicates the upper breakpoint.
- include_category
Include a column that shows the intervals as categories.
- Returns:
- DataFrame
Examples
>>> df = pl.DataFrame({"a": [1, 3, 8, 8, 2, 1, 3]})
>>> df.select(pl.col("a").hist(bins=[1, 2, 3]))
shape: (2, 1)
┌─────┐
│ a   │
│ --- │
│ u32 │
╞═════╡
│ 3   │
│ 2   │
└─────┘
>>> df.select(
...     pl.col("a").hist(
...         bins=[1, 2, 3], include_breakpoint=True, include_category=True
...     )
... )
shape: (2, 1)
┌──────────────────────┐
│ a                    │
│ ---                  │
│ struct[3]            │
╞══════════════════════╡
│ {2.0,"[1.0, 2.0]",3} │
│ {3.0,"(2.0, 3.0]",2} │
└──────────────────────┘
- implode(*, maintain_order: bool = True) Expr[source]
Aggregate values into a list.
The returned list itself is a scalar value of list dtype.
- Parameters:
- maintain_order
Whether to preserve the order of elements in the list. Setting this to False can improve performance, especially within group_by.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3],
...         "b": [4, 5, 6],
...     }
... )
>>> df.select(pl.all().implode())
shape: (1, 2)
┌───────────┬───────────┐
│ a         ┆ b         │
│ ---       ┆ ---       │
│ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╡
│ [1, 2, 3] ┆ [4, 5, 6] │
└───────────┴───────────┘
- index_of(element: IntoExpr) Expr[source]
Get the index of the first occurrence of a value, or None if it's not found.
- Parameters:
- element
Value to find.
Examples
>>> df = pl.DataFrame({"a": [1, None, 17]})
>>> df.select(
...     [
...         pl.col("a").index_of(17).alias("seventeen"),
...         pl.col("a").index_of(None).alias("null"),
...         pl.col("a").index_of(55).alias("fiftyfive"),
...     ]
... )
shape: (1, 3)
┌───────────┬──────┬───────────┐
│ seventeen ┆ null ┆ fiftyfive │
│ ---       ┆ ---  ┆ ---       │
│ u32       ┆ u32  ┆ u32       │
╞═══════════╪══════╪═══════════╡
│ 2         ┆ 1    ┆ null      │
└───────────┴──────┴───────────┘
- inspect(fmt: str_ = '{}') Expr[source]
Print the value that this expression evaluates to and pass on the value.
Examples
>>> df = pl.DataFrame({"foo": [1, 1, 2]})
>>> df.select(pl.col("foo").cum_sum().inspect("value is: {}").alias("bar"))
value is: shape: (3,)
Series: 'foo' [i64]
[
    1
    2
    4
]
shape: (3, 1)
┌─────┐
│ bar │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 4   │
└─────┘
- interpolate(method: InterpolationMethod = 'linear') Expr[source]
Interpolate intermediate values.
Nulls at the beginning and end of the series remain null.
- Parameters:
- method : {'linear', 'nearest'}
Interpolation method.
Examples
Fill null values using linear interpolation.
>>> df = pl.DataFrame(
...     {
...         "a": [1, None, 3],
...         "b": [1.0, float("nan"), 3.0],
...     }
... )
>>> df.select(pl.all().interpolate())
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 1.0 ┆ 1.0 │
│ 2.0 ┆ NaN │
│ 3.0 ┆ 3.0 │
└─────┴─────┘
Fill null values using nearest interpolation.
>>> df.select(pl.all().interpolate("nearest"))
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 1.0 │
│ 3   ┆ NaN │
│ 3   ┆ 3.0 │
└─────┴─────┘
Regrid data to a new grid.
>>> df_original_grid = pl.DataFrame(
...     {
...         "grid_points": [1, 3, 10],
...         "values": [2.0, 6.0, 20.0],
...     }
... )  # Interpolate from this to the new grid
>>> df_new_grid = pl.DataFrame({"grid_points": range(1, 11)})
>>> df_new_grid.join(
...     df_original_grid, on="grid_points", how="left", coalesce=True
... ).with_columns(pl.col("values").interpolate())
shape: (10, 2)
┌─────────────┬────────┐
│ grid_points ┆ values │
│ ---         ┆ ---    │
│ i64         ┆ f64    │
╞═════════════╪════════╡
│ 1           ┆ 2.0    │
│ 2           ┆ 4.0    │
│ 3           ┆ 6.0    │
│ 4           ┆ 8.0    │
│ 5           ┆ 10.0   │
│ 6           ┆ 12.0   │
│ 7           ┆ 14.0   │
│ 8           ┆ 16.0   │
│ 9           ┆ 18.0   │
│ 10          ┆ 20.0   │
└─────────────┴────────┘
- interpolate_by(by: IntoExpr) Expr[source]
Fill null values using interpolation based on another column.
Nulls at the beginning and end of the series remain null.
- Parameters:
- by
Column to interpolate values based on.
Examples
Fill null values using linear interpolation.
>>> df = pl.DataFrame(
...     {
...         "a": [1, None, None, 3],
...         "b": [1, 2, 7, 8],
...     }
... )
>>> df.with_columns(a_interpolated=pl.col("a").interpolate_by("b"))
shape: (4, 3)
┌──────┬─────┬────────────────┐
│ a    ┆ b   ┆ a_interpolated │
│ ---  ┆ --- ┆ ---            │
│ i64  ┆ i64 ┆ f64            │
╞══════╪═════╪════════════════╡
│ 1    ┆ 1   ┆ 1.0            │
│ null ┆ 2   ┆ 1.285714       │
│ null ┆ 7   ┆ 2.714286       │
│ 3    ┆ 8   ┆ 3.0            │
└──────┴─────┴────────────────┘
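The interpolated values in this example follow from ordinary linear interpolation with column b as the x-axis. The plain-Python sketch below reproduces them. It is an illustration with a hypothetical helper name, assumes b is sorted in increasing order, uses None for nulls, and is not Polars' implementation.

```python
from typing import Optional


def interpolate_by_sketch(
    values: list[Optional[float]], by: list[float]
) -> list[Optional[float]]:
    """Linearly fill None entries, using `by` as the x-axis.

    Leading and trailing None entries are left as None.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    result = list(values)
    # Walk over each pair of neighbouring known points and fill the gap.
    for left, right in zip(known, known[1:]):
        x0, x1 = by[left], by[right]
        y0, y1 = values[left], values[right]
        for i in range(left + 1, right):
            weight = (by[i] - x0) / (x1 - x0)
            result[i] = y0 + weight * (y1 - y0)
    return result


filled = interpolate_by_sketch([1, None, None, 3], [1, 2, 7, 8])
print([round(v, 6) for v in filled])
```

For the second row, b = 2 lies one seventh of the way from 1 to 8, so the value is 1 + (1/7) * 2, which rounds to 1.285714.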
- is_between(lower_bound: IntoExpr, upper_bound: IntoExpr, closed: ClosedInterval = 'both')
Check if this expression is between the given lower and upper bounds.
- Parameters:
- lower_bound
Lower bound value. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- upper_bound
Upper bound value. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- closed : {'both', 'left', 'right', 'none'}
Define which sides of the interval are closed (inclusive).
- Returns:
- Expr
Expression of data type Boolean.
Notes
If the value of the lower_bound is greater than that of the upper_bound then the result will be False, as no value can satisfy the condition.
Examples
>>> df = pl.DataFrame({"num": [1, 2, 3, 4, 5]})
>>> df.with_columns(pl.col("num").is_between(2, 4).alias("is_between"))
shape: (5, 2)
┌─────┬────────────┐
│ num ┆ is_between │
│ --- ┆ ---        │
│ i64 ┆ bool       │
╞═════╪════════════╡
│ 1   ┆ false      │
│ 2   ┆ true       │
│ 3   ┆ true       │
│ 4   ┆ true       │
│ 5   ┆ false      │
└─────┴────────────┘
Use the closed argument to include or exclude the values at the bounds:
>>> df.with_columns(
...     pl.col("num").is_between(2, 4, closed="left").alias("is_between")
... )
shape: (5, 2)
┌─────┬────────────┐
│ num ┆ is_between │
│ --- ┆ ---        │
│ i64 ┆ bool       │
╞═════╪════════════╡
│ 1   ┆ false      │
│ 2   ┆ true       │
│ 3   ┆ true       │
│ 4   ┆ false      │
│ 5   ┆ false      │
└─────┴────────────┘
You can also use strings as well as numeric/temporal values (note: ensure that string literals are wrapped with lit so as not to conflate them with column names):
>>> df = pl.DataFrame({"a": ["a", "b", "c", "d", "e"]})
>>> df.with_columns(
...     pl.col("a")
...     .is_between(pl.lit("a"), pl.lit("c"), closed="both")
...     .alias("is_between")
... )
shape: (5, 2)
┌─────┬────────────┐
│ a   ┆ is_between │
│ --- ┆ ---        │
│ str ┆ bool       │
╞═════╪════════════╡
│ a   ┆ true       │
│ b   ┆ true       │
│ c   ┆ true       │
│ d   ┆ false      │
│ e   ┆ false      │
└─────┴────────────┘
Use column expressions as lower/upper bounds, comparing to a literal value:
>>> df = pl.DataFrame({"a": [1, 2, 3, 4, 5], "b": [5, 4, 3, 2, 1]})
>>> df.with_columns(
...     pl.lit(3).is_between(pl.col("a"), pl.col("b")).alias("between_ab")
... )
shape: (5, 3)
┌─────┬─────┬────────────┐
│ a   ┆ b   ┆ between_ab │
│ --- ┆ --- ┆ ---        │
│ i64 ┆ i64 ┆ bool       │
╞═════╪═════╪════════════╡
│ 1   ┆ 5   ┆ true       │
│ 2   ┆ 4   ┆ true       │
│ 3   ┆ 3   ┆ true       │
│ 4   ┆ 2   ┆ false      │
│ 5   ┆ 1   ┆ false      │
└─────┴─────┴────────────┘
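The four closed options correspond to the usual interval notations. A minimal plain-Python sketch of that mapping follows; is_between_sketch is a hypothetical helper for scalar values, not the Polars implementation.

```python
def is_between_sketch(value, lower, upper, closed: str = "both") -> bool:
    """Interval membership test for the four `closed` settings."""
    if closed == "both":   # [lower, upper]
        return lower <= value <= upper
    if closed == "left":   # [lower, upper)
        return lower <= value < upper
    if closed == "right":  # (lower, upper]
        return lower < value <= upper
    if closed == "none":   # (lower, upper)
        return lower < value < upper
    raise ValueError(f"unknown closed value: {closed!r}")


nums = [1, 2, 3, 4, 5]
both = [is_between_sketch(n, 2, 4) for n in nums]
left = [is_between_sketch(n, 2, 4, closed="left") for n in nums]
print(both)
print(left)
```

When the lower bound exceeds the upper bound, every branch returns False, matching the note above.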
- is_close( ) Expr[source]
Check if this expression is close, i.e. almost equal, to the other expression.
Two values a and b are considered close if the following condition holds:
\[|a-b| \le \max \{ \text{rel\_tol} \cdot \max \{ |a|, |b| \}, \text{abs\_tol} \}\]
- Parameters:
- other
A literal or expression value to compare with.
- abs_tol
Absolute tolerance. This is the maximum allowed absolute difference between two values. Must be non-negative.
- rel_tol
Relative tolerance. This is the maximum allowed difference between two values, relative to the larger absolute value. Must be non-negative.
- nans_equal
Whether NaN values should be considered equal.
- Returns:
- Expr
Expression of data type Boolean.
Notes
The implementation of this method is symmetric and mirrors the behavior of math.isclose(). Specifically note that this behavior is different to numpy.isclose().
Examples
>>> df = pl.DataFrame({"a": [1.5, 2.0, 2.5], "b": [1.55, 2.2, 3.0]})
>>> df.with_columns(pl.col("a").is_close("b", abs_tol=0.1).alias("is_close"))
shape: (3, 3)
┌─────┬──────┬──────────┐
│ a   ┆ b    ┆ is_close │
│ --- ┆ ---  ┆ ---      │
│ f64 ┆ f64  ┆ bool     │
╞═════╪══════╪══════════╡
│ 1.5 ┆ 1.55 ┆ true     │
│ 2.0 ┆ 2.2  ┆ false    │
│ 2.5 ┆ 3.0  ┆ false    │
└─────┴──────┴──────────┘
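The closeness condition can be written out directly in plain Python, which also makes the symmetry point concrete. This is a sketch for finite values only. The helper names are hypothetical, the default tolerances are borrowed from math.isclose as an assumption, and asymmetric_close uses the |b|-scaled form documented for numpy.isclose without calling NumPy.

```python
import math


def is_close_sketch(
    a: float, b: float, *, abs_tol: float = 0.0, rel_tol: float = 1e-9
) -> bool:
    """Symmetric test: |a - b| <= max(rel_tol * max(|a|, |b|), abs_tol)."""
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)


def asymmetric_close(a: float, b: float, *, rtol: float, atol: float = 0.0) -> bool:
    """Tolerance scaled by |b| only, so swapping the arguments can change the result."""
    return abs(a - b) <= atol + rtol * abs(b)


pairs = [(1.5, 1.55), (2.0, 2.2), (2.5, 3.0)]
results = [is_close_sketch(a, b, abs_tol=0.1) for a, b in pairs]
print(results)

# The symmetric form gives the same answer in both argument orders.
print(is_close_sketch(100.0, 120.0, rel_tol=0.18), is_close_sketch(120.0, 100.0, rel_tol=0.18))
# The |b|-scaled form does not.
print(asymmetric_close(100.0, 120.0, rtol=0.18), asymmetric_close(120.0, 100.0, rtol=0.18))
```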
- is_duplicated() Expr[source]
Return a boolean mask indicating duplicated values.
- Returns:
- Expr
Expression of data type Boolean.
Examples
>>> df = pl.DataFrame({"a": [1, 1, 2]})
>>> df.select(pl.col("a").is_duplicated())
shape: (3, 1)
┌───────┐
│ a     │
│ ---   │
│ bool  │
╞═══════╡
│ true  │
│ true  │
│ false │
└───────┘
- is_empty(*, ignore_nulls: bool = False) Expr[source]
Return whether the column is empty.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- ignore_nulls
If true a column containing only nulls will also be considered empty. The default is false.
- Returns:
- Expr
Expression of data type Boolean.
Examples
>>> df = pl.DataFrame({"x": [None, None]})
>>> df.select(
...     a=pl.col.x.is_empty(),
...     b=pl.col.x.drop_nulls().is_empty(),
...     c=pl.col.x.is_empty(ignore_nulls=True),
... )
shape: (1, 3)
┌───────┬──────┬──────┐
│ a     ┆ b    ┆ c    │
│ ---   ┆ ---  ┆ ---  │
│ bool  ┆ bool ┆ bool │
╞═══════╪══════╪══════╡
│ false ┆ true ┆ true │
└───────┴──────┴──────┘
- is_finite() Expr[source]
Returns a boolean Series indicating which values are finite.
- Returns:
- Expr
Expression of data type Boolean.
Examples
>>> df = pl.DataFrame( ... { ... "A": [1.0, 2], ... "B": [3.0, float("inf")], ... } ... ) >>> df.select(pl.all().is_finite()) shape: (2, 2) ββββββββ¬ββββββββ β A β B β β --- β --- β β bool β bool β ββββββββͺββββββββ‘ β true β true β β true β false β ββββββββ΄ββββββββ
- is_first_distinct() Expr[source]
Return a boolean mask indicating the first occurrence of each distinct value.
- Returns:
- Expr
Expression of data type Boolean.
Examples
>>> df = pl.DataFrame({"a": [1, 1, 2, 3, 2]})
>>> df.with_columns(pl.col("a").is_first_distinct().alias("first"))
shape: (5, 2)
┌─────┬───────┐
│ a   ┆ first │
│ --- ┆ ---   │
│ i64 ┆ bool  │
╞═════╪═══════╡
│ 1   ┆ true  │
│ 1   ┆ false │
│ 2   ┆ true  │
│ 3   ┆ true  │
│ 2   ┆ false │
└─────┴───────┘
- is_in(other: Expr | Collection[Any] | Series, *, nulls_equal: bool = False) Expr[source]
Check if elements of this expression are present in the other Series.
- Parameters:
- other
Series or sequence of primitive type.
- nulls_equal : bool, default False
If True, treat null as a distinct value. Null values will not propagate.
- Returns:
- Expr
Expression of data type Boolean.
Examples
>>> df = pl.DataFrame(
...     {"sets": [[1, 2, 3], [1, 2], [9, 10]], "optional_members": [1, 2, 3]}
... )
>>> df.with_columns(contains=pl.col("optional_members").is_in("sets"))
shape: (3, 3)
┌───────────┬──────────────────┬──────────┐
│ sets      ┆ optional_members ┆ contains │
│ ---       ┆ ---              ┆ ---      │
│ list[i64] ┆ i64              ┆ bool     │
╞═══════════╪══════════════════╪══════════╡
│ [1, 2, 3] ┆ 1                ┆ true     │
│ [1, 2]    ┆ 2                ┆ true     │
│ [9, 10]   ┆ 3                ┆ false    │
└───────────┴──────────────────┴──────────┘
- is_infinite() Expr[source]
Returns a boolean Series indicating which values are infinite.
- Returns:
- Expr
Expression of data type Boolean.
Examples
>>> df = pl.DataFrame( ... { ... "A": [1.0, 2], ... "B": [3.0, float("inf")], ... } ... ) >>> df.select(pl.all().is_infinite()) shape: (2, 2) βββββββββ¬ββββββββ β A β B β β --- β --- β β bool β bool β βββββββββͺββββββββ‘ β false β false β β false β true β βββββββββ΄ββββββββ
- is_last_distinct() Expr[source]
Return a boolean mask indicating the last occurrence of each distinct value.
- Returns:
- Expr
Expression of data type Boolean.
Examples
>>> df = pl.DataFrame({"a": [1, 1, 2, 3, 2]})
>>> df.with_columns(pl.col("a").is_last_distinct().alias("last"))
shape: (5, 2)
┌─────┬───────┐
│ a   ┆ last  │
│ --- ┆ ---   │
│ i64 ┆ bool  │
╞═════╪═══════╡
│ 1   ┆ false │
│ 1   ┆ true  │
│ 2   ┆ false │
│ 3   ┆ true  │
│ 2   ┆ true  │
└─────┴───────┘
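The first-occurrence and last-occurrence masks can be sketched in plain Python with a set of already-seen values. The helpers below are hypothetical and only illustrate the semantics shown in the is_first_distinct and is_last_distinct examples; they assume hashable values and are not how Polars computes the masks.

```python
def first_distinct_mask(values: list) -> list[bool]:
    """True where a value appears for the first time, scanning left to right."""
    seen = set()
    mask = []
    for v in values:
        mask.append(v not in seen)
        seen.add(v)
    return mask


def last_distinct_mask(values: list) -> list[bool]:
    """True where a value appears for the last time: scan the reversed list."""
    return first_distinct_mask(values[::-1])[::-1]


a = [1, 1, 2, 3, 2]
print(first_distinct_mask(a))
print(last_distinct_mask(a))
```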
- is_nan() Expr[source]
Returns a boolean Series indicating which values are NaN.
Notes
Floating point NaN (Not A Number) should not be confused with missing data represented as Null/None.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, None, 1, 5],
...         "b": [1.0, 2.0, float("nan"), 1.0, 5.0],
...     }
... )
>>> df.with_columns(pl.col(pl.Float64).is_nan().name.suffix("_isnan"))
shape: (5, 3)
┌──────┬─────┬─────────┐
│ a    ┆ b   ┆ b_isnan │
│ ---  ┆ --- ┆ ---     │
│ i64  ┆ f64 ┆ bool    │
╞══════╪═════╪═════════╡
│ 1    ┆ 1.0 ┆ false   │
│ 2    ┆ 2.0 ┆ false   │
│ null ┆ NaN ┆ true    │
│ 1    ┆ 1.0 ┆ false   │
│ 5    ┆ 5.0 ┆ false   │
└──────┴─────┴─────────┘
- is_not_nan() Expr[source]
Returns a boolean Series indicating which values are not NaN.
Notes
Floating point NaN (Not A Number) should not be confused with missing data represented as Null/None.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, None, 1, 5],
...         "b": [1.0, 2.0, float("nan"), 1.0, 5.0],
...     }
... )
>>> df.with_columns(pl.col(pl.Float64).is_not_nan().name.suffix("_is_not_nan"))
shape: (5, 3)
┌──────┬─────┬──────────────┐
│ a    ┆ b   ┆ b_is_not_nan │
│ ---  ┆ --- ┆ ---          │
│ i64  ┆ f64 ┆ bool         │
╞══════╪═════╪══════════════╡
│ 1    ┆ 1.0 ┆ true         │
│ 2    ┆ 2.0 ┆ true         │
│ null ┆ NaN ┆ false        │
│ 1    ┆ 1.0 ┆ true         │
│ 5    ┆ 5.0 ┆ true         │
└──────┴─────┴──────────────┘
- is_not_null() Expr[source]
Returns a boolean Series indicating which values are not null.
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, None, 1, 5], ... "b": [1.0, 2.0, float("nan"), 1.0, 5.0], ... } ... ) >>> df.with_columns( ... pl.all().is_not_null().name.suffix("_not_null") # nan != null ... ) shape: (5, 4) ββββββββ¬ββββββ¬βββββββββββββ¬βββββββββββββ β a β b β a_not_null β b_not_null β β --- β --- β --- β --- β β i64 β f64 β bool β bool β ββββββββͺββββββͺβββββββββββββͺβββββββββββββ‘ β 1 β 1.0 β true β true β β 2 β 2.0 β true β true β β null β NaN β false β true β β 1 β 1.0 β true β true β β 5 β 5.0 β true β true β ββββββββ΄ββββββ΄βββββββββββββ΄βββββββββββββ
- is_null() Expr[source]
Returns a boolean Series indicating which values are null.
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, None, 1, 5], ... "b": [1.0, 2.0, float("nan"), 1.0, 5.0], ... } ... ) >>> df.with_columns(pl.all().is_null().name.suffix("_isnull")) # nan != null shape: (5, 4) ββββββββ¬ββββββ¬βββββββββββ¬βββββββββββ β a β b β a_isnull β b_isnull β β --- β --- β --- β --- β β i64 β f64 β bool β bool β ββββββββͺββββββͺβββββββββββͺβββββββββββ‘ β 1 β 1.0 β false β false β β 2 β 2.0 β false β false β β null β NaN β true β false β β 1 β 1.0 β false β false β β 5 β 5.0 β false β false β ββββββββ΄ββββββ΄βββββββββββ΄βββββββββββ
- is_unique() Expr[source]
Get mask of unique values.
Examples
>>> df = pl.DataFrame({"a": [1, 1, 2]}) >>> df.select(pl.col("a").is_unique()) shape: (3, 1) βββββββββ β a β β --- β β bool β βββββββββ‘ β false β β false β β true β βββββββββ
- item(*, allow_empty: bool = False) Expr[source]
Get the single value.
This raises an error if there is not exactly one value.
- Parameters:
- allow_empty
Allow having no values to return null.
See also
Expr.get(): Get a single value by index.
Examples
>>> df = pl.DataFrame({"a": [1]})
>>> df.select(pl.col("a").item())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
└─────┘
>>> df = pl.DataFrame({"a": [1, 2, 3]})
>>> df.select(pl.col("a").item())
Traceback (most recent call last):
...
polars.exceptions.ComputeError: aggregation 'item' expected a single value, got 3 values
...
>>> df.head(0).select(pl.col("a").item(allow_empty=True))
shape: (1, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ null │
└──────┘
- kurtosis(*, fisher: bool = True, bias: bool = True) Expr[source]
Compute the kurtosis (Fisher or Pearson) of a dataset.
Kurtosis is the fourth central moment divided by the square of the variance. If Fisher's definition is used, then 3.0 is subtracted from the result to give 0.0 for a normal distribution. If bias is False then the kurtosis is calculated using k statistics to eliminate bias coming from biased moment estimators.
See scipy.stats for more information.
- Parameters:
- fisher : bool, optional
If True, Fisher's definition is used (normal ==> 0.0). If False, Pearson's definition is used (normal ==> 3.0).
- bias : bool, optional
If False, the calculations are corrected for statistical bias.
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3, 2, 1]})
>>> df.select(pl.col("a").kurtosis())
shape: (1, 1)
┌───────────┐
│ a         │
│ ---       │
│ f64       │
╞═══════════╡
│ -1.153061 │
└───────────┘
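The value in this example can be checked by hand from the definition above. The plain-Python sketch below covers only the biased estimator (the bias=True default); kurtosis_sketch is a hypothetical helper, and the k-statistics correction used when bias is False is not shown.

```python
def kurtosis_sketch(values: list[float], fisher: bool = True) -> float:
    """Biased kurtosis: fourth central moment over squared variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # biased variance
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    pearson = m4 / m2**2
    # Fisher's definition subtracts 3.0 so a normal distribution gives 0.0.
    return pearson - 3.0 if fisher else pearson


data = [1, 2, 3, 2, 1]
print(round(kurtosis_sketch(data), 6))
print(round(kurtosis_sketch(data, fisher=False), 6))
```

For this data the mean is 1.8, the biased variance is 0.56 and the fourth central moment is 0.5792, so the Pearson value is 0.5792 / 0.3136 and the Fisher value is that minus 3.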
- last(*, ignore_nulls: bool = False) Expr[source]
Get the last value.
- Parameters:
- ignore_nulls
Ignore null values (default False). If set to True, the last non-null value is returned, or None if no non-null value exists.
Examples
>>> df = pl.DataFrame({"a": [1, 3, 2]}) >>> df.select(pl.col("a").last()) shape: (1, 1) βββββββ β a β β --- β β i64 β βββββββ‘ β 2 β βββββββ
- le(other: Any) Expr[source]
Method equivalent of "less than or equal" operator expr <= other.
- Parameters:
- other
A literal or expression value to compare with.
Examples
>>> df = pl.DataFrame(
...     data={
...         "x": [5.0, 4.0, float("nan"), 0.5],
...         "y": [5.0, 3.5, float("nan"), 2.0],
...     }
... )
>>> df.with_columns(
...     pl.col("x").le(pl.col("y")).alias("x <= y"),
... )
shape: (4, 3)
┌─────┬─────┬────────┐
│ x   ┆ y   ┆ x <= y │
│ --- ┆ --- ┆ ---    │
│ f64 ┆ f64 ┆ bool   │
╞═════╪═════╪════════╡
│ 5.0 ┆ 5.0 ┆ true   │
│ 4.0 ┆ 3.5 ┆ false  │
│ NaN ┆ NaN ┆ true   │
│ 0.5 ┆ 2.0 ┆ true   │
└─────┴─────┴────────┘
- len() Expr[source]
Return the number of elements in the column.
Null values count towards the total.
- Returns:
- Expr
Expression of data type UInt32.
See also
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [None, 4, 4]}) >>> df.select(pl.all().len()) shape: (1, 2) βββββββ¬ββββββ β a β b β β --- β --- β β u32 β u32 β βββββββͺββββββ‘ β 3 β 3 β βββββββ΄ββββββ
- limit(n: int | Expr = 10) Expr[source]
Get the first n rows (alias for Expr.head()).
- Parameters:
- n
Number of rows to return.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3, 4, 5, 6, 7]}) >>> df.select(pl.col("foo").limit(3)) shape: (3, 1) βββββββ β foo β β --- β β i64 β βββββββ‘ β 1 β β 2 β β 3 β βββββββ
- log(base: float | IntoExpr = 2.718281828459045) Expr[source]
Compute the logarithm to a given base.
- Parameters:
- base
Given base, defaults to e.
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3]})
>>> df.select(pl.col("a").log(base=2))
shape: (3, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ 0.0      │
│ 1.0      │
│ 1.584963 │
└──────────┘
- log10() Expr[source]
Compute the base 10 logarithm of the input array, element-wise.
Examples
>>> df = pl.DataFrame({"values": [1.0, 2.0, 4.0]}) >>> df.select(pl.col("values").log10()) shape: (3, 1) βββββββββββ β values β β --- β β f64 β βββββββββββ‘ β 0.0 β β 0.30103 β β 0.60206 β βββββββββββ
- log1p() Expr[source]
Compute the natural logarithm of each element plus one.
This computes log(1 + x) but is more numerically stable for x close to zero.
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3]})
>>> df.select(pl.col("a").log1p())
shape: (3, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ 0.693147 │
│ 1.098612 │
│ 1.386294 │
└──────────┘
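The stability claim is easiest to see for a very small x, where 1 + x rounds to exactly 1.0 in double precision. The sketch below uses Python's math module as a stand-in to show the effect; it does not call Polars.

```python
import math

x = 1e-18

# 1 + x is not representable and rounds to 1.0, so the naive form loses x entirely.
naive = math.log(1 + x)

# log1p works on x directly and keeps the leading term of the series x - x**2/2 + ...
stable = math.log1p(x)

print(naive, stable)
```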
- lower_bound() Expr[source]
Calculate the lower bound.
Returns a unit Series with the lowest value possible for the dtype of this expression.
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3, 2, 1]}) >>> df.select(pl.col("a").lower_bound()) shape: (1, 1) ββββββββββββββββββββββββ β a β β --- β β i64 β ββββββββββββββββββββββββ‘ β -9223372036854775808 β ββββββββββββββββββββββββ
- lt(other: Any) Expr[source]
Method equivalent of "less than" operator expr < other.
- Parameters:
- other
A literal or expression value to compare with.
Examples
>>> df = pl.DataFrame(
...     data={
...         "x": [1.0, 2.0, float("nan"), 3.0],
...         "y": [2.0, 2.0, float("nan"), 4.0],
...     }
... )
>>> df.with_columns(
...     pl.col("x").lt(pl.col("y")).alias("x < y"),
... )
shape: (4, 3)
┌─────┬─────┬───────┐
│ x   ┆ y   ┆ x < y │
│ --- ┆ --- ┆ ---   │
│ f64 ┆ f64 ┆ bool  │
╞═════╪═════╪═══════╡
│ 1.0 ┆ 2.0 ┆ true  │
│ 2.0 ┆ 2.0 ┆ false │
│ NaN ┆ NaN ┆ false │
│ 3.0 ┆ 4.0 ┆ true  │
└─────┴─────┴───────┘
- map_batches(function: Callable[[Series], Series | Any], return_dtype: PolarsDataType | DataTypeExpr | None = None, *, agg_list: bool = False, is_elementwise: bool = False, returns_scalar: bool = False)
Apply a custom Python function to a whole Series or sequence of Series.
The output of this custom function is presumed to be either a Series, or a NumPy array (in which case it will be automatically converted into a Series), or a scalar that will be converted into a Series. If the result is a scalar and you want it to stay as a scalar, pass in returns_scalar=True. If you want to apply a custom function elementwise over single values, see map_elements(). A reasonable use case for map functions is transforming the values represented by an expression using a third-party library.
- Parameters:
- function
Lambda/function to apply.
- return_dtype
Datatype of the output Series.
It is recommended to set this whenever possible. If this is None, it tries to infer the datatype by calling the function with dummy data and looking at the output.
- agg_list
First implode when in a group-by aggregation.
Deprecated since version 1.32.0: Use expr.implode().map_batches(..) instead.
- is_elementwise
Set to true if the operation is elementwise for better performance and optimization.
An elementwise operation has unit or equal length for all inputs and can be run sequentially on slices without results being affected.
- returns_scalar
If the function returns a scalar, by default it will be wrapped in a list in the output, since the assumption is that the function always returns something Series-like. If you want to keep the result as a scalar, set this argument to True.
See also
Notes
A UDF passed to map_batches must be pure, meaning that it cannot modify or depend on state other than its arguments. Polars may call the function with arbitrary input data.
Examples
>>> df = pl.DataFrame(
...     {
...         "sine": [0.0, 1.0, 0.0, -1.0],
...         "cosine": [1.0, 0.0, -1.0, 0.0],
...     }
... )
>>> df.select(
...     pl.all().map_batches(
...         lambda x: x.to_numpy().argmax(),
...         returns_scalar=True,
...     )
... )
shape: (1, 2)
┌──────┬────────┐
│ sine ┆ cosine │
│ ---  ┆ ---    │
│ i64  ┆ i64    │
╞══════╪════════╡
│ 1    ┆ 0      │
└──────┴────────┘
Here's an example of a function that returns a scalar, where we want it to stay as a scalar:
>>> df = pl.DataFrame(
...     {
...         "a": [0, 1, 0, 1],
...         "b": [1, 2, 3, 4],
...     }
... )
>>> df.group_by("a").agg(
...     pl.col("b").map_batches(
...         lambda x: x.max(), returns_scalar=True, return_dtype=pl.self_dtype()
...     )
... )
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 0   ┆ 3   │
└─────┴─────┘
Call a function that takes multiple arguments by creating a struct and referencing its fields inside the function call.
>>> import numpy as np
>>> df = pl.DataFrame(
...     {
...         "a": [5, 1, 0, 3],
...         "b": [4, 2, 3, 4],
...     }
... )
>>> df.with_columns(
...     a_times_b=pl.struct("a", "b").map_batches(
...         lambda x: np.multiply(x.struct.field("a"), x.struct.field("b")),
...         return_dtype=pl.Int64,
...     )
... )
shape: (4, 3)
┌─────┬─────┬───────────┐
│ a   ┆ b   ┆ a_times_b │
│ --- ┆ --- ┆ ---       │
│ i64 ┆ i64 ┆ i64       │
╞═════╪═════╪═══════════╡
│ 5   ┆ 4   ┆ 20        │
│ 1   ┆ 2   ┆ 2         │
│ 0   ┆ 3   ┆ 0         │
│ 3   ┆ 4   ┆ 12        │
└─────┴─────┴───────────┘
- map_elements(function: Callable[[Any], Any], return_dtype: PolarsDataType | DataTypeExpr | None = None, *, skip_nulls: bool = True, pass_name: bool = False, strategy: MapElementsStrategy = 'thread_local', returns_scalar: bool = False)
Map a custom/user-defined function (UDF) to each element of a column.
Warning
This method is much slower than the native expressions API. Only use it if you cannot implement your logic otherwise.
Suppose that the function is: x ↦ sqrt(x):
- For mapping elements of a series, consider: pl.col("col_name").sqrt().
- For mapping inner elements of lists, consider: pl.col("col_name").list.eval(pl.element().sqrt()).
- For mapping elements of struct fields, consider: pl.col("col_name").struct.field("field_name").sqrt().
If you want to replace the original column or field, consider .with_columns and .with_fields.
- Parameters:
- function
Lambda/function to map.
- return_dtype
Datatype of the output Series.
It is recommended to set this whenever possible. If this is None, it tries to infer the datatype by calling the function with dummy data and looking at the output.
- skip_nulls
Don't map the function over values that contain nulls (this is faster).
- pass_name
Pass the Series name to the custom function (this is more expensive).
- returns_scalar
Deprecated since version 1.32.0: Is ignored and will be removed in 2.0.
- strategy : {'thread_local', 'threading'}
The threading strategy to use.
'thread_local': run the Python function on a single thread.
'threading': run the Python function on separate threads. Use with care as this can slow performance. This might only speed up your code if the amount of work per element is significant and the Python function releases the GIL (e.g. via calling a C function).
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Notes
Using map_elements is strongly discouraged, as you will effectively be running Python "for" loops, which will be very slow. Wherever possible you should prefer the native expression API to achieve the best performance.
If your function is expensive and you don't want it to be called more than once for a given input, consider applying an @lru_cache decorator to it. If your data is suitable you may achieve significant speedups.
Window function application using over is considered a GroupBy context here, so map_elements can be used to map functions over window groups.
A UDF passed to map_elements must be pure, meaning that it cannot modify or depend on state other than its arguments. Polars may call the function with arbitrary input data.
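The caching advice in the notes can be sketched with the standard library alone. This is a minimal illustration, assuming a hypothetical pure function `expensive_square`; in practice the decorated function would be the one handed to map_elements, and whether caching pays off depends on how many repeated inputs the data contains.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def expensive_square(x: int) -> int:
    """Hypothetical stand-in for a costly, pure per-element function."""
    return x * x


# Repeated inputs are served from the cache instead of re-running the body.
values = [1, 2, 3, 1, 2, 1]
results = [expensive_square(v) for v in values]
info = expensive_square.cache_info()

print(results)                 # [1, 4, 9, 1, 4, 1]
print(info.misses, info.hits)  # 3 3
```

The cache only helps when the function is pure, which is the same requirement stated for UDFs above.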
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 1],
...         "b": ["a", "b", "c", "c"],
...     }
... )
The function is applied to each element of column 'a':
>>> df.with_columns(
...     pl.col("a")
...     .map_elements(lambda x: x * 2, return_dtype=pl.self_dtype())
...     .alias("a_times_2"),
... )
shape: (4, 3)
┌─────┬─────┬───────────┐
│ a   ┆ b   ┆ a_times_2 │
│ --- ┆ --- ┆ ---       │
│ i64 ┆ str ┆ i64       │
╞═════╪═════╪═══════════╡
│ 1   ┆ a   ┆ 2         │
│ 2   ┆ b   ┆ 4         │
│ 3   ┆ c   ┆ 6         │
│ 1   ┆ c   ┆ 2         │
└─────┴─────┴───────────┘
Tip: it is better to implement this with an expression:
>>> df.with_columns(
...     (pl.col("a") * 2).alias("a_times_2"),
... )
>>> (
...     df.lazy()
...     .group_by("b")
...     .agg(
...         pl.col("a")
...         .implode()
...         .map_elements(lambda x: x.sum(), return_dtype=pl.Int64)
...     )
...     .collect()
... )
shape: (3, 2)
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 1   │
│ b   ┆ 2   │
│ c   ┆ 4   │
└─────┴─────┘
Tip: again, it is better to implement this with an expression:
>>> (
...     df.lazy()
...     .group_by("b", maintain_order=True)
...     .agg(pl.col("a").sum())
...     .collect()
... )
Window function application using over will behave as a GroupBy context, with your function receiving individual window groups:
>>> df = pl.DataFrame(
...     {
...         "key": ["x", "x", "y", "x", "y", "z"],
...         "val": [1, 1, 1, 1, 1, 1],
...     }
... )
>>> df.with_columns(
...     scaled=pl.col("val")
...     .implode()
...     .map_elements(lambda s: s * len(s), return_dtype=pl.List(pl.Int64))
...     .explode()
...     .over("key"),
... ).sort("key")
shape: (6, 3)
┌─────┬─────┬────────┐
│ key ┆ val ┆ scaled │
│ --- ┆ --- ┆ ---    │
│ str ┆ i64 ┆ i64    │
╞═════╪═════╪════════╡
│ x   ┆ 1   ┆ 3      │
│ x   ┆ 1   ┆ 3      │
│ x   ┆ 1   ┆ 3      │
│ y   ┆ 1   ┆ 2      │
│ y   ┆ 1   ┆ 2      │
│ z   ┆ 1   ┆ 1      │
└─────┴─────┴────────┘
Note that this function would also be better-implemented natively:
>>> df.with_columns(
...     scaled=(pl.col("val") * pl.col("val").count()).over("key"),
... ).sort("key")
- max() Expr[source]
Get maximum value.
Examples
>>> df = pl.DataFrame({"a": [-1.0, float("nan"), 1.0]})
>>> df.select(pl.col("a").max())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 1.0 │
└─────┘
- max_by(by: IntoExpr) Expr[source]
Get maximum value, ordered by another expression.
If the by expression has multiple values equal to the maximum it is not defined which value will be chosen.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- by
Column used to determine the largest element. Accepts expression input. Strings are parsed as column names.
Examples
>>> df = pl.DataFrame({"a": [-1.0, float("nan"), 1.0], "b": ["x", "y", "z"]})
>>> df.select(pl.col("b").max_by("a"))
shape: (1, 1)
┌─────┐
│ b   │
│ --- │
│ str │
╞═════╡
│ z   │
└─────┘
- mean() Expr[source]
Get mean value.
Examples
>>> df = pl.DataFrame({"a": [-1, 0, 1]})
>>> df.select(pl.col("a").mean())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 0.0 │
└─────┘
- median() Expr[source]
Get median value using linear interpolation.
Examples
>>> df = pl.DataFrame({"a": [-1, 0, 1]})
>>> df.select(pl.col("a").median())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 0.0 │
└─────┘
- min() Expr[source]
Get minimum value.
Examples
>>> df = pl.DataFrame({"a": [-1.0, float("nan"), 1.0]})
>>> df.select(pl.col("a").min())
shape: (1, 1)
┌──────┐
│ a    │
│ ---  │
│ f64  │
╞══════╡
│ -1.0 │
└──────┘
- min_by(by: IntoExpr) Expr[source]
Get minimum value, ordered by another expression.
If the by expression has multiple values equal to the minimum it is not defined which value will be chosen.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- by
Column used to determine the smallest element. Accepts expression input. Strings are parsed as column names.
Examples
>>> df = pl.DataFrame({"a": [-1.0, float("nan"), 1.0], "b": ["x", "y", "z"]})
>>> df.select(pl.col("b").min_by("a"))
shape: (1, 1)
┌─────┐
│ b   │
│ --- │
│ str │
╞═════╡
│ x   │
└─────┘
- mod(other: Any) Expr[source]
Method equivalent of modulus operator expr % other.
- Parameters:
- other
Numeric literal or expression value.
Examples
>>> df = pl.DataFrame({"x": [0, 1, 2, 3, 4]})
>>> df.with_columns(pl.col("x").mod(2).alias("x%2"))
shape: (5, 2)
┌─────┬─────┐
│ x   ┆ x%2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0   ┆ 0   │
│ 1   ┆ 1   │
│ 2   ┆ 0   │
│ 3   ┆ 1   │
│ 4   ┆ 0   │
└─────┴─────┘
- mode(*, maintain_order: bool = False) Expr[source]
Compute the most frequently occurring value(s).
Can return multiple values.
- Parameters:
- maintain_order
Maintain order of data. This requires more work.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 1, 2, 3],
...         "b": [1, 1, 2, 2],
...     }
... )
>>> df.select(pl.all().mode().first())
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 1   │
└─────┴─────┘
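The statement that mode can return multiple values is easiest to see on a tie. The following plain-Python sketch models the idea with `collections.Counter`; it is not how Polars computes the result, and the first-seen ordering of tied values is an assumption made for the illustration.

```python
from collections import Counter


def modes(values: list) -> list:
    """Return every value that shares the highest count, in first-seen order."""
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]


single = modes([1, 1, 2, 3])  # column "a" from the example above
tied = modes([1, 1, 2, 2])    # column "b": two values tie

print(single)  # [1]
print(tied)    # [1, 2]
```

This is why the example above calls first() after mode(): column "b" has two modes, so a single row can only be shown by picking one of them.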
- mul(other: Any) Expr[source]
Method equivalent of multiplication operator expr * other.
- Parameters:
- other
Numeric literal or expression value.
Examples
>>> df = pl.DataFrame({"x": [1, 2, 4, 8, 16]})
>>> df.with_columns(
...     pl.col("x").mul(2).alias("x*2"),
...     pl.col("x").mul(pl.col("x").log(2)).alias("x * xlog2"),
... )
shape: (5, 3)
┌─────┬─────┬───────────┐
│ x   ┆ x*2 ┆ x * xlog2 │
│ --- ┆ --- ┆ ---       │
│ i64 ┆ i64 ┆ f64       │
╞═════╪═════╪═══════════╡
│ 1   ┆ 2   ┆ 0.0       │
│ 2   ┆ 4   ┆ 2.0       │
│ 4   ┆ 8   ┆ 8.0       │
│ 8   ┆ 16  ┆ 24.0      │
│ 16  ┆ 32  ┆ 64.0      │
└─────┴─────┴───────────┘
- n_unique() Expr[source]
Count unique values.
Notes
null is considered to be a unique value for the purposes of this operation.
Examples
>>> df = pl.DataFrame({"x": [1, 1, 2, 2, 3], "y": [1, 1, 1, None, None]})
>>> df.select(
...     x_unique=pl.col("x").n_unique(),
...     y_unique=pl.col("y").n_unique(),
... )
shape: (1, 2)
┌──────────┬──────────┐
│ x_unique ┆ y_unique │
│ ---      ┆ ---      │
│ u32      ┆ u32      │
╞══════════╪══════════╡
│ 3        ┆ 2        │
└──────────┴──────────┘
- nan_max() Expr[source]
Get maximum value, but propagate/poison encountered NaN values.
This differs from numpy's nanmax, which ignores NaN values. Note that numpy's max propagates NaN values by default, whereas Polars' max ignores them.
Examples
>>> df = pl.DataFrame({"a": [0.0, float("nan")]})
>>> df.select(pl.col("a").nan_max())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ NaN │
└─────┘
- nan_min() Expr[source]
Get minimum value, but propagate/poison encountered NaN values.
This differs from numpy's nanmin, which ignores NaN values. Note that numpy's min propagates NaN values by default, whereas Polars' min ignores them.
Examples
>>> df = pl.DataFrame({"a": [0.0, float("nan")]})
>>> df.select(pl.col("a").nan_min())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ NaN │
└─────┘
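The difference between ignoring and propagating NaN can be modelled in a few lines of plain Python. This is only a sketch of the semantics shown in the max and nan_max examples, not Polars code; the helper names are hypothetical, and returning NaN for an all-NaN input in the "ignoring" variant is an assumption.

```python
import math


def max_ignoring_nan(values):
    """Model of max(): NaN values are skipped."""
    kept = [v for v in values if not math.isnan(v)]
    # Assumption: with nothing left to compare, fall back to NaN.
    return max(kept) if kept else math.nan


def max_propagating_nan(values):
    """Model of nan_max(): a single NaN poisons the result."""
    if any(math.isnan(v) for v in values):
        return math.nan
    return max(values)


data = [-1.0, math.nan, 1.0]
ignored = max_ignoring_nan(data)
poisoned = max_propagating_nan(data)

print(ignored)   # 1.0
print(poisoned)  # nan
```

The minimum variants follow the same pattern with min in place of max.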
- ne(other: Any) Expr[source]
Method equivalent of inequality operator expr != other.
- Parameters:
- other
A literal or expression value to compare with.
Examples
>>> df = pl.DataFrame(
...     data={
...         "x": [1.0, 2.0, float("nan"), 4.0],
...         "y": [2.0, 2.0, float("nan"), 4.0],
...     }
... )
>>> df.with_columns(
...     pl.col("x").ne(pl.col("y")).alias("x != y"),
... )
shape: (4, 3)
┌─────┬─────┬────────┐
│ x   ┆ y   ┆ x != y │
│ --- ┆ --- ┆ ---    │
│ f64 ┆ f64 ┆ bool   │
╞═════╪═════╪════════╡
│ 1.0 ┆ 2.0 ┆ true   │
│ 2.0 ┆ 2.0 ┆ false  │
│ NaN ┆ NaN ┆ false  │
│ 4.0 ┆ 4.0 ┆ false  │
└─────┴─────┴────────┘
- ne_missing(other: Any) Expr[source]
Method equivalent of inequality operator expr != other where None == None.
This differs from the default ne, where null values are propagated.
- Parameters:
- other
A literal or expression value to compare with.
Examples
>>> df = pl.DataFrame(
...     data={
...         "x": [1.0, 2.0, float("nan"), 4.0, None, None],
...         "y": [2.0, 2.0, float("nan"), 4.0, 5.0, None],
...     }
... )
>>> df.with_columns(
...     pl.col("x").ne(pl.col("y")).alias("x ne y"),
...     pl.col("x").ne_missing(pl.col("y")).alias("x ne_missing y"),
... )
shape: (6, 4)
┌──────┬──────┬────────┬────────────────┐
│ x    ┆ y    ┆ x ne y ┆ x ne_missing y │
│ ---  ┆ ---  ┆ ---    ┆ ---            │
│ f64  ┆ f64  ┆ bool   ┆ bool           │
╞══════╪══════╪════════╪════════════════╡
│ 1.0  ┆ 2.0  ┆ true   ┆ true           │
│ 2.0  ┆ 2.0  ┆ false  ┆ false          │
│ NaN  ┆ NaN  ┆ false  ┆ false          │
│ 4.0  ┆ 4.0  ┆ false  ┆ false          │
│ null ┆ 5.0  ┆ null   ┆ true           │
│ null ┆ null ┆ null   ┆ false          │
└──────┴──────┴────────┴────────────────┘
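The null handling in the table above can be restated as two small plain-Python functions, with None standing in for null. This is a sketch of the semantics only, with hypothetical helper names. NaN rows are left out on purpose, because Python's own `!=` treats NaN differently from the table above.

```python
from typing import Optional


def ne(a, b) -> Optional[bool]:
    """Model of ne: a null on either side yields null (None)."""
    if a is None or b is None:
        return None
    return a != b


def ne_missing(a, b) -> bool:
    """Model of ne_missing: null is treated as a comparable value."""
    if a is None or b is None:
        # Unequal exactly when only one of the two sides is null.
        return (a is None) != (b is None)
    return a != b


xs = [1.0, 2.0, 4.0, None, None]
ys = [2.0, 2.0, 4.0, 5.0, None]
plain = [ne(a, b) for a, b in zip(xs, ys)]
missing = [ne_missing(a, b) for a, b in zip(xs, ys)]

print(plain)    # [True, False, False, None, None]
print(missing)  # [True, False, False, True, False]
```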
- neg() Expr[source]
Method equivalent of unary minus operator -expr.
Examples
>>> df = pl.DataFrame({"a": [-1, 0, 2, None]})
>>> df.with_columns(pl.col("a").neg())
shape: (4, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 0    │
│ -2   │
│ null │
└──────┘
- not_() Expr[source]
Method equivalent of bitwise 'not' operator ~expr.
This has the effect of negating logical boolean expressions, but operates bitwise on integers.
Examples
>>> df = pl.DataFrame(
...     {
...         "label": ["aa", "bb", "cc", "dd", "ee"],
...         "valid": [True, False, None, False, True],
...         "int_code": [1, 0, 2, None, -1],
...     }
... )
Apply 'not' to boolean expression (negates the value) and integer expression (operates bitwise):
>>> df.with_columns(
...     not_valid=pl.col("valid").not_(),
...     not_int_code=pl.col("int_code").not_(),
... )
shape: (5, 5)
┌───────┬───────┬──────────┬───────────┬──────────────┐
│ label ┆ valid ┆ int_code ┆ not_valid ┆ not_int_code │
│ ---   ┆ ---   ┆ ---      ┆ ---       ┆ ---          │
│ str   ┆ bool  ┆ i64      ┆ bool      ┆ i64          │
╞═══════╪═══════╪══════════╪═══════════╪══════════════╡
│ aa    ┆ true  ┆ 1        ┆ false     ┆ -2           │
│ bb    ┆ false ┆ 0        ┆ true      ┆ -1           │
│ cc    ┆ null  ┆ 2        ┆ null      ┆ -3           │
│ dd    ┆ false ┆ null     ┆ true      ┆ null         │
│ ee    ┆ true  ┆ -1       ┆ false     ┆ 0            │
└───────┴───────┴──────────┴───────────┴──────────────┘
- null_count() Expr[source]
Count null values.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [None, 1, None],
...         "b": [10, None, 300],
...         "c": [350, 650, 850],
...     }
... )
>>> df.select(pl.all().null_count())
shape: (1, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 │
╞═════╪═════╪═════╡
│ 2   ┆ 1   ┆ 0   │
└─────┴─────┴─────┘
- or_(*others: Any) Expr[source]
Method equivalent of bitwise 'or' operator expr | other | ....
This has the effect of combining logical boolean expressions, but operates bitwise on integers.
- Parameters:
- *others
One or more integer or boolean expressions to evaluate/combine.
Examples
>>> df = pl.DataFrame(
...     data={
...         "x": [5, 6, 7, 4, 8],
...         "y": [1.5, 2.5, 1.0, 4.0, -5.75],
...         "z": [-9, 2, -1, 4, 8],
...     }
... )
Combine logical 'or' conditions:
>>> df.select(
...     (pl.col("x") == pl.col("y"))
...     .or_(
...         pl.col("x") == pl.col("y"),
...         pl.col("y") == pl.col("z"),
...         pl.col("y").cast(int) == pl.col("z"),
...     )
...     .alias("any")
... )
shape: (5, 1)
┌───────┐
│ any   │
│ ---   │
│ bool  │
╞═══════╡
│ false │
│ true  │
│ false │
│ true  │
│ false │
└───────┘
Bitwise 'or' operation on integer columns:
>>> df.select("x", "z", x_or_z=pl.col("x").or_(pl.col("z")))
shape: (5, 3)
┌─────┬─────┬────────┐
│ x   ┆ z   ┆ x_or_z │
│ --- ┆ --- ┆ ---    │
│ i64 ┆ i64 ┆ i64    │
╞═════╪═════╪════════╡
│ 5   ┆ -9  ┆ -9     │
│ 6   ┆ 2   ┆ 6      │
│ 7   ┆ -1  ┆ -1     │
│ 4   ┆ 4   ┆ 4      │
│ 8   ┆ 8   ┆ 8      │
└─────┴─────┴────────┘
- over(
- partition_by: IntoExpr | Iterable[IntoExpr] | None = None,
- *more_exprs: IntoExpr,
- order_by: IntoExpr | Iterable[IntoExpr] | None = None,
- descending: bool = False,
- nulls_last: bool = False,
- mapping_strategy: WindowMappingStrategy = 'group_to_rows',
Compute expressions over the given groups.
This expression is similar to performing a group by aggregation and joining the result back into the original DataFrame.
The outcome is similar to how window functions work in PostgreSQL.
- Parameters:
- partition_by
Column(s) to group by. Accepts expression input. Strings are parsed as column names.
- *more_exprs
Additional columns to group by, specified as positional arguments.
- order_by
Order rows within each partition group before evaluating the expression. Useful for order-sensitive operations such as cum_sum() or diff().
- descending
If 'order_by' is given, indicate whether to sort in descending order instead of ascending order.
- nulls_last
If 'order_by' is given, indicate whether to place nulls last.
- mapping_strategy: {'group_to_rows', 'join', 'explode'}
- group_to_rows
If the aggregation results in multiple values per group, map them back to their row position in the DataFrame. This can only be done if each group yields the same number of elements before aggregation as after. If the aggregation results in one scalar value per group, this value will be mapped to every row.
- join
If the aggregation may result in multiple values per group, join the values as 'List<group_dtype>' to each row position. Warning: this can be memory intensive. If the aggregation always results in one scalar value per group, join this value as '<group_dtype>' to each row position.
- explode
If the aggregation may result in multiple values per group, map each value to a new row, similar to the results of group_by + agg + explode. If the aggregation always results in one scalar value per group, map this value to one row position. The given groups must be sorted if they are not part of the window operation; otherwise the result would not make sense. This operation changes the number of rows.
Examples
Pass the name of a column to compute the expression over that column.
>>> df = pl.DataFrame(
...     {
...         "a": ["a", "a", "b", "b", "b"],
...         "b": [1, 2, 3, 5, 3],
...         "c": [5, 4, 3, 2, 1],
...     }
... )
>>> df.with_columns(c_max=pl.col("c").max().over("a"))
shape: (5, 4)
┌─────┬─────┬─────┬───────┐
│ a   ┆ b   ┆ c   ┆ c_max │
│ --- ┆ --- ┆ --- ┆ ---   │
│ str ┆ i64 ┆ i64 ┆ i64   │
╞═════╪═════╪═════╪═══════╡
│ a   ┆ 1   ┆ 5   ┆ 5     │
│ a   ┆ 2   ┆ 4   ┆ 5     │
│ b   ┆ 3   ┆ 3   ┆ 3     │
│ b   ┆ 5   ┆ 2   ┆ 3     │
│ b   ┆ 3   ┆ 1   ┆ 3     │
└─────┴─────┴─────┴───────┘
Expression input is also supported.
>>> df.with_columns(c_max=pl.col("c").max().over(pl.col("b") // 2))
shape: (5, 4)
┌─────┬─────┬─────┬───────┐
│ a   ┆ b   ┆ c   ┆ c_max │
│ --- ┆ --- ┆ --- ┆ ---   │
│ str ┆ i64 ┆ i64 ┆ i64   │
╞═════╪═════╪═════╪═══════╡
│ a   ┆ 1   ┆ 5   ┆ 5     │
│ a   ┆ 2   ┆ 4   ┆ 4     │
│ b   ┆ 3   ┆ 3   ┆ 4     │
│ b   ┆ 5   ┆ 2   ┆ 2     │
│ b   ┆ 3   ┆ 1   ┆ 4     │
└─────┴─────┴─────┴───────┘
Group by multiple columns by passing multiple column names or expressions.
>>> df.with_columns(c_min=pl.col("c").min().over("a", pl.col("b") % 2))
shape: (5, 4)
┌─────┬─────┬─────┬───────┐
│ a   ┆ b   ┆ c   ┆ c_min │
│ --- ┆ --- ┆ --- ┆ ---   │
│ str ┆ i64 ┆ i64 ┆ i64   │
╞═════╪═════╪═════╪═══════╡
│ a   ┆ 1   ┆ 5   ┆ 5     │
│ a   ┆ 2   ┆ 4   ┆ 4     │
│ b   ┆ 3   ┆ 3   ┆ 1     │
│ b   ┆ 5   ┆ 2   ┆ 1     │
│ b   ┆ 3   ┆ 1   ┆ 1     │
└─────┴─────┴─────┴───────┘
Mapping strategy join joins the values by group.
>>> df.with_columns(
...     c_pairs=pl.col("c").head(2).over("a", mapping_strategy="join")
... )
shape: (5, 4)
┌─────┬─────┬─────┬───────────┐
│ a   ┆ b   ┆ c   ┆ c_pairs   │
│ --- ┆ --- ┆ --- ┆ ---       │
│ str ┆ i64 ┆ i64 ┆ list[i64] │
╞═════╪═════╪═════╪═══════════╡
│ a   ┆ 1   ┆ 5   ┆ [5, 4]    │
│ a   ┆ 2   ┆ 4   ┆ [5, 4]    │
│ b   ┆ 3   ┆ 3   ┆ [3, 2]    │
│ b   ┆ 5   ┆ 2   ┆ [3, 2]    │
│ b   ┆ 3   ┆ 1   ┆ [3, 2]    │
└─────┴─────┴─────┴───────────┘
Mapping strategy explode maps the values to new rows, changing the shape.
>>> df.select(
...     c_first_2=pl.col("c").head(2).over("a", mapping_strategy="explode")
... )
shape: (4, 1)
┌───────────┐
│ c_first_2 │
│ ---       │
│ i64       │
╞═══════════╡
│ 5         │
│ 4         │
│ 3         │
│ 2         │
└───────────┘
You can use non-elementwise expressions with over too. By default they are evaluated using row-order, but you can specify a different one using order_by.
>>> from datetime import date
>>> df = pl.DataFrame(
...     {
...         "store_id": ["a", "a", "b", "b"],
...         "date": [
...             date(2024, 9, 18),
...             date(2024, 9, 17),
...             date(2024, 9, 18),
...             date(2024, 9, 16),
...         ],
...         "sales": [7, 9, 8, 10],
...     }
... )
>>> df.with_columns(
...     cumulative_sales=pl.col("sales")
...     .cum_sum()
...     .over("store_id", order_by="date")
... )
shape: (4, 4)
┌──────────┬────────────┬───────┬──────────────────┐
│ store_id ┆ date       ┆ sales ┆ cumulative_sales │
│ ---      ┆ ---        ┆ ---   ┆ ---              │
│ str      ┆ date       ┆ i64   ┆ i64              │
╞══════════╪════════════╪═══════╪══════════════════╡
│ a        ┆ 2024-09-18 ┆ 7     ┆ 16               │
│ a        ┆ 2024-09-17 ┆ 9     ┆ 9                │
│ b        ┆ 2024-09-18 ┆ 8     ┆ 18               │
│ b        ┆ 2024-09-16 ┆ 10    ┆ 10               │
└──────────┴────────────┴───────┴──────────────────┘
If you don't require that the group order be preserved, then the more performant option is to use mapping_strategy='explode'. Be careful, however, to only ever use this in a select statement, not a with_columns one.
>>> window = {
...     "partition_by": "store_id",
...     "order_by": "date",
...     "mapping_strategy": "explode",
... }
>>> df.select(
...     pl.all().over(**window),
...     cumulative_sales=pl.col("sales").cum_sum().over(**window),
... )
shape: (4, 4)
┌──────────┬────────────┬───────┬──────────────────┐
│ store_id ┆ date       ┆ sales ┆ cumulative_sales │
│ ---      ┆ ---        ┆ ---   ┆ ---              │
│ str      ┆ date       ┆ i64   ┆ i64              │
╞══════════╪════════════╪═══════╪══════════════════╡
│ a        ┆ 2024-09-17 ┆ 9     ┆ 9                │
│ a        ┆ 2024-09-18 ┆ 7     ┆ 16               │
│ b        ┆ 2024-09-16 ┆ 10    ┆ 10               │
│ b        ┆ 2024-09-18 ┆ 8     ┆ 18               │
└──────────┴────────────┴───────┴──────────────────┘
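The description of over as "a group by aggregation joined back onto the original rows" can be modelled without Polars. The sketch below is a plain-Python illustration of the default 'group_to_rows' behaviour for two of the examples above; the helper names `max_over` and `cum_sum_over` are hypothetical, and this is not how Polars evaluates window expressions internally.

```python
from collections import defaultdict


def max_over(values, keys):
    """Model of expr.max().over(key): one value per group, mapped to every row."""
    best = {}
    for v, k in zip(values, keys):
        best[k] = v if k not in best else max(best[k], v)
    return [best[k] for k in keys]


def cum_sum_over(values, keys, order_by):
    """Model of expr.cum_sum().over(key, order_by=...): results keep row positions."""
    out = [None] * len(values)
    running = defaultdict(int)
    # Visit rows in order_by order, but write each result to its original row.
    for i in sorted(range(len(values)), key=lambda i: order_by[i]):
        running[keys[i]] += values[i]
        out[i] = running[keys[i]]
    return out


c_max = max_over([5, 4, 3, 2, 1], ["a", "a", "b", "b", "b"])
cumulative_sales = cum_sum_over(
    [7, 9, 8, 10],
    ["a", "a", "b", "b"],
    ["2024-09-18", "2024-09-17", "2024-09-18", "2024-09-16"],
)

print(c_max)             # [5, 5, 3, 3, 3]
print(cumulative_sales)  # [16, 9, 18, 10]
```

Both results line up with the original row order, which is the property that distinguishes 'group_to_rows' from 'explode'.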
- pct_change(n: int | IntoExprColumn = 1) Expr[source]
Computes percentage change between values.
Percentage change (as fraction) between current element and most-recent non-null element at least n period(s) before the current element.
Computes the change from the previous row by default.
- Parameters:
- n
Periods to shift for forming percent change.
Notes
Null values are preserved. If you're coming from pandas, this matches their fill_method=None behaviour.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [10, 11, 12, None, 12],
...     }
... )
>>> df.with_columns(pl.col("a").pct_change().alias("pct_change"))
shape: (5, 2)
┌──────┬────────────┐
│ a    ┆ pct_change │
│ ---  ┆ ---        │
│ i64  ┆ f64        │
╞══════╪════════════╡
│ 10   ┆ null       │
│ 11   ┆ 0.1        │
│ 12   ┆ 0.090909   │
│ null ┆ null       │
│ 12   ┆ null       │
└──────┴────────────┘
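The arithmetic behind the example output can be written out in plain Python. This sketch follows the Notes and the table above (nulls are preserved rather than forward-filled) and uses None for null; it is an illustrative model under that assumption, not Polars' implementation, and it does not handle a zero in the denominator.

```python
def pct_change(values, n=1):
    """Fractional change versus the element n rows earlier; None stays None."""
    out = []
    for i, cur in enumerate(values):
        prev = values[i - n] if i >= n else None
        if cur is None or prev is None:
            out.append(None)  # nothing to compare against, or a null involved
        else:
            out.append(cur / prev - 1)
    return out


changes = pct_change([10, 11, 12, None, 12])
rounded = [None if c is None else round(c, 6) for c in changes]

print(rounded)  # [None, 0.1, 0.090909, None, None]
```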
- peak_max() Expr[source]
Get a boolean mask of the local maximum peaks.
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3, 4, 5]})
>>> df.select(pl.col("a").peak_max())
shape: (5, 1)
┌───────┐
│ a     │
│ ---   │
│ bool  │
╞═══════╡
│ false │
│ false │
│ false │
│ false │
│ true  │
└───────┘
- peak_min() Expr[source]
Get a boolean mask of the local minimum peaks.
Examples
>>> df = pl.DataFrame({"a": [4, 1, 3, 2, 5]})
>>> df.select(pl.col("a").peak_min())
shape: (5, 1)
┌───────┐
│ a     │
│ ---   │
│ bool  │
╞═══════╡
│ false │
│ true  │
│ false │
│ true  │
│ false │
└───────┘
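A local peak can be read as "beats each neighbour it has". The plain-Python sketch below reproduces the two example outputs above under that reading; the strict comparison and the treatment of the first and last element (compared with their single neighbour) are assumptions inferred from those outputs, not taken from the Polars source.

```python
def peak_max(values):
    """True where a value is strictly greater than each neighbour it has."""
    n = len(values)
    return [
        all(values[i] > values[j] for j in (i - 1, i + 1) if 0 <= j < n)
        for i in range(n)
    ]


def peak_min(values):
    """True where a value is strictly smaller than each neighbour it has."""
    n = len(values)
    return [
        all(values[i] < values[j] for j in (i - 1, i + 1) if 0 <= j < n)
        for i in range(n)
    ]


maxima = peak_max([1, 2, 3, 4, 5])
minima = peak_min([4, 1, 3, 2, 5])

print(maxima)  # [False, False, False, False, True]
print(minima)  # [False, True, False, True, False]
```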
- pipe(
- function: Callable[Concatenate[Expr, P], T],
- *args: P.args,
- **kwargs: P.kwargs,
Offers a structured way to apply a sequence of user-defined functions (UDFs).
- Parameters:
- function
Callable; will receive the expression as the first parameter, followed by any given args/kwargs.
- *args
Arguments to pass to the UDF.
- **kwargs
Keyword arguments to pass to the UDF.
Examples
>>> def extract_number(expr: pl.Expr) -> pl.Expr:
...     """Extract the digits from a string."""
...     return expr.str.extract(r"\d+", 0).cast(pl.Int64)
>>>
>>> def scale_negative_even(expr: pl.Expr, *, n: int = 1) -> pl.Expr:
...     """Set even numbers negative, and scale by a user-supplied value."""
...     expr = pl.when(expr % 2 == 0).then(-expr).otherwise(expr)
...     return expr * n
>>>
>>> df = pl.DataFrame({"val": ["a: 1", "b: 2", "c: 3", "d: 4"]})
>>> df.with_columns(
...     udfs=(
...         pl.col("val").pipe(extract_number).pipe(scale_negative_even, n=5)
...     ),
... )
shape: (4, 2)
┌──────┬──────┐
│ val  ┆ udfs │
│ ---  ┆ ---  │
│ str  ┆ i64  │
╞══════╪══════╡
│ a: 1 ┆ 5    │
│ b: 2 ┆ -10  │
│ c: 3 ┆ 15   │
│ d: 4 ┆ -20  │
└──────┴──────┘
- pow(exponent: IntoExprColumn | int | float) Expr[source]
Method equivalent of exponentiation operator expr ** exponent.
If the exponent is a float, the result follows the dtype of the exponent. Otherwise, it follows the dtype of the base.
- Parameters:
- exponent
Numeric literal or expression exponent value.
Examples
>>> df = pl.DataFrame({"x": [1, 2, 4, 8]})
>>> df.with_columns(
...     pl.col("x").pow(3).alias("cube"),
...     pl.col("x").pow(pl.col("x").log(2)).alias("x ** xlog2"),
... )
shape: (4, 3)
┌─────┬──────┬────────────┐
│ x   ┆ cube ┆ x ** xlog2 │
│ --- ┆ ---  ┆ ---        │
│ i64 ┆ i64  ┆ f64        │
╞═════╪══════╪════════════╡
│ 1   ┆ 1    ┆ 1.0        │
│ 2   ┆ 8    ┆ 2.0        │
│ 4   ┆ 64   ┆ 16.0       │
│ 8   ┆ 512  ┆ 512.0      │
└─────┴──────┴────────────┘
Raising an integer to a positive integer results in an integer. To raise to a negative integer, you can cast either the base or the exponent to float first:
>>> df.with_columns(
...     x_squared=pl.col("x").pow(2),
...     x_inverse=pl.col("x").pow(-1.0),
... )
shape: (4, 3)
┌─────┬───────────┬───────────┐
│ x   ┆ x_squared ┆ x_inverse │
│ --- ┆ ---       ┆ ---       │
│ i64 ┆ i64       ┆ f64       │
╞═════╪═══════════╪═══════════╡
│ 1   ┆ 1         ┆ 1.0       │
│ 2   ┆ 4         ┆ 0.5       │
│ 4   ┆ 16        ┆ 0.25      │
│ 8   ┆ 64        ┆ 0.125     │
└─────┴───────────┴───────────┘
- product() Expr[source]
Compute the product of an expression.
Notes
If there are no non-null values, then the output is 1. If you would prefer empty products to return None, you can use pl.when(expr.count()>0).then(expr.product()) instead of expr.product().
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3]})
>>> df.select(pl.col("a").product())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 6   │
└─────┘
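The empty-product convention from the Notes is the same one the Python standard library uses. As a plain-Python sketch with None standing in for null (the helper names are hypothetical and this is not Polars code), the two behaviours look like this:

```python
import math


def product(values):
    """Multiply the non-null values; an empty product is 1."""
    non_null = [v for v in values if v is not None]
    return math.prod(non_null)  # math.prod([]) returns 1


def product_or_none(values):
    """Variant that returns None when there is nothing to multiply."""
    non_null = [v for v in values if v is not None]
    return math.prod(non_null) if non_null else None


print(product([1, 2, 3]))             # 6
print(product([None, None]))          # 1
print(product_or_none([None, None]))  # None
```

The second function mirrors the pl.when(expr.count()>0) guard suggested in the Notes.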
- qcut(
- quantiles: Sequence[float] | int,
- *,
- labels: Sequence[str_] | None = None,
- left_closed: bool = False,
- allow_duplicates: bool = False,
- include_breaks: bool = False,
Bin continuous values into discrete categories based on their quantiles.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- quantiles
Either a list of quantile probabilities between 0 and 1 or a positive integer determining the number of bins with uniform probability.
- labels
Names of the categories. The number of labels must be equal to the number of categories.
- left_closed
Set the intervals to be left-closed instead of right-closed.
- allow_duplicates
If set to True, duplicates in the resulting quantiles are dropped, rather than raising a DuplicateError. This can happen even with unique probabilities, depending on the data.
- include_breaks
Include a column with the right endpoint of the bin each observation falls in. This will change the data type of the output from a Categorical to a Struct.
- Returns:
- Expr
Expression of data type Categorical if include_breaks is set to False (default), otherwise an expression of data type Struct.
Examples
Divide a column into three categories according to pre-defined quantile probabilities.
>>> df = pl.DataFrame({"foo": [-2, -1, 0, 1, 2]})
>>> df.with_columns(
...     pl.col("foo").qcut([0.25, 0.75], labels=["a", "b", "c"]).alias("qcut")
... )
shape: (5, 2)
┌─────┬──────┐
│ foo ┆ qcut │
│ --- ┆ ---  │
│ i64 ┆ cat  │
╞═════╪══════╡
│ -2  ┆ a    │
│ -1  ┆ a    │
│ 0   ┆ b    │
│ 1   ┆ b    │
│ 2   ┆ c    │
└─────┴──────┘
Divide a column into two categories using uniform quantile probabilities.
>>> df.with_columns(
...     pl.col("foo")
...     .qcut(2, labels=["low", "high"], left_closed=True)
...     .alias("qcut")
... )
shape: (5, 2)
┌─────┬──────┐
│ foo ┆ qcut │
│ --- ┆ ---  │
│ i64 ┆ cat  │
╞═════╪══════╡
│ -2  ┆ low  │
│ -1  ┆ low  │
│ 0   ┆ high │
│ 1   ┆ high │
│ 2   ┆ high │
└─────┴──────┘
Add both the category and the breakpoint.
>>> df.with_columns(
...     pl.col("foo").qcut([0.25, 0.75], include_breaks=True).alias("qcut")
... ).unnest("qcut")
shape: (5, 3)
┌─────┬────────────┬────────────┐
│ foo ┆ breakpoint ┆ category   │
│ --- ┆ ---        ┆ ---        │
│ i64 ┆ f64        ┆ cat        │
╞═════╪════════════╪════════════╡
│ -2  ┆ -1.0       ┆ (-inf, -1] │
│ -1  ┆ -1.0       ┆ (-inf, -1] │
│ 0   ┆ 1.0        ┆ (-1, 1]    │
│ 1   ┆ 1.0        ┆ (-1, 1]    │
│ 2   ┆ inf        ┆ (1, inf]   │
└─────┴────────────┴────────────┘
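The effect of left_closed on which bin a boundary value lands in can be sketched with the standard `bisect` module. The function below takes the breakpoints as given rather than computing quantiles: the values -1.0 and 1.0 come from the breakpoint column above, while 0.0 for the two-bin case is an assumption (the median of the data). The helper name `assign_bins` is hypothetical and this is not how Polars performs the binning.

```python
from bisect import bisect_left, bisect_right


def assign_bins(values, breaks, labels, left_closed=False):
    """Map each value to the label of the interval it falls in."""
    if len(labels) != len(breaks) + 1:
        raise ValueError("need exactly one more label than breakpoints")
    # Right-closed bins (a, b]: a value equal to a break stays in the lower bin.
    # Left-closed bins [a, b): a value equal to a break moves to the upper bin.
    locate = bisect_right if left_closed else bisect_left
    return [labels[locate(breaks, v)] for v in values]


foo = [-2, -1, 0, 1, 2]
three_way = assign_bins(foo, [-1.0, 1.0], ["a", "b", "c"])
two_way = assign_bins(foo, [0.0], ["low", "high"], left_closed=True)

print(three_way)  # ['a', 'a', 'b', 'b', 'c']
print(two_way)    # ['low', 'low', 'high', 'high', 'high']
```

The length check reflects the rule stated for labels: one label per category, and there is always one more category than breakpoints.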
- quantile( ) Expr[source]
Get quantile value.
- Parameters:
- quantile
Quantile(s) between 0.0 and 1.0. Can be a single float or a list of floats.
If a single float, returns a single f64 value per row.
If a list of floats, returns a list of f64 values per row (one value per quantile).
- interpolation{'nearest', 'higher', 'lower', 'midpoint', 'linear', 'equiprobable'}
Interpolation method.
Examples
>>> df = pl.DataFrame({"a": [0, 1, 2, 3, 4, 5]})
>>> df.select(pl.col("a").quantile(0.3))
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 2.0 │
└─────┘
>>> df.select(pl.col("a").quantile(0.3, interpolation="higher"))
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 2.0 │
└─────┘
>>> df.select(pl.col("a").quantile(0.3, interpolation="lower"))
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 1.0 │
└─────┘
>>> df.select(pl.col("a").quantile(0.3, interpolation="midpoint"))
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 1.5 │
└─────┘
>>> df.select(pl.col("a").quantile(0.3, interpolation="linear"))
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 1.5 │
└─────┘
>>> df.select(pl.col("a").quantile([0.25, 0.75], interpolation="linear"))
shape: (1, 1)
┌──────────────┐
│ a            │
│ ---          │
│ list[f64]    │
╞══════════════╡
│ [1.25, 3.75] │
└──────────────┘
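The different results above come from how a fractional position in the sorted data is resolved. The plain-Python sketch below assumes the position is q * (n - 1), which reproduces the example outputs; it covers only 'lower', 'higher', 'midpoint' and 'linear', and leaves out 'nearest' and 'equiprobable' because their exact rules are not stated here. It is an illustrative model, not Polars' implementation.

```python
import math


def quantile(values, q, interpolation="linear"):
    """Resolve the fractional index q * (n - 1) into a single value."""
    data = sorted(values)
    pos = q * (len(data) - 1)  # assumed position formula
    lo, hi = math.floor(pos), math.ceil(pos)
    if interpolation == "lower":
        return float(data[lo])
    if interpolation == "higher":
        return float(data[hi])
    if interpolation == "midpoint":
        return (data[lo] + data[hi]) / 2
    if interpolation == "linear":
        return data[lo] + (pos - lo) * (data[hi] - data[lo])
    raise ValueError(f"unsupported interpolation: {interpolation!r}")


a = [0, 1, 2, 3, 4, 5]
by_method = {m: quantile(a, 0.3, m) for m in ("lower", "higher", "midpoint", "linear")}
quartiles = [quantile(a, q) for q in (0.25, 0.75)]

print(by_method)  # {'lower': 1.0, 'higher': 2.0, 'midpoint': 1.5, 'linear': 1.5}
print(quartiles)  # [1.25, 3.75]
```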
- radians() Expr[source]
Convert from degrees to radians.
- Returns:
- Expr
Expression of data type Float64.
Examples
>>> df = pl.DataFrame({"a": [-720, -540, -360, -180, 0, 180, 360, 540, 720]})
>>> df.select(pl.col("a").radians())
shape: (9, 1)
┌────────────┐
│ a          │
│ ---        │
│ f64        │
╞════════════╡
│ -12.566371 │
│ -9.424778  │
│ -6.283185  │
│ -3.141593  │
│ 0.0        │
│ 3.141593   │
│ 6.283185   │
│ 9.424778   │
│ 12.566371  │
└────────────┘
- rank( ) Expr[source]
Assign ranks to data, dealing with ties appropriately.
- Parameters:
- method{'average', 'min', 'max', 'dense', 'ordinal', 'random'}
The method used to assign ranks to tied elements. The following methods are available (default is 'average'):
'average' : The average of the ranks that would have been assigned to all the tied values is assigned to each value.
'min' : The minimum of the ranks that would have been assigned to all the tied values is assigned to each value. (This is also referred to as 'competition' ranking.)
'max' : The maximum of the ranks that would have been assigned to all the tied values is assigned to each value.
'dense' : Like 'min', but the rank of the next highest element is assigned the rank immediately after those assigned to the tied elements.
'ordinal' : All values are given a distinct rank, corresponding to the order that the values occur in the Series.
'random' : Like 'ordinal', but the rank for ties is not dependent on the order that the values occur in the Series.
- descending
Rank in descending order.
- seed
If method="random", use this as seed.
Notes
If you're coming from SQL, you may be expecting null values to be ranked last. Polars, however, only ranks non-null values and preserves the null ones.
Examples
The 'average' method:
>>> df = pl.DataFrame({"a": [3, 6, 1, 1, 6]})
>>> df.select(pl.col("a").rank())
shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 3.0 │
│ 4.5 │
│ 1.5 │
│ 1.5 │
│ 4.5 │
└─────┘
The 'ordinal' method:
>>> df = pl.DataFrame({"a": [3, 6, 1, 1, 6]})
>>> df.select(pl.col("a").rank("ordinal"))
shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ u32 │
╞═════╡
│ 3   │
│ 4   │
│ 1   │
│ 2   │
│ 5   │
└─────┘
Use 'rank' with 'over' to rank within groups:
>>> df = pl.DataFrame({"a": [1, 1, 2, 2, 2], "b": [6, 7, 5, 14, 11]})
>>> df.with_columns(pl.col("b").rank().over("a").alias("rank"))
shape: (5, 3)
┌─────┬─────┬──────┐
│ a   ┆ b   ┆ rank │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ i64 ┆ f64  │
╞═════╪═════╪══════╡
│ 1   ┆ 6   ┆ 1.0  │
│ 1   ┆ 7   ┆ 2.0  │
│ 2   ┆ 5   ┆ 1.0  │
│ 2   ┆ 14  ┆ 3.0  │
│ 2   ┆ 11  ┆ 2.0  │
└─────┴─────┴──────┘
Divide by the length or number of non-null values to compute the percentile rank.
>>> df = pl.DataFrame({"a": [6, 7, None, 14, 11]})
>>> df.with_columns(
...     pct=pl.col("a").rank() / pl.len(),
...     pct_valid=pl.col("a").rank() / pl.count("a"),
... )
shape: (5, 3)
┌──────┬──────┬───────────┐
│ a    ┆ pct  ┆ pct_valid │
│ ---  ┆ ---  ┆ ---       │
│ i64  ┆ f64  ┆ f64       │
╞══════╪══════╪═══════════╡
│ 6    ┆ 0.2  ┆ 0.25      │
│ 7    ┆ 0.4  ┆ 0.5       │
│ null ┆ null ┆ null      │
│ 14   ┆ 0.8  ┆ 1.0       │
│ 11   ┆ 0.6  ┆ 0.75      │
└──────┴──────┴───────────┘
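The tie-breaking methods are easiest to compare side by side on the same input. The plain-Python sketch below implements 'average', 'min', 'max', 'dense' and 'ordinal' as described in the parameter list; 'random' and null handling are omitted, and the function is an illustrative model rather than Polars' implementation.

```python
def rank(values, method="average"):
    """Assign 1-based ranks, resolving ties according to `method`."""
    order = sorted(range(len(values)), key=lambda i: values[i])  # stable sort
    ranks = [0] * len(values)
    dense = 0
    start = 0
    while start < len(order):
        # Find the run of equal values beginning at `start`.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        dense += 1
        for pos in range(start, end + 1):
            i = order[pos]
            if method == "average":
                ranks[i] = (start + end) / 2 + 1
            elif method == "min":
                ranks[i] = start + 1
            elif method == "max":
                ranks[i] = end + 1
            elif method == "dense":
                ranks[i] = dense
            elif method == "ordinal":
                ranks[i] = pos + 1
            else:
                raise ValueError(f"unsupported method: {method!r}")
        start = end + 1
    return ranks


a = [3, 6, 1, 1, 6]
table = {m: rank(a, m) for m in ("average", "min", "max", "dense", "ordinal")}
for name, result in table.items():
    print(name, result)
# average [3.0, 4.5, 1.5, 1.5, 4.5]
# min [3, 4, 1, 1, 4]
# max [3, 5, 2, 2, 5]
# dense [2, 3, 1, 1, 3]
# ordinal [3, 4, 1, 2, 5]
```

The 'average' and 'ordinal' rows match the two example outputs above.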
- rechunk() Expr[source]
Create a single chunk of memory for this Series.
Examples
>>> df = pl.DataFrame({"a": [1, 1, 2]})
Create a Series with 3 nulls, append column a, then rechunk.
>>> df.select(pl.repeat(None, 3).append(pl.col("a")).rechunk())
shape: (6, 1)
┌────────┐
│ repeat │
│ ---    │
│ i64    │
╞════════╡
│ null   │
│ null   │
│ null   │
│ 1      │
│ 1      │
│ 2      │
└────────┘
- register_plugin(
- *,
- lib: str_,
- symbol: str_,
- args: list_[IntoExpr] | None = None,
- kwargs: dict[Any, Any] | None = None,
- is_elementwise: bool = False,
- input_wildcard_expansion: bool = False,
- returns_scalar: bool = False,
- cast_to_supertypes: bool = False,
- pass_name_to_apply: bool = False,
- changes_length: bool = False,
Register a plugin function.
Deprecated since version 0.20.16: Use polars.plugins.register_plugin_function() instead.
See the user guide for more information about plugins.
- Parameters:
- Parameters:
- lib
Library to load.
- symbol
Function to load.
- args
Arguments (other than self) passed to this function. These arguments have to be of type Expression.
- kwargs
Non-expression arguments. They must be JSON serializable.
- is_elementwise
If the function only operates on scalars this will trigger fast paths.
- input_wildcard_expansion
Expand expressions as input of this function.
- returns_scalar
Automatically explode on unit length if it ran as the final aggregation. This is the case for aggregations like sum, min, covariance, etc.
- cast_to_supertypes
Cast the input datatypes to their supertype.
- pass_name_to_apply
If set, the Series passed to the function in the group_by operation will have its name set. This requires an extra heap allocation per group.
- changes_length
For example, a unique or a slice.
Warning
This method is deprecated. Use the new polars.plugins.register_plugin_function function instead.
This is highly unsafe as this will call the C function loaded by lib::symbol.
The parameters you set dictate how Polars will handle the function. Make sure they are correct!
- reinterpret( ) Expr[source]
Reinterpret the underlying bits as a signed/unsigned integer or float.
This operation is only allowed for numeric types of the same size. For numbers with fewer bits, you can safely use the cast operation.
Either signed or dtype can be specified. Defaults to signed=True otherwise.
- Parameters:
- signed
If True, reinterpret as signed integer. Otherwise, reinterpret as unsigned integer.
- dtype
DataType to reinterpret to.
Examples
>>> s = pl.Series("a", [1, 1, 2], dtype=pl.UInt64)
>>> df = pl.DataFrame([s])
>>> df.select(
...     [
...         pl.col("a").reinterpret(dtype=pl.Int64).alias("reinterpreted"),
...         pl.col("a").alias("original"),
...     ]
... )
shape: (3, 2)
┌───────────────┬──────────┐
│ reinterpreted ┆ original │
│ ---           ┆ ---      │
│ i64           ┆ u64      │
╞═══════════════╪══════════╡
│ 1             ┆ 1        │
│ 1             ┆ 1        │
│ 2             ┆ 2        │
└───────────────┴──────────┘
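The example above uses small values, where reinterpreting and casting give the same numbers. The difference only shows for values whose top bit is set. The sketch below uses the standard `struct` module to reuse the same 8 bytes under a different type; the helper name is hypothetical and this is an illustration of the idea, not Polars code.

```python
import struct


def reinterpret_u64_as_i64(value: int) -> int:
    """Keep the same 8 bytes, but read them back as a signed integer."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


small = reinterpret_u64_as_i64(2)
all_ones = reinterpret_u64_as_i64(2**64 - 1)
top_bit = reinterpret_u64_as_i64(2**63)

print(small)     # 2
print(all_ones)  # -1
print(top_bit)   # -9223372036854775808
```

No bits change in any of the three cases; only the interpretation of the sign bit does.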
- repeat_by(by: Series | Expr | str_ | int) Expr[source]
Repeat the elements in this Series as specified in the given expression.
The repeated elements are expanded into a List.
- Parameters:
- by
Numeric column that determines how often the values will be repeated. The column will be coerced to UInt32. Give this dtype to make the coercion a no-op.
- Returns:
- Expr
Expression of data type List, where the inner data type is equal to the original data type.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": ["x", "y", "z"],
...         "n": [1, 2, 3],
...     }
... )
>>> df.select(pl.col("a").repeat_by("n"))
shape: (3, 1)
┌─────────────────┐
│ a               │
│ ---             │
│ list[str]       │
╞═════════════════╡
│ ["x"]           │
│ ["y", "y"]      │
│ ["z", "z", "z"] │
└─────────────────┘
- replace(old: IntoExpr | Sequence[Any] | Mapping[Any, Any], new: IntoExpr | Sequence[Any] | NoDefault = <no_default>, *, default: IntoExpr | NoDefault = <no_default>, return_dtype: PolarsDataType | None = None) Expr[source]
Replace the given values by different values of the same data type.
- Parameters:
- old
Value or sequence of values to replace. Accepts expression input. Sequences are parsed as Series, other non-expression inputs are parsed as literals. Also accepts a mapping of values to their replacement as syntactic sugar for replace(old=Series(mapping.keys()), new=Series(mapping.values())).
- new
Value or sequence of values to replace by. Accepts expression input. Sequences are parsed as Series, other non-expression inputs are parsed as literals. Length must match the length of old or have length 1.
- default
Set values that were not replaced to this value. Defaults to keeping the original value. Accepts expression input. Non-expression inputs are parsed as literals.
Deprecated since version 1.0.0: Use replace_strict() instead to set a default while replacing values.
- return_dtype
The data type of the resulting expression. If set to None (default), the data type of the original column is preserved.
Deprecated since version 1.0.0: Use replace_strict() instead to set a return data type while replacing values, or explicitly call cast() on the output.
Examples
Replace a single value by another value. Values that were not replaced remain unchanged.
>>> df = pl.DataFrame({"a": [1, 2, 2, 3]})
>>> df.with_columns(replaced=pl.col("a").replace(2, 100))
shape: (4, 2)
┌─────┬──────────┐
│ a   ┆ replaced │
│ --- ┆ ---      │
│ i64 ┆ i64      │
╞═════╪══════════╡
│ 1   ┆ 1        │
│ 2   ┆ 100      │
│ 2   ┆ 100      │
│ 3   ┆ 3        │
└─────┴──────────┘
Replace multiple values by passing sequences to the old and new parameters.
>>> df.with_columns(replaced=pl.col("a").replace([2, 3], [100, 200]))
shape: (4, 2)
┌─────┬──────────┐
│ a   ┆ replaced │
│ --- ┆ ---      │
│ i64 ┆ i64      │
╞═════╪══════════╡
│ 1   ┆ 1        │
│ 2   ┆ 100      │
│ 2   ┆ 100      │
│ 3   ┆ 200      │
└─────┴──────────┘
Passing a mapping with replacements is also supported as syntactic sugar.
>>> mapping = {2: 100, 3: 200}
>>> df.with_columns(replaced=pl.col("a").replace(mapping))
shape: (4, 2)
┌─────┬──────────┐
│ a   ┆ replaced │
│ --- ┆ ---      │
│ i64 ┆ i64      │
╞═════╪══════════╡
│ 1   ┆ 1        │
│ 2   ┆ 100      │
│ 2   ┆ 100      │
│ 3   ┆ 200      │
└─────┴──────────┘
The original data type is preserved when replacing by values of a different data type. Use replace_strict() to replace and change the return data type.
>>> df = pl.DataFrame({"a": ["x", "y", "z"]})
>>> mapping = {"x": 1, "y": 2, "z": 3}
>>> df.with_columns(replaced=pl.col("a").replace(mapping))
shape: (3, 2)
┌─────┬──────────┐
│ a   ┆ replaced │
│ --- ┆ ---      │
│ str ┆ str      │
╞═════╪══════════╡
│ x   ┆ 1        │
│ y   ┆ 2        │
│ z   ┆ 3        │
└─────┴──────────┘
Expression input is supported.
>>> df = pl.DataFrame({"a": [1, 2, 2, 3], "b": [1.5, 2.5, 5.0, 1.0]})
>>> df.with_columns(
...     replaced=pl.col("a").replace(
...         old=pl.col("a").max(),
...         new=pl.col("b").sum(),
...     )
... )
shape: (4, 3)
┌─────┬─────┬──────────┐
│ a   ┆ b   ┆ replaced │
│ --- ┆ --- ┆ ---      │
│ i64 ┆ f64 ┆ i64      │
╞═════╪═════╪══════════╡
│ 1   ┆ 1.5 ┆ 1        │
│ 2   ┆ 2.5 ┆ 2        │
│ 2   ┆ 5.0 ┆ 2        │
│ 3   ┆ 1.0 ┆ 10       │
└─────┴─────┴──────────┘
- replace_strict(old: IntoExpr | Sequence[Any] | Mapping[Any, Any], new: IntoExpr | Sequence[Any] | NoDefault = <no_default>, *, default: IntoExpr | NoDefault = <no_default>, return_dtype: PolarsDataType | DataTypeExpr | None = None) Expr[source]
Replace all values by different values.
- Parameters:
- old
Value or sequence of values to replace. Accepts expression input. Sequences are parsed as Series, other non-expression inputs are parsed as literals. Also accepts a mapping of values to their replacement as syntactic sugar for replace_strict(old=Series(mapping.keys()), new=Series(mapping.values())).
- new
Value or sequence of values to replace by. Accepts expression input. Sequences are parsed as Series, other non-expression inputs are parsed as literals. Length must match the length of old or have length 1.
- default
Set values that were not replaced to this value. If no default is specified (the default), an error is raised if any values were not replaced. Accepts expression input. Non-expression inputs are parsed as literals.
- return_dtype
The data type of the resulting expression. If set to None (default), the data type is determined automatically based on the other inputs.
- Raises:
- InvalidOperationError
If any non-null values in the original column were not replaced, and no default was specified.
Examples
Replace values by passing sequences to the old and new parameters.
>>> df = pl.DataFrame({"a": [1, 2, 2, 3]})
>>> df.with_columns(
...     replaced=pl.col("a").replace_strict([1, 2, 3], [100, 200, 300])
... )
shape: (4, 2)
┌─────┬──────────┐
│ a   ┆ replaced │
│ --- ┆ ---      │
│ i64 ┆ i64      │
╞═════╪══════════╡
│ 1   ┆ 100      │
│ 2   ┆ 200      │
│ 2   ┆ 200      │
│ 3   ┆ 300      │
└─────┴──────────┘
Passing a mapping with replacements is also supported as syntactic sugar.
>>> mapping = {1: 100, 2: 200, 3: 300}
>>> df.with_columns(replaced=pl.col("a").replace_strict(mapping))
shape: (4, 2)
┌─────┬──────────┐
│ a   ┆ replaced │
│ --- ┆ ---      │
│ i64 ┆ i64      │
╞═════╪══════════╡
│ 1   ┆ 100      │
│ 2   ┆ 200      │
│ 2   ┆ 200      │
│ 3   ┆ 300      │
└─────┴──────────┘
By default, an error is raised if any non-null values were not replaced. Specify a default to set all values that were not matched.
>>> mapping = {2: 200, 3: 300}
>>> df.with_columns(
...     replaced=pl.col("a").replace_strict(mapping)
... )
Traceback (most recent call last):
...
polars.exceptions.InvalidOperationError: incomplete mapping specified for `replace_strict`
>>> df.with_columns(replaced=pl.col("a").replace_strict(mapping, default=-1))
shape: (4, 2)
┌─────┬──────────┐
│ a   ┆ replaced │
│ --- ┆ ---      │
│ i64 ┆ i64      │
╞═════╪══════════╡
│ 1   ┆ -1       │
│ 2   ┆ 200      │
│ 2   ┆ 200      │
│ 3   ┆ 300      │
└─────┴──────────┘
Replacing by values of a different data type sets the return type based on a combination of the new data type and the default data type.
>>> df = pl.DataFrame({"a": ["x", "y", "z"]})
>>> mapping = {"x": 1, "y": 2, "z": 3}
>>> df.with_columns(replaced=pl.col("a").replace_strict(mapping))
shape: (3, 2)
┌─────┬──────────┐
│ a   ┆ replaced │
│ --- ┆ ---      │
│ str ┆ i64      │
╞═════╪══════════╡
│ x   ┆ 1        │
│ y   ┆ 2        │
│ z   ┆ 3        │
└─────┴──────────┘
>>> df.with_columns(replaced=pl.col("a").replace_strict(mapping, default="x"))
shape: (3, 2)
┌─────┬──────────┐
│ a   ┆ replaced │
│ --- ┆ ---      │
│ str ┆ str      │
╞═════╪══════════╡
│ x   ┆ 1        │
│ y   ┆ 2        │
│ z   ┆ 3        │
└─────┴──────────┘
Set the return_dtype parameter to control the resulting data type directly.
>>> df.with_columns(
...     replaced=pl.col("a").replace_strict(mapping, return_dtype=pl.UInt8)
... )
shape: (3, 2)
┌─────┬──────────┐
│ a   ┆ replaced │
│ --- ┆ ---      │
│ str ┆ u8       │
╞═════╪══════════╡
│ x   ┆ 1        │
│ y   ┆ 2        │
│ z   ┆ 3        │
└─────┴──────────┘
Expression input is supported for all parameters.
>>> df = pl.DataFrame({"a": [1, 2, 2, 3], "b": [1.5, 2.5, 5.0, 1.0]})
>>> df.with_columns(
...     replaced=pl.col("a").replace_strict(
...         old=pl.col("a").max(),
...         new=pl.col("b").sum(),
...         default=pl.col("b"),
...     )
... )
shape: (4, 3)
┌─────┬─────┬──────────┐
│ a   ┆ b   ┆ replaced │
│ --- ┆ --- ┆ ---      │
│ i64 ┆ f64 ┆ f64      │
╞═════╪═════╪══════════╡
│ 1   ┆ 1.5 ┆ 1.5      │
│ 2   ┆ 2.5 ┆ 2.5      │
│ 2   ┆ 5.0 ┆ 5.0      │
│ 3   ┆ 1.0 ┆ 10.0     │
└─────┴─────┴──────────┘
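The difference between replace and replace_strict in how unmatched values are treated can be sketched with plain Python lists and a dict. This is an illustration of the documented behavior only, not the Polars implementation: the function names replace_keep and replace_all are hypothetical, null handling and data type inference are left out, and ValueError stands in for InvalidOperationError.

```python
_MISSING = object()  # sentinel meaning "no default given"


def replace_keep(values, mapping):
    """Like replace: values without a mapping entry are kept unchanged."""
    return [mapping.get(v, v) for v in values]


def replace_all(values, mapping, default=_MISSING):
    """Like replace_strict: every value must be mapped or fall back to default."""
    out = []
    for v in values:
        if v in mapping:
            out.append(mapping[v])
        elif default is not _MISSING:
            out.append(default)
        else:
            # Stand-in for polars.exceptions.InvalidOperationError.
            raise ValueError("incomplete mapping specified")
    return out


print(replace_keep([1, 2, 2, 3], {2: 100, 3: 200}))             # [1, 100, 100, 200]
print(replace_all([1, 2, 2, 3], {2: 200, 3: 300}, default=-1))  # [-1, 200, 200, 300]
```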
- reshape(dimensions: tuple[int, ...]) Expr[source]
Reshape this Expr to a flat column or an Array column.
- Parameters:
- dimensions
Tuple of the dimension sizes. If -1 is used as the value for the first dimension, that dimension is inferred. Because the size of the Column may not be known in advance, it is only possible to use -1 for the first dimension.
- Returns:
- Expr
If a single dimension is given, results in an expression of the original data type. If multiple dimensions are given, results in an expression of data type Array with shape dimensions.
See also
Expr.list.explode: Explode a list column.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3, 4, 5, 6, 7, 8, 9]})
>>> square = df.select(pl.col("foo").reshape((3, 3)))
>>> square
shape: (3, 1)
┌───────────────┐
│ foo           │
│ ---           │
│ array[i64, 3] │
╞═══════════════╡
│ [1, 2, 3]     │
│ [4, 5, 6]     │
│ [7, 8, 9]     │
└───────────────┘
>>> square.select(pl.col("foo").reshape((9,)))
shape: (9, 1)
┌─────┐
│ foo │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
│ 4   │
│ 5   │
│ 6   │
│ 7   │
│ 8   │
│ 9   │
└─────┘
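How a -1 in the first dimension is inferred can be sketched for the two-dimensional case with plain lists. This is a minimal illustration under that assumption, not Polars code; reshape_2d is a hypothetical helper.

```python
def reshape_2d(values, dimensions):
    """Split a flat list into rows of fixed width; rows may be given as -1."""
    rows, width = dimensions
    if rows == -1:
        # Infer the first dimension from the total length.
        if len(values) % width != 0:
            raise ValueError("length is not divisible by the row width")
        rows = len(values) // width
    if rows * width != len(values):
        raise ValueError("dimensions do not match the number of values")
    return [values[i * width:(i + 1) * width] for i in range(rows)]


print(reshape_2d([1, 2, 3, 4, 5, 6, 7, 8, 9], (-1, 3)))
# [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```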
- reverse() Expr[source]
Reverse the selection.
Examples
>>> df = pl.DataFrame(
...     {
...         "A": [1, 2, 3, 4, 5],
...         "fruits": ["banana", "banana", "apple", "apple", "banana"],
...         "B": [5, 4, 3, 2, 1],
...         "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
...     }
... )
>>> df.select(
...     [
...         pl.all(),
...         pl.all().reverse().name.suffix("_reverse"),
...     ]
... )
shape: (5, 8)
┌─────┬────────┬─────┬────────┬───────────┬────────────────┬───────────┬──────────────┐
│ A   ┆ fruits ┆ B   ┆ cars   ┆ A_reverse ┆ fruits_reverse ┆ B_reverse ┆ cars_reverse │
│ --- ┆ ---    ┆ --- ┆ ---    ┆ ---       ┆ ---            ┆ ---       ┆ ---          │
│ i64 ┆ str    ┆ i64 ┆ str    ┆ i64       ┆ str            ┆ i64       ┆ str          │
╞═════╪════════╪═════╪════════╪═══════════╪════════════════╪═══════════╪══════════════╡
│ 1   ┆ banana ┆ 5   ┆ beetle ┆ 5         ┆ banana         ┆ 1         ┆ beetle       │
│ 2   ┆ banana ┆ 4   ┆ audi   ┆ 4         ┆ apple          ┆ 2         ┆ beetle       │
│ 3   ┆ apple  ┆ 3   ┆ beetle ┆ 3         ┆ apple          ┆ 3         ┆ beetle       │
│ 4   ┆ apple  ┆ 2   ┆ beetle ┆ 2         ┆ banana         ┆ 4         ┆ audi         │
│ 5   ┆ banana ┆ 1   ┆ beetle ┆ 1         ┆ banana         ┆ 5         ┆ beetle       │
└─────┴────────┴─────┴────────┴───────────┴────────────────┴───────────┴──────────────┘
- rle() Expr[source]
Compress the column data using run-length encoding.
Run-length encoding (RLE) encodes data by storing each run of identical values as a single value and its length.
- Returns:
- Expr
Expression of data type Struct with fields len of data type UInt32 and value of the original data type.
Examples
>>> df = pl.DataFrame({"a": [1, 1, 2, 1, None, 1, 3, 3]})
>>> df.select(pl.col("a").rle()).unnest("a")
shape: (6, 2)
┌─────┬───────┐
│ len ┆ value │
│ --- ┆ ---   │
│ u32 ┆ i64   │
╞═════╪═══════╡
│ 2   ┆ 1     │
│ 1   ┆ 2     │
│ 1   ┆ 1     │
│ 1   ┆ null  │
│ 1   ┆ 1     │
│ 2   ┆ 3     │
└─────┴───────┘
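Run-length encoding itself can be sketched with itertools.groupby from the standard library. This only illustrates the encoding described above and is not how Polars computes it; the function name rle is reused for readability, and how Polars treats several consecutive nulls is not covered by the example, so the sketch makes no claim about that case.

```python
from itertools import groupby


def rle(values):
    """Return one {"len", "value"} record per run of equal adjacent values."""
    return [
        {"len": sum(1 for _ in run), "value": value}
        for value, run in groupby(values)
    ]


for record in rle([1, 1, 2, 1, None, 1, 3, 3]):
    print(record)
# {'len': 2, 'value': 1}
# {'len': 1, 'value': 2}
# {'len': 1, 'value': 1}
# {'len': 1, 'value': None}
# {'len': 1, 'value': 1}
# {'len': 2, 'value': 3}
```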
- rle_id() Expr[source]
Get a distinct integer ID for each run of identical values.
The ID starts at 0 and increases by one each time the value of the column changes.
- Returns:
- Expr
Expression of data type UInt32.
Notes
This functionality is especially useful for defining a new group for every time a column's value changes, rather than for every distinct value of that column.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 1, 1, 1],
...         "b": ["x", "x", None, "y", "y"],
...     }
... )
>>> df.with_columns(
...     rle_id_a=pl.col("a").rle_id(),
...     rle_id_ab=pl.struct("a", "b").rle_id(),
... )
shape: (5, 4)
┌─────┬──────┬──────────┬───────────┐
│ a   ┆ b    ┆ rle_id_a ┆ rle_id_ab │
│ --- ┆ ---  ┆ ---      ┆ ---       │
│ i64 ┆ str  ┆ u32      ┆ u32       │
╞═════╪══════╪══════════╪═══════════╡
│ 1   ┆ x    ┆ 0        ┆ 0         │
│ 2   ┆ x    ┆ 1        ┆ 1         │
│ 1   ┆ null ┆ 2        ┆ 2         │
│ 1   ┆ y    ┆ 2        ┆ 3         │
│ 1   ┆ y    ┆ 2        ┆ 3         │
└─────┴──────┴──────────┴───────────┘
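The run ID logic can be sketched in a few lines of plain Python. The sketch is illustrative only (it is not the Polars implementation), and it models the struct of columns a and b as tuples.

```python
def rle_id(values):
    """Assign 0 to the first run and add one each time the value changes."""
    ids = []
    current = -1
    previous = None
    for position, value in enumerate(values):
        if position == 0 or value != previous:
            current += 1
        ids.append(current)
        previous = value
    return ids


a = [1, 2, 1, 1, 1]
b = ["x", "x", None, "y", "y"]
print(rle_id(a))                # [0, 1, 2, 2, 2]
print(rle_id(list(zip(a, b))))  # [0, 1, 2, 3, 3]
```

Note that the value 1 appears in two separate runs of a and therefore gets two different IDs (0 and 2), which is what distinguishes this from grouping by distinct value.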
- rolling(index_column: IntoExprColumn, *, period: str_ | timedelta, offset: str_ | timedelta | None = None, closed: ClosedInterval = 'right') Expr[source]
Create rolling groups based on a temporal or integer column.
If you have a time series <t_0, t_1, ..., t_n>, then by default the windows created will be
(t_0 - period, t_0]
(t_1 - period, t_1]
…
(t_n - period, t_n]
whereas if you pass a non-default offset, then the windows will be
(t_0 + offset, t_0 + offset + period]
(t_1 + offset, t_1 + offset + period]
…
(t_n + offset, t_n + offset + period]
The period and offset arguments are created either from a timedelta, or by using the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight saving time). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
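To make the string language concrete, here is a small sketch that splits a duration string into (count, unit) pairs using a regular expression. It is an assumption-laden illustration rather than the Polars parser: parse_duration is a hypothetical name, negative durations are not handled, and the calendar-aware units are returned as labels only, not converted to fixed lengths.

```python
import re

# Two-letter units come first so that "ms" and "mo" are not read as "m".
_TOKEN = re.compile(r"(\d+)(ns|us|ms|mo|s|m|h|d|w|q|y|i)")


def parse_duration(text):
    """Split a string such as "3d12h4m25s" into (count, unit) pairs."""
    parts = []
    position = 0
    while position < len(text):
        match = _TOKEN.match(text, position)
        if match is None:
            raise ValueError(f"invalid duration string: {text!r}")
        parts.append((int(match.group(1)), match.group(2)))
        position = match.end()
    if not parts:
        raise ValueError("empty duration string")
    return parts


print(parse_duration("3d12h4m25s"))  # [(3, 'd'), (12, 'h'), (4, 'm'), (25, 's')]
print(parse_duration("1mo"))         # [(1, 'mo')]
```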
- Parameters:
- index_column
Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order. In case of a rolling group by on indices, dtype needs to be one of {UInt32, UInt64, Int32, Int64}. Note that the first three get temporarily cast to Int64, so if performance matters use an Int64 column.
- period
Length of the window - must be non-negative.
- offset
Offset of the window. Default is -period.
- closed : {'right', 'left', 'both', 'none'}
Define which sides of the temporal interval are closed (inclusive).
Examples
>>> dates = [
...     "2020-01-01 13:45:48",
...     "2020-01-01 16:42:13",
...     "2020-01-01 16:45:09",
...     "2020-01-02 18:12:48",
...     "2020-01-03 19:45:32",
...     "2020-01-08 23:16:43",
... ]
>>> df = pl.DataFrame({"dt": dates, "a": [3, 7, 5, 9, 2, 1]}).with_columns(
...     pl.col("dt").str.strptime(pl.Datetime).set_sorted()
... )
>>> df.with_columns(
...     sum_a=pl.sum("a").rolling(index_column="dt", period="2d"),
...     min_a=pl.min("a").rolling(index_column="dt", period="2d"),
...     max_a=pl.max("a").rolling(index_column="dt", period="2d"),
... )
shape: (6, 5)
┌─────────────────────┬─────┬───────┬───────┬───────┐
│ dt                  ┆ a   ┆ sum_a ┆ min_a ┆ max_a │
│ ---                 ┆ --- ┆ ---   ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ i64 ┆ i64   ┆ i64   ┆ i64   │
╞═════════════════════╪═════╪═══════╪═══════╪═══════╡
│ 2020-01-01 13:45:48 ┆ 3   ┆ 3     ┆ 3     ┆ 3     │
│ 2020-01-01 16:42:13 ┆ 7   ┆ 10    ┆ 3     ┆ 7     │
│ 2020-01-01 16:45:09 ┆ 5   ┆ 15    ┆ 3     ┆ 7     │
│ 2020-01-02 18:12:48 ┆ 9   ┆ 24    ┆ 3     ┆ 9     │
│ 2020-01-03 19:45:32 ┆ 2   ┆ 11    ┆ 2     ┆ 9     │
│ 2020-01-08 23:16:43 ┆ 1   ┆ 1     ┆ 1     ┆ 1     │
└─────────────────────┴─────┴───────┴───────┴───────┘
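The sum_a column above can be reproduced by hand from the window definition (t + offset, t + offset + period] with the default offset of -period. The sketch below does this with the standard library only. It is a quadratic-time illustration, not the Polars algorithm; rolling_sum is a hypothetical helper, and timedelta(days=2) is assumed to be an adequate stand-in for "2d" because the timestamps are naive (no time zone).

```python
from datetime import datetime, timedelta


def rolling_sum(times, values, period, offset=None):
    """Sum the values whose timestamp lies in (t + offset, t + offset + period]."""
    if offset is None:
        offset = -period  # documented default
    sums = []
    for t in times:
        low, high = t + offset, t + offset + period
        sums.append(sum(v for u, v in zip(times, values) if low < u <= high))
    return sums


times = [
    datetime(2020, 1, 1, 13, 45, 48),
    datetime(2020, 1, 1, 16, 42, 13),
    datetime(2020, 1, 1, 16, 45, 9),
    datetime(2020, 1, 2, 18, 12, 48),
    datetime(2020, 1, 3, 19, 45, 32),
    datetime(2020, 1, 8, 23, 16, 43),
]
print(rolling_sum(times, [3, 7, 5, 9, 2, 1], timedelta(days=2)))
# [3, 10, 15, 24, 11, 1]
```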
- rolling_kurtosis(window_size: int, *, fisher: bool = True, bias: bool = True, min_samples: int | None = None, center: bool = False) Expr[source]
Compute a rolling kurtosis.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
The window at a given row will include the row itself, and the window_size - 1 elements before it.
- Parameters:
- window_size
Integer size of the rolling window.
- fisher : bool, optional
If True, Fisher's definition is used (normal ==> 0.0). If False, Pearson's definition is used (normal ==> 3.0).
- bias : bool, optional
If False, the calculations are corrected for statistical bias.
- min_samples
The number of values in the window that should be non-null before computing a result. If set to None (default), it will be set equal to window_size.
- center
Set the labels at the center of the window.
Examples
>>> df = pl.DataFrame({"a": [1, 4, 2, 9]})
>>> df.select(pl.col("a").rolling_kurtosis(3))
shape: (4, 1)
┌──────┐
│ a    │
│ ---  │
│ f64  │
╞══════╡
│ null │
│ null │
│ -1.5 │
│ -1.5 │
└──────┘
- rolling_map(function: Callable[[Series], Any], window_size: int, weights: list_[float] | None = None, *, min_samples: int | None = None, center: bool = False) Expr[source]
Compute a custom rolling window function.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- function
Custom aggregation function.
- window_size
The length of the window in number of elements.
- weights
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window.
- min_samples
The number of values in the window that should be non-null before computing a result. If set to None (default), it will be set equal to window_size.
- center
Set the labels at the center of the window.
Warning
Computing custom functions is extremely slow. Use specialized rolling functions such as Expr.rolling_sum() if at all possible.
Examples
>>> from numpy import nansum
>>> df = pl.DataFrame({"a": [11.0, 2.0, 9.0, float("nan"), 8.0]})
>>> df.select(pl.col("a").rolling_map(nansum, window_size=3))
shape: (5, 1)
┌──────┐
│ a    │
│ ---  │
│ f64  │
╞══════╡
│ null │
│ null │
│ 22.0 │
│ 11.0 │
│ 17.0 │
└──────┘
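The interaction of window_size and min_samples can be sketched in plain Python. This is a simplified illustration, not the Polars implementation: weights and center are omitted, the custom function receives a list rather than a Series, and nan_sum is a hypothetical stand-in for numpy's nansum.

```python
import math


def rolling_map(function, values, window_size, min_samples=None):
    """Apply function to each trailing window with enough non-null values."""
    if min_samples is None:
        min_samples = window_size  # documented default
    results = []
    for i in range(len(values)):
        window = values[max(0, i - window_size + 1):i + 1]
        non_null = [v for v in window if v is not None]
        # NaN is a regular float here, so it counts as non-null.
        results.append(function(non_null) if len(non_null) >= min_samples else None)
    return results


def nan_sum(window):
    """Sum the window while skipping NaN values."""
    return math.fsum(v for v in window if not math.isnan(v))


data = [11.0, 2.0, 9.0, float("nan"), 8.0]
print(rolling_map(nan_sum, data, window_size=3))
# [None, None, 22.0, 11.0, 17.0]
```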
- rolling_max(window_size: int, weights: list_[float] | None = None, *, min_samples: int | None = None, center: bool = False) Expr[source]
Apply a rolling max (moving max) over the values in this array.
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated to their max.
The window at a given row will include the row itself, and the window_size - 1 elements before it.
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- window_size
The length of the window in number of elements.
- weights
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window.
- min_samples
The number of values in the window that should be non-null before computing a result. If set to None (default), it will be set equal to window_size.
- center
Set the labels at the center of the window.
Notes
If you want to compute multiple aggregation statistics over the same dynamic window, consider using rolling - this method can cache the window size computation.
Examples
>>> df = pl.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
>>> df.with_columns(
...     rolling_max=pl.col("A").rolling_max(window_size=2),
... )
shape: (6, 2)
┌─────┬─────────────┐
│ A   ┆ rolling_max │
│ --- ┆ ---         │
│ f64 ┆ f64         │
╞═════╪═════════════╡
│ 1.0 ┆ null        │
│ 2.0 ┆ 2.0         │
│ 3.0 ┆ 3.0         │
│ 4.0 ┆ 4.0         │
│ 5.0 ┆ 5.0         │
│ 6.0 ┆ 6.0         │
└─────┴─────────────┘
Specify weights to multiply the values in the window with:
>>> df.with_columns(
...     rolling_max=pl.col("A").rolling_max(
...         window_size=2, weights=[0.25, 0.75]
...     ),
... )
shape: (6, 2)
┌─────┬─────────────┐
│ A   ┆ rolling_max │
│ --- ┆ ---         │
│ f64 ┆ f64         │
╞═════╪═════════════╡
│ 1.0 ┆ null        │
│ 2.0 ┆ 1.5         │
│ 3.0 ┆ 2.25        │
│ 4.0 ┆ 3.0         │
│ 5.0 ┆ 3.75        │
│ 6.0 ┆ 4.5         │
└─────┴─────────────┘
Center the values in the window
>>> df.with_columns(
...     rolling_max=pl.col("A").rolling_max(window_size=3, center=True),
... )
shape: (6, 2)
┌─────┬─────────────┐
│ A   ┆ rolling_max │
│ --- ┆ ---         │
│ f64 ┆ f64         │
╞═════╪═════════════╡
│ 1.0 ┆ null        │
│ 2.0 ┆ 3.0         │
│ 3.0 ┆ 4.0         │
│ 4.0 ┆ 5.0         │
│ 5.0 ┆ 6.0         │
│ 6.0 ┆ null        │
└─────┴─────────────┘
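The weighted example can be checked by hand: each value in the window is multiplied by its weight and the maximum of the products is taken. A minimal plain-Python sketch of that reading follows. It is not the Polars implementation; it ignores nulls, min_samples, and center, and reuses the name rolling_max only for readability.

```python
def rolling_max(values, window_size, weights=None):
    """Maximum of the (optionally weighted) values in each full trailing window."""
    if weights is None:
        weights = [1.0] * window_size
    results = []
    for i in range(len(values)):
        if i + 1 < window_size:
            results.append(None)  # not enough values yet
            continue
        window = values[i - window_size + 1:i + 1]
        results.append(max(w * v for w, v in zip(weights, window)))
    return results


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(rolling_max(data, 2))                        # [None, 2.0, 3.0, 4.0, 5.0, 6.0]
print(rolling_max(data, 2, weights=[0.25, 0.75]))  # [None, 1.5, 2.25, 3.0, 3.75, 4.5]
```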
- rolling_max_by(by: IntoExpr, window_size: timedelta | str_, *, min_samples: int = 1, closed: ClosedInterval = 'right') Expr[source]
Apply a rolling max based on another column.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Given a by column <t_0, t_1, ..., t_n>, then closed="right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- by
Should be DateTime, Date, UInt64, UInt32, Int64, or Int32 data type (note that the integral ones require using 'i' in window_size).
- window_size
The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight saving time - in cases of ambiguity, we follow RFC-5545 and preserve the DST fold of the original datetime). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
- min_samples
The number of values in the window that should be non-null before computing a result.
- closed : {'left', 'right', 'both', 'none'}
Define which sides of the temporal interval are closed (inclusive), defaults to 'right'.
Notes
If you want to compute multiple aggregation statistics over the same dynamic window, consider using rolling - this method can cache the window size computation.
Examples
Create a DataFrame with a datetime column and a row number column
>>> from datetime import timedelta, datetime
>>> start = datetime(2001, 1, 1)
>>> stop = datetime(2001, 1, 2)
>>> df_temporal = pl.DataFrame(
...     {"date": pl.datetime_range(start, stop, "1h", eager=True)}
... ).with_row_index()
>>> df_temporal
shape: (25, 2)
┌───────┬─────────────────────┐
│ index ┆ date                │
│ ---   ┆ ---                 │
│ u32   ┆ datetime[μs]        │
╞═══════╪═════════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 │
│ 1     ┆ 2001-01-01 01:00:00 │
│ 2     ┆ 2001-01-01 02:00:00 │
│ 3     ┆ 2001-01-01 03:00:00 │
│ 4     ┆ 2001-01-01 04:00:00 │
│ …     ┆ …                   │
│ 20    ┆ 2001-01-01 20:00:00 │
│ 21    ┆ 2001-01-01 21:00:00 │
│ 22    ┆ 2001-01-01 22:00:00 │
│ 23    ┆ 2001-01-01 23:00:00 │
│ 24    ┆ 2001-01-02 00:00:00 │
└───────┴─────────────────────┘
Compute the rolling max with the temporal windows closed on the right (default)
>>> df_temporal.with_columns(
...     rolling_row_max=pl.col("index").rolling_max_by("date", window_size="2h")
... )
shape: (25, 3)
┌───────┬─────────────────────┬─────────────────┐
│ index ┆ date                ┆ rolling_row_max │
│ ---   ┆ ---                 ┆ ---             │
│ u32   ┆ datetime[μs]        ┆ u32             │
╞═══════╪═════════════════════╪═════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 ┆ 0               │
│ 1     ┆ 2001-01-01 01:00:00 ┆ 1               │
│ 2     ┆ 2001-01-01 02:00:00 ┆ 2               │
│ 3     ┆ 2001-01-01 03:00:00 ┆ 3               │
│ 4     ┆ 2001-01-01 04:00:00 ┆ 4               │
│ …     ┆ …                   ┆ …               │
│ 20    ┆ 2001-01-01 20:00:00 ┆ 20              │
│ 21    ┆ 2001-01-01 21:00:00 ┆ 21              │
│ 22    ┆ 2001-01-01 22:00:00 ┆ 22              │
│ 23    ┆ 2001-01-01 23:00:00 ┆ 23              │
│ 24    ┆ 2001-01-02 00:00:00 ┆ 24              │
└───────┴─────────────────────┴─────────────────┘
Compute the rolling max with the closure of windows on both sides
>>> df_temporal.with_columns(
...     rolling_row_max=pl.col("index").rolling_max_by(
...         "date", window_size="2h", closed="both"
...     )
... )
shape: (25, 3)
┌───────┬─────────────────────┬─────────────────┐
│ index ┆ date                ┆ rolling_row_max │
│ ---   ┆ ---                 ┆ ---             │
│ u32   ┆ datetime[μs]        ┆ u32             │
╞═══════╪═════════════════════╪═════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 ┆ 0               │
│ 1     ┆ 2001-01-01 01:00:00 ┆ 1               │
│ 2     ┆ 2001-01-01 02:00:00 ┆ 2               │
│ 3     ┆ 2001-01-01 03:00:00 ┆ 3               │
│ 4     ┆ 2001-01-01 04:00:00 ┆ 4               │
│ …     ┆ …                   ┆ …               │
│ 20    ┆ 2001-01-01 20:00:00 ┆ 20              │
│ 21    ┆ 2001-01-01 21:00:00 ┆ 21              │
│ 22    ┆ 2001-01-01 22:00:00 ┆ 22              │
│ 23    ┆ 2001-01-01 23:00:00 ┆ 23              │
│ 24    ┆ 2001-01-02 00:00:00 ┆ 24              │
└───────┴─────────────────────┴─────────────────┘
- rolling_mean(window_size: int, weights: list_[float] | None = None, *, min_samples: int | None = None, center: bool = False) Expr[source]
Apply a rolling mean (moving mean) over the values in this array.
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated to their mean. Weights are normalized to sum to 1.
The window at a given row will include the row itself, and the window_size - 1 elements before it.
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- window_size
The length of the window in number of elements.
- weights
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window, after being normalized to sum to 1.
- min_samples
The number of values in the window that should be non-null before computing a result. If set to None (default), it will be set equal to window_size.
- center
Set the labels at the center of the window.
Notes
If you want to compute multiple aggregation statistics over the same dynamic window, consider using rolling - this method can cache the window size computation.
Examples
>>> df = pl.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
>>> df.with_columns(
...     rolling_mean=pl.col("A").rolling_mean(window_size=2),
... )
shape: (6, 2)
┌─────┬──────────────┐
│ A   ┆ rolling_mean │
│ --- ┆ ---          │
│ f64 ┆ f64          │
╞═════╪══════════════╡
│ 1.0 ┆ null         │
│ 2.0 ┆ 1.5          │
│ 3.0 ┆ 2.5          │
│ 4.0 ┆ 3.5          │
│ 5.0 ┆ 4.5          │
│ 6.0 ┆ 5.5          │
└─────┴──────────────┘
Specify weights to multiply the values in the window with:
>>> df.with_columns(
...     rolling_mean=pl.col("A").rolling_mean(
...         window_size=2, weights=[0.25, 0.75]
...     ),
... )
shape: (6, 2)
┌─────┬──────────────┐
│ A   ┆ rolling_mean │
│ --- ┆ ---          │
│ f64 ┆ f64          │
╞═════╪══════════════╡
│ 1.0 ┆ null         │
│ 2.0 ┆ 1.75         │
│ 3.0 ┆ 2.75         │
│ 4.0 ┆ 3.75         │
│ 5.0 ┆ 4.75         │
│ 6.0 ┆ 5.75         │
└─────┴──────────────┘
Center the values in the window
>>> df.with_columns(
...     rolling_mean=pl.col("A").rolling_mean(window_size=3, center=True),
... )
shape: (6, 2)
┌─────┬──────────────┐
│ A   ┆ rolling_mean │
│ --- ┆ ---          │
│ f64 ┆ f64          │
╞═════╪══════════════╡
│ 1.0 ┆ null         │
│ 2.0 ┆ 2.0          │
│ 3.0 ┆ 3.0          │
│ 4.0 ┆ 4.0          │
│ 5.0 ┆ 5.0          │
│ 6.0 ┆ null         │
└─────┴──────────────┘
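The effect of weights and center on the window can be sketched in plain Python. This is a hedged illustration, not the Polars implementation: nulls and min_samples are ignored, weight normalization is done by dividing the weighted sum by the total weight, and the centering rule (shifting the window forward by window_size // 2) is an assumption that matches the odd window size used in the example above.

```python
def rolling_mean(values, window_size, weights=None, center=False):
    """Weighted mean over each full window; None where the window is incomplete."""
    if weights is None:
        weights = [1.0] * window_size
    total_weight = sum(weights)
    shift = window_size // 2 if center else 0  # assumed centering rule
    results = []
    for i in range(len(values)):
        end = i + shift + 1          # exclusive end of the window
        start = end - window_size
        if start < 0 or end > len(values):
            results.append(None)
            continue
        window = values[start:end]
        weighted_sum = sum(w * v for w, v in zip(weights, window))
        results.append(weighted_sum / total_weight)
    return results


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(rolling_mean(data, 2))                        # [None, 1.5, 2.5, 3.5, 4.5, 5.5]
print(rolling_mean(data, 2, weights=[0.25, 0.75]))  # [None, 1.75, 2.75, 3.75, 4.75, 5.75]
print(rolling_mean(data, 3, center=True))           # [None, 2.0, 3.0, 4.0, 5.0, None]
```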
- rolling_mean_by(by: IntoExpr, window_size: timedelta | str_, *, min_samples: int = 1, closed: ClosedInterval = 'right') Expr[source]
Apply a rolling mean based on another column.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Given a by column <t_0, t_1, ..., t_n>, then closed="right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- by
Should be DateTime, Date, UInt64, UInt32, Int64, or Int32 data type (note that the integral ones require using 'i' in window_size).
- window_size
The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight saving time - in cases of ambiguity, we follow RFC-5545 and preserve the DST fold of the original datetime). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
- min_samples
The number of values in the window that should be non-null before computing a result.
- closed : {'left', 'right', 'both', 'none'}
Define which sides of the temporal interval are closed (inclusive), defaults to 'right'.
Notes
If you want to compute multiple aggregation statistics over the same dynamic window, consider using rolling - this method can cache the window size computation.
Examples
Create a DataFrame with a datetime column and a row number column
>>> from datetime import timedelta, datetime
>>> start = datetime(2001, 1, 1)
>>> stop = datetime(2001, 1, 2)
>>> df_temporal = pl.DataFrame(
...     {"date": pl.datetime_range(start, stop, "1h", eager=True)}
... ).with_row_index()
>>> df_temporal
shape: (25, 2)
┌───────┬─────────────────────┐
│ index ┆ date                │
│ ---   ┆ ---                 │
│ u32   ┆ datetime[μs]        │
╞═══════╪═════════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 │
│ 1     ┆ 2001-01-01 01:00:00 │
│ 2     ┆ 2001-01-01 02:00:00 │
│ 3     ┆ 2001-01-01 03:00:00 │
│ 4     ┆ 2001-01-01 04:00:00 │
│ …     ┆ …                   │
│ 20    ┆ 2001-01-01 20:00:00 │
│ 21    ┆ 2001-01-01 21:00:00 │
│ 22    ┆ 2001-01-01 22:00:00 │
│ 23    ┆ 2001-01-01 23:00:00 │
│ 24    ┆ 2001-01-02 00:00:00 │
└───────┴─────────────────────┘
Compute the rolling mean with the temporal windows closed on the right (default)
>>> df_temporal.with_columns(
...     rolling_row_mean=pl.col("index").rolling_mean_by(
...         "date", window_size="2h"
...     )
... )
shape: (25, 3)
┌───────┬─────────────────────┬──────────────────┐
│ index ┆ date                ┆ rolling_row_mean │
│ ---   ┆ ---                 ┆ ---              │
│ u32   ┆ datetime[μs]        ┆ f64              │
╞═══════╪═════════════════════╪══════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 ┆ 0.0              │
│ 1     ┆ 2001-01-01 01:00:00 ┆ 0.5              │
│ 2     ┆ 2001-01-01 02:00:00 ┆ 1.5              │
│ 3     ┆ 2001-01-01 03:00:00 ┆ 2.5              │
│ 4     ┆ 2001-01-01 04:00:00 ┆ 3.5              │
│ …     ┆ …                   ┆ …                │
│ 20    ┆ 2001-01-01 20:00:00 ┆ 19.5             │
│ 21    ┆ 2001-01-01 21:00:00 ┆ 20.5             │
│ 22    ┆ 2001-01-01 22:00:00 ┆ 21.5             │
│ 23    ┆ 2001-01-01 23:00:00 ┆ 22.5             │
│ 24    ┆ 2001-01-02 00:00:00 ┆ 23.5             │
└───────┴─────────────────────┴──────────────────┘
Compute the rolling mean with the closure of windows on both sides
>>> df_temporal.with_columns(
...     rolling_row_mean=pl.col("index").rolling_mean_by(
...         "date", window_size="2h", closed="both"
...     )
... )
shape: (25, 3)
┌───────┬─────────────────────┬──────────────────┐
│ index ┆ date                ┆ rolling_row_mean │
│ ---   ┆ ---                 ┆ ---              │
│ u32   ┆ datetime[μs]        ┆ f64              │
╞═══════╪═════════════════════╪══════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 ┆ 0.0              │
│ 1     ┆ 2001-01-01 01:00:00 ┆ 0.5              │
│ 2     ┆ 2001-01-01 02:00:00 ┆ 1.0              │
│ 3     ┆ 2001-01-01 03:00:00 ┆ 2.0              │
│ 4     ┆ 2001-01-01 04:00:00 ┆ 3.0              │
│ …     ┆ …                   ┆ …                │
│ 20    ┆ 2001-01-01 20:00:00 ┆ 19.0             │
│ 21    ┆ 2001-01-01 21:00:00 ┆ 20.0             │
│ 22    ┆ 2001-01-01 22:00:00 ┆ 21.0             │
│ 23    ┆ 2001-01-01 23:00:00 ┆ 22.0             │
│ 24    ┆ 2001-01-02 00:00:00 ┆ 23.0             │
└───────┴─────────────────────┴──────────────────┘
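The difference between the two outputs above comes down to whether the left edge t - window_size belongs to the window. The following standard-library sketch reproduces both columns. It is a quadratic-time illustration and not the Polars algorithm; the helper _in_window is hypothetical, and a fixed timedelta is assumed to be equivalent to "2h" for these naive timestamps.

```python
from datetime import datetime, timedelta


def _in_window(u, low, high, closed):
    """Check membership of u in the interval from low to high for a closure mode."""
    left_ok = u >= low if closed in ("left", "both") else u > low
    right_ok = u <= high if closed in ("right", "both") else u < high
    return left_ok and right_ok


def rolling_mean_by(values, by, window_size, closed="right"):
    """Mean of the values whose by entry falls in the window ending at each t."""
    means = []
    for t in by:
        window = [
            v for u, v in zip(by, values)
            if _in_window(u, t - window_size, t, closed)
        ]
        means.append(sum(window) / len(window) if window else None)
    return means


start = datetime(2001, 1, 1)
dates = [start + timedelta(hours=h) for h in range(25)]
index = list(range(25))
two_hours = timedelta(hours=2)
print(rolling_mean_by(index, dates, two_hours)[:5])                 # [0.0, 0.5, 1.5, 2.5, 3.5]
print(rolling_mean_by(index, dates, two_hours, closed="both")[:5])  # [0.0, 0.5, 1.0, 2.0, 3.0]
```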
- rolling_median(window_size: int, weights: list_[float] | None = None, *, min_samples: int | None = None, center: bool = False) Expr[source]
Compute a rolling median.
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated to their median.
The window at a given row will include the row itself, and the window_size - 1 elements before it.
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- window_size
The length of the window in number of elements.
- weights
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window.
- min_samples
The number of values in the window that should be non-null before computing a result. If set to None (default), it will be set equal to window_size.
- center
Set the labels at the center of the window.
Notes
If you want to compute multiple aggregation statistics over the same dynamic window, consider using rolling - this method can cache the window size computation.
Examples
>>> df = pl.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
>>> df.with_columns(
...     rolling_median=pl.col("A").rolling_median(window_size=2),
... )
shape: (6, 2)
┌─────┬────────────────┐
│ A   ┆ rolling_median │
│ --- ┆ ---            │
│ f64 ┆ f64            │
╞═════╪════════════════╡
│ 1.0 ┆ null           │
│ 2.0 ┆ 1.5            │
│ 3.0 ┆ 2.5            │
│ 4.0 ┆ 3.5            │
│ 5.0 ┆ 4.5            │
│ 6.0 ┆ 5.5            │
└─────┴────────────────┘
Specify weights for the values in each window:
>>> df.with_columns(
...     rolling_median=pl.col("A").rolling_median(
...         window_size=2, weights=[0.25, 0.75]
...     ),
... )
shape: (6, 2)
┌─────┬────────────────┐
│ A   ┆ rolling_median │
│ --- ┆ ---            │
│ f64 ┆ f64            │
╞═════╪════════════════╡
│ 1.0 ┆ null           │
│ 2.0 ┆ 1.5            │
│ 3.0 ┆ 2.5            │
│ 4.0 ┆ 3.5            │
│ 5.0 ┆ 4.5            │
│ 6.0 ┆ 5.5            │
└─────┴────────────────┘
Center the values in the window
>>> df.with_columns(
...     rolling_median=pl.col("A").rolling_median(window_size=3, center=True),
... )
shape: (6, 2)
┌─────┬────────────────┐
│ A   ┆ rolling_median │
│ --- ┆ ---            │
│ f64 ┆ f64            │
╞═════╪════════════════╡
│ 1.0 ┆ null           │
│ 2.0 ┆ 2.0            │
│ 3.0 ┆ 3.0            │
│ 4.0 ┆ 4.0            │
│ 5.0 ┆ 5.0            │
│ 6.0 ┆ null           │
└─────┴────────────────┘
- rolling_median_by(by: IntoExpr, window_size: timedelta | str_, *, min_samples: int = 1, closed: ClosedInterval = 'right') Expr[source]
Compute a rolling median based on another column.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Given a by column <t_0, t_1, ..., t_n>, then closed="right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- by
Should be
DateTime,Date,UInt64,UInt32,Int64, orInt32data type (note that the integral ones require using'i'inwindow size).- window_size
The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings - in cases of ambiguity, we follow RFC-5545 and preserve the DST fold of the original datetime). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
- min_samples
The number of values in the window that should be non-null before computing a result.
- closed{'left', 'right', 'both', 'none'}
Define which sides of the temporal interval are closed (inclusive), defaults to 'right'.
Notes
If you want to compute multiple aggregation statistics over the same dynamic window, consider using rolling - this method can cache the window size computation.
Examples
Create a DataFrame with a datetime column and a row number column:
>>> from datetime import timedelta, datetime
>>> start = datetime(2001, 1, 1)
>>> stop = datetime(2001, 1, 2)
>>> df_temporal = pl.DataFrame(
...     {"date": pl.datetime_range(start, stop, "1h", eager=True)}
... ).with_row_index()
>>> df_temporal
shape: (25, 2)
┌───────┬─────────────────────┐
│ index ┆ date                │
│ ---   ┆ ---                 │
│ u32   ┆ datetime[μs]        │
╞═══════╪═════════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 │
│ 1     ┆ 2001-01-01 01:00:00 │
│ 2     ┆ 2001-01-01 02:00:00 │
│ 3     ┆ 2001-01-01 03:00:00 │
│ 4     ┆ 2001-01-01 04:00:00 │
│ …     ┆ …                   │
│ 20    ┆ 2001-01-01 20:00:00 │
│ 21    ┆ 2001-01-01 21:00:00 │
│ 22    ┆ 2001-01-01 22:00:00 │
│ 23    ┆ 2001-01-01 23:00:00 │
│ 24    ┆ 2001-01-02 00:00:00 │
└───────┴─────────────────────┘
Compute the rolling median with the temporal windows closed on the right:
>>> df_temporal.with_columns(
...     rolling_row_median=pl.col("index").rolling_median_by(
...         "date", window_size="2h"
...     )
... )
shape: (25, 3)
┌───────┬─────────────────────┬────────────────────┐
│ index ┆ date                ┆ rolling_row_median │
│ ---   ┆ ---                 ┆ ---                │
│ u32   ┆ datetime[μs]        ┆ f64                │
╞═══════╪═════════════════════╪════════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 ┆ 0.0                │
│ 1     ┆ 2001-01-01 01:00:00 ┆ 0.5                │
│ 2     ┆ 2001-01-01 02:00:00 ┆ 1.5                │
│ 3     ┆ 2001-01-01 03:00:00 ┆ 2.5                │
│ 4     ┆ 2001-01-01 04:00:00 ┆ 3.5                │
│ …     ┆ …                   ┆ …                  │
│ 20    ┆ 2001-01-01 20:00:00 ┆ 19.5               │
│ 21    ┆ 2001-01-01 21:00:00 ┆ 20.5               │
│ 22    ┆ 2001-01-01 22:00:00 ┆ 21.5               │
│ 23    ┆ 2001-01-01 23:00:00 ┆ 22.5               │
│ 24    ┆ 2001-01-02 00:00:00 ┆ 23.5               │
└───────┴─────────────────────┴────────────────────┘
- rolling_min(
- window_size: int,
- weights: list_[float] | None = None,
- *,
- min_samples: int | None = None,
- center: bool = False,
Apply a rolling min (moving min) over the values in this array.
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated to their min.
The window at a given row will include the row itself, and the window_size - 1 elements before it.
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- window_size
The length of the window in number of elements.
- weights
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window.
- min_samples
The number of values in the window that should be non-null before computing a result. If set to None (default), it will be set equal to window_size.
- center
Set the labels at the center of the window.
Notes
If you want to compute multiple aggregation statistics over the same dynamic window, consider using rolling - this method can cache the window size computation.
Examples
>>> df = pl.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
>>> df.with_columns(
...     rolling_min=pl.col("A").rolling_min(window_size=2),
... )
shape: (6, 2)
┌─────┬─────────────┐
│ A   ┆ rolling_min │
│ --- ┆ ---         │
│ f64 ┆ f64         │
╞═════╪═════════════╡
│ 1.0 ┆ null        │
│ 2.0 ┆ 1.0         │
│ 3.0 ┆ 2.0         │
│ 4.0 ┆ 3.0         │
│ 5.0 ┆ 4.0         │
│ 6.0 ┆ 5.0         │
└─────┴─────────────┘
Specify weights to multiply the values in the window with:
>>> df.with_columns(
...     rolling_min=pl.col("A").rolling_min(
...         window_size=2, weights=[0.25, 0.75]
...     ),
... )
shape: (6, 2)
┌─────┬─────────────┐
│ A   ┆ rolling_min │
│ --- ┆ ---         │
│ f64 ┆ f64         │
╞═════╪═════════════╡
│ 1.0 ┆ null        │
│ 2.0 ┆ 0.25        │
│ 3.0 ┆ 0.5         │
│ 4.0 ┆ 0.75        │
│ 5.0 ┆ 1.0         │
│ 6.0 ┆ 1.25        │
└─────┴─────────────┘
Center the values in the window:
>>> df.with_columns(
...     rolling_min=pl.col("A").rolling_min(window_size=3, center=True),
... )
shape: (6, 2)
┌─────┬─────────────┐
│ A   ┆ rolling_min │
│ --- ┆ ---         │
│ f64 ┆ f64         │
╞═════╪═════════════╡
│ 1.0 ┆ null        │
│ 2.0 ┆ 1.0         │
│ 3.0 ┆ 2.0         │
│ 4.0 ┆ 3.0         │
│ 5.0 ┆ 4.0         │
│ 6.0 ┆ null        │
└─────┴─────────────┘
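The weighting step for rolling_min can be checked by hand. The following plain-Python sketch is not Polars' implementation: rolling_min_ref is a hypothetical helper that assumes no nulls and the default min_samples (equal to window_size). It multiplies each trailing window elementwise by the weights and then takes the minimum, which reproduces the weighted output in the example for this method.

```python
def rolling_min_ref(values, window_size, weights=None):
    """Reference sketch of a trailing rolling min with optional weights.

    Hypothetical helper for illustration only: assumes no nulls and
    min_samples equal to window_size.
    """
    if weights is None:
        weights = [1.0] * window_size
    if len(weights) != window_size:
        raise ValueError("weights must have the same length as the window")
    out = []
    for i in range(len(values)):
        if i + 1 < window_size:
            # Not enough rows yet to fill the window.
            out.append(None)
            continue
        # The row itself plus the window_size - 1 elements before it.
        window = values[i - window_size + 1 : i + 1]
        out.append(min(v * w for v, w in zip(window, weights)))
    return out


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
unweighted = rolling_min_ref(data, 2)
weighted = rolling_min_ref(data, 2, weights=[0.25, 0.75])
print(unweighted)
print(weighted)
```

For the window [1.0, 2.0] the weighted values are 0.25 and 1.5, so the minimum is 0.25 rather than the unweighted 1.0.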
- rolling_min_by(
- by: IntoExpr,
- window_size: timedelta | str_,
- *,
- min_samples: int = 1,
- closed: ClosedInterval = 'right',
Apply a rolling min based on another column.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Given a by column <t_0, t_1, ..., t_n>, then closed="right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- by
Should be Datetime, Date, UInt64, UInt32, Int64, or Int32 data type (note that the integral ones require using 'i' in window_size).
- window_size
The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings - in cases of ambiguity, we follow RFC-5545 and preserve the DST fold of the original datetime). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
- min_samples
The number of values in the window that should be non-null before computing a result.
- closed{'left', 'right', 'both', 'none'}
Define which sides of the temporal interval are closed (inclusive), defaults to 'right'.
Notes
If you want to compute multiple aggregation statistics over the same dynamic window, consider using rolling - this method can cache the window size computation.
Examples
Create a DataFrame with a datetime column and a row number column:
>>> from datetime import timedelta, datetime
>>> start = datetime(2001, 1, 1)
>>> stop = datetime(2001, 1, 2)
>>> df_temporal = pl.DataFrame(
...     {"date": pl.datetime_range(start, stop, "1h", eager=True)}
... ).with_row_index()
>>> df_temporal
shape: (25, 2)
┌───────┬─────────────────────┐
│ index ┆ date                │
│ ---   ┆ ---                 │
│ u32   ┆ datetime[μs]        │
╞═══════╪═════════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 │
│ 1     ┆ 2001-01-01 01:00:00 │
│ 2     ┆ 2001-01-01 02:00:00 │
│ 3     ┆ 2001-01-01 03:00:00 │
│ 4     ┆ 2001-01-01 04:00:00 │
│ …     ┆ …                   │
│ 20    ┆ 2001-01-01 20:00:00 │
│ 21    ┆ 2001-01-01 21:00:00 │
│ 22    ┆ 2001-01-01 22:00:00 │
│ 23    ┆ 2001-01-01 23:00:00 │
│ 24    ┆ 2001-01-02 00:00:00 │
└───────┴─────────────────────┘
Compute the rolling min with the temporal windows closed on the right (default):
>>> df_temporal.with_columns(
...     rolling_row_min=pl.col("index").rolling_min_by("date", window_size="2h")
... )
shape: (25, 3)
┌───────┬─────────────────────┬─────────────────┐
│ index ┆ date                ┆ rolling_row_min │
│ ---   ┆ ---                 ┆ ---             │
│ u32   ┆ datetime[μs]        ┆ u32             │
╞═══════╪═════════════════════╪═════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 ┆ 0               │
│ 1     ┆ 2001-01-01 01:00:00 ┆ 0               │
│ 2     ┆ 2001-01-01 02:00:00 ┆ 1               │
│ 3     ┆ 2001-01-01 03:00:00 ┆ 2               │
│ 4     ┆ 2001-01-01 04:00:00 ┆ 3               │
│ …     ┆ …                   ┆ …               │
│ 20    ┆ 2001-01-01 20:00:00 ┆ 19              │
│ 21    ┆ 2001-01-01 21:00:00 ┆ 20              │
│ 22    ┆ 2001-01-01 22:00:00 ┆ 21              │
│ 23    ┆ 2001-01-01 23:00:00 ┆ 22              │
│ 24    ┆ 2001-01-02 00:00:00 ┆ 23              │
└───────┴─────────────────────┴─────────────────┘
- rolling_quantile(
- quantile: float,
- interpolation: QuantileMethod = 'nearest',
- window_size: int = 2,
- weights: list_[float] | None = None,
- *,
- min_samples: int | None = None,
- center: bool = False,
Compute a rolling quantile.
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated to their quantile.
The window at a given row will include the row itself, and the window_size - 1 elements before it.
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- quantile
Quantile between 0.0 and 1.0.
- interpolation{'nearest', 'higher', 'lower', 'midpoint', 'linear', 'equiprobable'}
Interpolation method.
- window_size
The length of the window in number of elements.
- weights
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window.
- min_samples
The number of values in the window that should be non-null before computing a result. If set to None (default), it will be set equal to window_size.
- center
Set the labels at the center of the window.
Notes
If you want to compute multiple aggregation statistics over the same dynamic window, consider using rolling - this method can cache the window size computation.
Examples
>>> df = pl.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
>>> df.with_columns(
...     rolling_quantile=pl.col("A").rolling_quantile(
...         quantile=0.25, window_size=4
...     ),
... )
shape: (6, 2)
┌─────┬──────────────────┐
│ A   ┆ rolling_quantile │
│ --- ┆ ---              │
│ f64 ┆ f64              │
╞═════╪══════════════════╡
│ 1.0 ┆ null             │
│ 2.0 ┆ null             │
│ 3.0 ┆ null             │
│ 4.0 ┆ 2.0              │
│ 5.0 ┆ 3.0              │
│ 6.0 ┆ 4.0              │
└─────┴──────────────────┘
Specify weights for the values in each window:
>>> df.with_columns(
...     rolling_quantile=pl.col("A").rolling_quantile(
...         quantile=0.25, window_size=4, weights=[0.2, 0.4, 0.4, 0.2]
...     ),
... )
shape: (6, 2)
┌─────┬──────────────────┐
│ A   ┆ rolling_quantile │
│ --- ┆ ---              │
│ f64 ┆ f64              │
╞═════╪══════════════════╡
│ 1.0 ┆ null             │
│ 2.0 ┆ null             │
│ 3.0 ┆ null             │
│ 4.0 ┆ 2.0              │
│ 5.0 ┆ 3.0              │
│ 6.0 ┆ 4.0              │
└─────┴──────────────────┘
Specify weights and interpolation method:
>>> df.with_columns(
...     rolling_quantile=pl.col("A").rolling_quantile(
...         quantile=0.25,
...         window_size=4,
...         weights=[0.2, 0.4, 0.4, 0.2],
...         interpolation="linear",
...     ),
... )
shape: (6, 2)
┌─────┬──────────────────┐
│ A   ┆ rolling_quantile │
│ --- ┆ ---              │
│ f64 ┆ f64              │
╞═════╪══════════════════╡
│ 1.0 ┆ null             │
│ 2.0 ┆ null             │
│ 3.0 ┆ null             │
│ 4.0 ┆ 1.625            │
│ 5.0 ┆ 2.625            │
│ 6.0 ┆ 3.625            │
└─────┴──────────────────┘
Center the values in the window:
>>> df.with_columns(
...     rolling_quantile=pl.col("A").rolling_quantile(
...         quantile=0.2, window_size=5, center=True
...     ),
... )
shape: (6, 2)
┌─────┬──────────────────┐
│ A   ┆ rolling_quantile │
│ --- ┆ ---              │
│ f64 ┆ f64              │
╞═════╪══════════════════╡
│ 1.0 ┆ null             │
│ 2.0 ┆ null             │
│ 3.0 ┆ 2.0              │
│ 4.0 ┆ 3.0              │
│ 5.0 ┆ null             │
│ 6.0 ┆ null             │
└─────┴──────────────────┘
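To make the interpolation options concrete, here is a plain-Python sketch for a single unweighted window. It is not Polars' implementation: quantile_ref is a hypothetical helper, and it assumes the common index-based definition in which the fractional position on the sorted values is quantile * (n - 1). It does not cover weights or the 'equiprobable' method. Under that assumption it agrees with the unweighted 'nearest' results in the examples for this method.

```python
import math


def quantile_ref(window, q, interpolation="nearest"):
    """Quantile of one window under an index-based definition.

    Assumption (not taken from the Polars source): the fractional
    position is q * (n - 1) on the sorted values.
    """
    xs = sorted(window)
    pos = q * (len(xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    if interpolation == "lower":
        return xs[lo]
    if interpolation == "higher":
        return xs[hi]
    if interpolation == "nearest":
        # Exact .5 ties are not addressed by this sketch.
        return xs[round(pos)]
    if interpolation == "midpoint":
        return (xs[lo] + xs[hi]) / 2
    if interpolation == "linear":
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])
    raise ValueError(f"unsupported interpolation: {interpolation!r}")


window = [1.0, 2.0, 3.0, 4.0]
results = {
    name: quantile_ref(window, 0.25, name)
    for name in ("nearest", "lower", "higher", "midpoint", "linear")
}
print(results)
```

With quantile=0.25 and four values the position is 0.75, so 'nearest' and 'higher' pick the second value while 'lower' picks the first.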
- rolling_quantile_by(
- by: IntoExpr,
- window_size: timedelta | str_,
- *,
- quantile: float,
- interpolation: QuantileMethod = 'nearest',
- min_samples: int = 1,
- closed: ClosedInterval = 'right',
Compute a rolling quantile based on another column.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Given a by column <t_0, t_1, ..., t_n>, then closed="right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- by
Should be Datetime, Date, UInt64, UInt32, Int64, or Int32 data type (note that the integral ones require using 'i' in window_size).
- quantile
Quantile between 0.0 and 1.0.
- interpolation{'nearest', 'higher', 'lower', 'midpoint', 'linear', 'equiprobable'}
Interpolation method.
- window_size
The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings - in cases of ambiguity, we follow RFC-5545 and preserve the DST fold of the original datetime). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
- min_samples
The number of values in the window that should be non-null before computing a result.
- closed{'left', 'right', 'both', 'none'}
Define which sides of the temporal interval are closed (inclusive), defaults to 'right'.
Notes
If you want to compute multiple aggregation statistics over the same dynamic window, consider using rolling - this method can cache the window size computation.
Examples
Create a DataFrame with a datetime column and a row number column:
>>> from datetime import timedelta, datetime
>>> start = datetime(2001, 1, 1)
>>> stop = datetime(2001, 1, 2)
>>> df_temporal = pl.DataFrame(
...     {"date": pl.datetime_range(start, stop, "1h", eager=True)}
... ).with_row_index()
>>> df_temporal
shape: (25, 2)
┌───────┬─────────────────────┐
│ index ┆ date                │
│ ---   ┆ ---                 │
│ u32   ┆ datetime[μs]        │
╞═══════╪═════════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 │
│ 1     ┆ 2001-01-01 01:00:00 │
│ 2     ┆ 2001-01-01 02:00:00 │
│ 3     ┆ 2001-01-01 03:00:00 │
│ 4     ┆ 2001-01-01 04:00:00 │
│ …     ┆ …                   │
│ 20    ┆ 2001-01-01 20:00:00 │
│ 21    ┆ 2001-01-01 21:00:00 │
│ 22    ┆ 2001-01-01 22:00:00 │
│ 23    ┆ 2001-01-01 23:00:00 │
│ 24    ┆ 2001-01-02 00:00:00 │
└───────┴─────────────────────┘
Compute the rolling quantile with the temporal windows closed on the right:
>>> df_temporal.with_columns(
...     rolling_row_quantile=pl.col("index").rolling_quantile_by(
...         "date", window_size="2h", quantile=0.3
...     )
... )
shape: (25, 3)
┌───────┬─────────────────────┬──────────────────────┐
│ index ┆ date                ┆ rolling_row_quantile │
│ ---   ┆ ---                 ┆ ---                  │
│ u32   ┆ datetime[μs]        ┆ f64                  │
╞═══════╪═════════════════════╪══════════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 ┆ 0.0                  │
│ 1     ┆ 2001-01-01 01:00:00 ┆ 0.0                  │
│ 2     ┆ 2001-01-01 02:00:00 ┆ 1.0                  │
│ 3     ┆ 2001-01-01 03:00:00 ┆ 2.0                  │
│ 4     ┆ 2001-01-01 04:00:00 ┆ 3.0                  │
│ …     ┆ …                   ┆ …                    │
│ 20    ┆ 2001-01-01 20:00:00 ┆ 19.0                 │
│ 21    ┆ 2001-01-01 21:00:00 ┆ 20.0                 │
│ 22    ┆ 2001-01-01 22:00:00 ┆ 21.0                 │
│ 23    ┆ 2001-01-01 23:00:00 ┆ 22.0                 │
│ 24    ┆ 2001-01-02 00:00:00 ┆ 23.0                 │
└───────┴─────────────────────┴──────────────────────┘
- rolling_rank(
- window_size: int,
- method: RankMethod = 'average',
- *,
- seed: int | None = None,
- min_samples: int | None = None,
- center: bool = False,
Compute a rolling rank.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
A window of length window_size will traverse the array. The values that fill this window will be ranked according to the method parameter. The resulting values will be the rank of the value that is at the end of the sliding window.
- Parameters:
- window_size
Integer size of the rolling window.
- method{'average', 'min', 'max', 'dense', 'random'}
The method used to assign ranks to tied elements. The following methods are available (default is 'average'):
'average' : The average of the ranks that would have been assigned to all the tied values is assigned to each value.
'min' : The minimum of the ranks that would have been assigned to all the tied values is assigned to each value. (This is also referred to as "competition" ranking.)
'max' : The maximum of the ranks that would have been assigned to all the tied values is assigned to each value.
'dense' : Like 'min', but the rank of the next highest element is assigned the rank immediately after those assigned to the tied elements.
'random' : Choose a random rank for each value in a tie.
- seed
Random seed used when method='random'. If set to None (default), a random seed is generated for each rolling rank operation.
- min_samples
The number of values in the window that should be non-null before computing a result. If set to None (default), it will be set equal to window_size.
- center
Set the labels at the center of the window.
- Returns:
- Expr
An Expr of data type Float64 if method is "average", or of the index size type (see get_index_type()) otherwise.
Examples
>>> df = pl.DataFrame({"a": [1, 4, 4, 1, 9]})
>>> df.select(pl.col("a").rolling_rank(3, method="average"))
shape: (5, 1)
┌──────┐
│ a    │
│ ---  │
│ f64  │
╞══════╡
│ null │
│ null │
│ 2.5  │
│ 1.0  │
│ 3.0  │
└──────┘
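The tie-breaking methods are easier to compare side by side. The sketch below is plain Python rather than Polars' implementation: rolling_rank_ref is a hypothetical helper that assumes no nulls, the default min_samples, and a trailing (non-centered) window. It ranks the last value of each window, omits the 'random' method, and reproduces the 'average' output from the example for this method.

```python
def rolling_rank_ref(values, window_size, method="average"):
    """Rank of the last value in each trailing window.

    Hypothetical helper for illustration only: no nulls, full windows only.
    """
    out = []
    for i in range(len(values)):
        if i + 1 < window_size:
            out.append(None)
            continue
        window = values[i - window_size + 1 : i + 1]
        last = window[-1]
        less = sum(1 for v in window if v < last)
        equal = sum(1 for v in window if v == last)  # includes `last` itself
        if method == "min":
            rank = less + 1
        elif method == "max":
            rank = less + equal
        elif method == "average":
            # Mean of the 'min' and 'max' ranks of the tied group.
            rank = (2 * less + equal + 1) / 2
        elif method == "dense":
            # Count distinct smaller values, so ties do not leave gaps.
            rank = len({v for v in window if v < last}) + 1
        else:
            raise ValueError(f"unsupported method: {method!r}")
        out.append(rank)
    return out


a = [1, 4, 4, 1, 9]
ranks = {m: rolling_rank_ref(a, 3, m) for m in ("average", "min", "max", "dense")}
print(ranks)
```

In the window [1, 4, 4] the two tied values would occupy ranks 2 and 3, which gives 2.5 for 'average', 2 for 'min' and 'dense', and 3 for 'max'.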
- rolling_rank_by(
- by: IntoExpr,
- window_size: timedelta | str_,
- method: RankMethod = 'average',
- *,
- seed: int | None = None,
- min_samples: int = 1,
- closed: ClosedInterval = 'right',
Compute a rolling rank based on another column.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Given a by column <t_0, t_1, ..., t_n>, then closed="right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
- Parameters:
- by
Should be Datetime, Date, UInt64, UInt32, Int64, or Int32 data type (note that the integral ones require using 'i' in window_size).
- window_size
The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings - in cases of ambiguity, we follow RFC-5545 and preserve the DST fold of the original datetime). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
- method{'average', 'min', 'max', 'dense', 'random'}
The method used to assign ranks to tied elements. The following methods are available (default is 'average'):
'average' : The average of the ranks that would have been assigned to all the tied values is assigned to each value.
'min' : The minimum of the ranks that would have been assigned to all the tied values is assigned to each value. (This is also referred to as "competition" ranking.)
'max' : The maximum of the ranks that would have been assigned to all the tied values is assigned to each value.
'dense' : Like 'min', but the rank of the next highest element is assigned the rank immediately after those assigned to the tied elements.
'random' : Choose a random rank for each value in a tie.
- seed
Random seed used when method='random'. If set to None (default), a random seed is generated for each rolling rank operation.
- min_samples
The number of values in the window that should be non-null before computing a result.
- closed{'left', 'right', 'both', 'none'}
Define which sides of the temporal interval are closed (inclusive), defaults to 'right'.
- Returns:
- Expr
An Expr of data type Float64 if method is "average", or of the index size type (see get_index_type()) otherwise.
- rolling_skew( ) Expr[source]
Compute a rolling skew.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
The window at a given row will include the row itself, and the window_size - 1 elements before it.
- Parameters:
- window_size
Integer size of the rolling window.
- bias
If False, the calculations are corrected for statistical bias. The signature default is bias: bool = True.
- min_samples
The number of values in the window that should be non-null before computing a result. If set to None (default), it will be set equal to window_size.
- center
Set the labels at the center of the window.
Examples
>>> df = pl.DataFrame({"a": [1, 4, 2, 9]})
>>> df.select(pl.col("a").rolling_skew(3))
shape: (4, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ null     │
│ null     │
│ 0.381802 │
│ 0.47033  │
└──────────┘
Note how the values match the following:
>>> pl.Series([1, 4, 2]).skew(), pl.Series([4, 2, 9]).skew()
(0.38180177416060584, 0.47033046033698594)
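The two values above can be reproduced by hand with the biased (population) skewness, the third central moment divided by the second central moment raised to the power 1.5. The helper name skew_biased is hypothetical, and this plain-Python sketch covers only the bias=True case; it is a check of the arithmetic, not Polars' implementation.

```python
def skew_biased(values):
    """Biased sample skewness: m3 / m2 ** 1.5 (central moments over n)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# The two full windows of size 3 over [1, 4, 2, 9].
first = skew_biased([1, 4, 2])
second = skew_biased([4, 2, 9])
print(first, second)
```

For [4, 2, 9] the mean is 5, so m2 = 26/3 and m3 = 12, which gives approximately 0.47033.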
- rolling_std(
- window_size: int,
- weights: list_[float] | None = None,
- *,
- min_samples: int | None = None,
- center: bool = False,
- ddof: int = 1,
Compute a rolling standard deviation.
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated to their std. Weights are normalized to sum to 1.
The window at a given row will include the row itself, and the window_size - 1 elements before it.
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- window_size
The length of the window in number of elements.
- weights
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window after being normalized to sum to 1.
- min_samples
The number of values in the window that should be non-null before computing a result. If set to None (default), it will be set equal to window_size.
- center
Set the labels at the center of the window.
- ddof
"Delta Degrees of Freedom": The divisor for a length N window is N - ddof.
Notes
If you want to compute multiple aggregation statistics over the same dynamic window, consider using rolling - this method can cache the window size computation.
Examples
>>> df = pl.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
>>> df.with_columns(
...     rolling_std=pl.col("A").rolling_std(window_size=2),
... )
shape: (6, 2)
┌─────┬─────────────┐
│ A   ┆ rolling_std │
│ --- ┆ ---         │
│ f64 ┆ f64         │
╞═════╪═════════════╡
│ 1.0 ┆ null        │
│ 2.0 ┆ 0.707107    │
│ 3.0 ┆ 0.707107    │
│ 4.0 ┆ 0.707107    │
│ 5.0 ┆ 0.707107    │
│ 6.0 ┆ 0.707107    │
└─────┴─────────────┘
Specify weights to multiply the values in the window with:
>>> df.with_columns(
...     rolling_std=pl.col("A").rolling_std(
...         window_size=2, weights=[0.25, 0.75]
...     ),
... )
shape: (6, 2)
┌─────┬─────────────┐
│ A   ┆ rolling_std │
│ --- ┆ ---         │
│ f64 ┆ f64         │
╞═════╪═════════════╡
│ 1.0 ┆ null        │
│ 2.0 ┆ 0.433013    │
│ 3.0 ┆ 0.433013    │
│ 4.0 ┆ 0.433013    │
│ 5.0 ┆ 0.433013    │
│ 6.0 ┆ 0.433013    │
└─────┴─────────────┘
Center the values in the window:
>>> df.with_columns(
...     rolling_std=pl.col("A").rolling_std(window_size=3, center=True),
... )
shape: (6, 2)
┌─────┬─────────────┐
│ A   ┆ rolling_std │
│ --- ┆ ---         │
│ f64 ┆ f64         │
╞═════╪═════════════╡
│ 1.0 ┆ null        │
│ 2.0 ┆ 1.0         │
│ 3.0 ┆ 1.0         │
│ 4.0 ┆ 1.0         │
│ 5.0 ┆ 1.0         │
│ 6.0 ┆ null        │
└─────┴─────────────┘
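The role of ddof and of the normalized weights can be checked on a single window. The sketch below is plain Python with hypothetical helper names (std_ref, weighted_std_ref). The weighted formula, the square root of the weighted variance around the weighted mean with no ddof correction, is an assumption inferred from the documented output 0.433013 and is not taken from the Polars source.

```python
import math


def std_ref(window, ddof=1):
    """Standard deviation of one window with divisor N - ddof."""
    n = len(window)
    mean = sum(window) / n
    return math.sqrt(sum((v - mean) ** 2 for v in window) / (n - ddof))


def weighted_std_ref(window, weights):
    """Weighted std with weights normalized to sum to 1.

    Assumption: no ddof correction is applied in the weighted case.
    """
    total = sum(weights)
    w = [x / total for x in weights]
    mean = sum(wi * v for wi, v in zip(w, window))
    return math.sqrt(sum(wi * (v - mean) ** 2 for wi, v in zip(w, window)))


sample_std = std_ref([1.0, 2.0])               # ddof=1, as in the first example
population_std = std_ref([1.0, 2.0], ddof=0)   # divisor N instead of N - 1
weighted_std = weighted_std_ref([1.0, 2.0], [0.25, 0.75])
print(sample_std, population_std, weighted_std)
```

With ddof=1 the window [1.0, 2.0] has variance 0.5, hence the repeated 0.707107 in the first example; with ddof=0 the same window gives 0.5.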
- rolling_std_by(
- by: IntoExpr,
- window_size: timedelta | str_,
- *,
- min_samples: int = 1,
- closed: ClosedInterval = 'right',
- ddof: int = 1,
Compute a rolling standard deviation based on another column.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Given a by column <t_0, t_1, ..., t_n>, then closed="right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- by
Should be Datetime, Date, UInt64, UInt32, Int64, or Int32 data type (note that the integral ones require using 'i' in window_size).
- window_size
The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings - in cases of ambiguity, we follow RFC-5545 and preserve the DST fold of the original datetime). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
- min_samples
The number of values in the window that should be non-null before computing a result.
- closed{'left', 'right', 'both', 'none'}
Define which sides of the temporal interval are closed (inclusive), defaults to 'right'.
- ddof
"Delta Degrees of Freedom": The divisor for a length N window is N - ddof.
Notes
If you want to compute multiple aggregation statistics over the same dynamic window, consider using rolling - this method can cache the window size computation.
Examples
Create a DataFrame with a datetime column and a row number column:
>>> from datetime import timedelta, datetime
>>> start = datetime(2001, 1, 1)
>>> stop = datetime(2001, 1, 2)
>>> df_temporal = pl.DataFrame(
...     {"date": pl.datetime_range(start, stop, "1h", eager=True)}
... ).with_row_index()
>>> df_temporal
shape: (25, 2)
┌───────┬─────────────────────┐
│ index ┆ date                │
│ ---   ┆ ---                 │
│ u32   ┆ datetime[μs]        │
╞═══════╪═════════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 │
│ 1     ┆ 2001-01-01 01:00:00 │
│ 2     ┆ 2001-01-01 02:00:00 │
│ 3     ┆ 2001-01-01 03:00:00 │
│ 4     ┆ 2001-01-01 04:00:00 │
│ …     ┆ …                   │
│ 20    ┆ 2001-01-01 20:00:00 │
│ 21    ┆ 2001-01-01 21:00:00 │
│ 22    ┆ 2001-01-01 22:00:00 │
│ 23    ┆ 2001-01-01 23:00:00 │
│ 24    ┆ 2001-01-02 00:00:00 │
└───────┴─────────────────────┘
Compute the rolling std with the temporal windows closed on the right (default):
>>> df_temporal.with_columns(
...     rolling_row_std=pl.col("index").rolling_std_by("date", window_size="2h")
... )
shape: (25, 3)
┌───────┬─────────────────────┬─────────────────┐
│ index ┆ date                ┆ rolling_row_std │
│ ---   ┆ ---                 ┆ ---             │
│ u32   ┆ datetime[μs]        ┆ f64             │
╞═══════╪═════════════════════╪═════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 ┆ null            │
│ 1     ┆ 2001-01-01 01:00:00 ┆ 0.707107        │
│ 2     ┆ 2001-01-01 02:00:00 ┆ 0.707107        │
│ 3     ┆ 2001-01-01 03:00:00 ┆ 0.707107        │
│ 4     ┆ 2001-01-01 04:00:00 ┆ 0.707107        │
│ …     ┆ …                   ┆ …               │
│ 20    ┆ 2001-01-01 20:00:00 ┆ 0.707107        │
│ 21    ┆ 2001-01-01 21:00:00 ┆ 0.707107        │
│ 22    ┆ 2001-01-01 22:00:00 ┆ 0.707107        │
│ 23    ┆ 2001-01-01 23:00:00 ┆ 0.707107        │
│ 24    ┆ 2001-01-02 00:00:00 ┆ 0.707107        │
└───────┴─────────────────────┴─────────────────┘
Compute the rolling std with the windows closed on both sides:
>>> df_temporal.with_columns(
...     rolling_row_std=pl.col("index").rolling_std_by(
...         "date", window_size="2h", closed="both"
...     )
... )
shape: (25, 3)
┌───────┬─────────────────────┬─────────────────┐
│ index ┆ date                ┆ rolling_row_std │
│ ---   ┆ ---                 ┆ ---             │
│ u32   ┆ datetime[μs]        ┆ f64             │
╞═══════╪═════════════════════╪═════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 ┆ null            │
│ 1     ┆ 2001-01-01 01:00:00 ┆ 0.707107        │
│ 2     ┆ 2001-01-01 02:00:00 ┆ 1.0             │
│ 3     ┆ 2001-01-01 03:00:00 ┆ 1.0             │
│ 4     ┆ 2001-01-01 04:00:00 ┆ 1.0             │
│ …     ┆ …                   ┆ …               │
│ 20    ┆ 2001-01-01 20:00:00 ┆ 1.0             │
│ 21    ┆ 2001-01-01 21:00:00 ┆ 1.0             │
│ 22    ┆ 2001-01-01 22:00:00 ┆ 1.0             │
│ 23    ┆ 2001-01-01 23:00:00 ┆ 1.0             │
│ 24    ┆ 2001-01-02 00:00:00 ┆ 1.0             │
└───────┴─────────────────────┴─────────────────┘
- rolling_sum(
- window_size: int,
- weights: list_[float] | None = None,
- *,
- min_samples: int | None = None,
- center: bool = False,
Apply a rolling sum (moving sum) over the values in this array.
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated to their sum.
The window at a given row will include the row itself, and the window_size - 1 elements before it.
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- window_size
The length of the window in number of elements.
- weights
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window.
- min_samples
The number of values in the window that should be non-null before computing a result. If set to None (default), it will be set equal to window_size.
- center
Set the labels at the center of the window.
Notes
If you want to compute multiple aggregation statistics over the same dynamic window, consider using rolling - this method can cache the window size computation.
Examples
>>> df = pl.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
>>> df.with_columns(
...     rolling_sum=pl.col("A").rolling_sum(window_size=2),
... )
shape: (6, 2)
┌─────┬─────────────┐
│ A   ┆ rolling_sum │
│ --- ┆ ---         │
│ f64 ┆ f64         │
╞═════╪═════════════╡
│ 1.0 ┆ null        │
│ 2.0 ┆ 3.0         │
│ 3.0 ┆ 5.0         │
│ 4.0 ┆ 7.0         │
│ 5.0 ┆ 9.0         │
│ 6.0 ┆ 11.0        │
└─────┴─────────────┘
Specify weights to multiply the values in the window with:
>>> df.with_columns(
...     rolling_sum=pl.col("A").rolling_sum(
...         window_size=2, weights=[0.25, 0.75]
...     ),
... )
shape: (6, 2)
┌─────┬─────────────┐
│ A   ┆ rolling_sum │
│ --- ┆ ---         │
│ f64 ┆ f64         │
╞═════╪═════════════╡
│ 1.0 ┆ null        │
│ 2.0 ┆ 1.75        │
│ 3.0 ┆ 2.75        │
│ 4.0 ┆ 3.75        │
│ 5.0 ┆ 4.75        │
│ 6.0 ┆ 5.75        │
└─────┴─────────────┘
Center the values in the window:
>>> df.with_columns(
...     rolling_sum=pl.col("A").rolling_sum(window_size=3, center=True),
... )
shape: (6, 2)
┌─────┬─────────────┐
│ A   ┆ rolling_sum │
│ --- ┆ ---         │
│ f64 ┆ f64         │
╞═════╪═════════════╡
│ 1.0 ┆ null        │
│ 2.0 ┆ 6.0         │
│ 3.0 ┆ 9.0         │
│ 4.0 ┆ 12.0        │
│ 5.0 ┆ 15.0        │
│ 6.0 ┆ null        │
└─────┴─────────────┘
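The leading null in the first example comes from min_samples defaulting to window_size. The plain-Python sketch below illustrates that rule with a hypothetical helper, rolling_sum_ref; it is not Polars' implementation, it ignores weights and centering, and it assumes that nulls (None here) are skipped when summing.

```python
def rolling_sum_ref(values, window_size, min_samples=None):
    """Trailing rolling sum that honours a min_samples threshold.

    Hypothetical helper for illustration only; None stands in for null.
    """
    if min_samples is None:
        # Default: require a completely filled window.
        min_samples = window_size
    out = []
    for i in range(len(values)):
        start = max(0, i - window_size + 1)
        present = [v for v in values[start : i + 1] if v is not None]
        out.append(sum(present) if len(present) >= min_samples else None)
    return out


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
default_result = rolling_sum_ref(data, 2)
relaxed_result = rolling_sum_ref(data, 2, min_samples=1)
print(default_result)
print(relaxed_result)
```

Lowering min_samples to 1 lets the first row produce a result from the single value available instead of null.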
- rolling_sum_by(
- by: IntoExpr,
- window_size: timedelta | str_,
- *,
- min_samples: int = 1,
- closed: ClosedInterval = 'right',
Apply a rolling sum based on another column.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Given a by column <t_0, t_1, ..., t_n>, then closed="right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- window_size
The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings - in cases of ambiguity, we follow RFC-5545 and preserve the DST fold of the original datetime). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
- min_samples
The number of values in the window that should be non-null before computing a result.
- by
Should be Datetime, Date, UInt64, UInt32, Int64, or Int32 data type (note that the integral ones require using 'i' in window_size).
- closed{'left', 'right', 'both', 'none'}
Define which sides of the temporal interval are closed (inclusive), defaults to 'right'.
Notes
If you want to compute multiple aggregation statistics over the same dynamic window, consider using
rolling- this method can cache the window size computation.Examples
Create a DataFrame with a datetime column and a row number column
>>> from datetime import timedelta, datetime >>> start = datetime(2001, 1, 1) >>> stop = datetime(2001, 1, 2) >>> df_temporal = pl.DataFrame( ... {"date": pl.datetime_range(start, stop, "1h", eager=True)} ... ).with_row_index() >>> df_temporal shape: (25, 2) βββββββββ¬ββββββββββββββββββββββ β index β date β β --- β --- β β u32 β datetime[ΞΌs] β βββββββββͺββββββββββββββββββββββ‘ β 0 β 2001-01-01 00:00:00 β β 1 β 2001-01-01 01:00:00 β β 2 β 2001-01-01 02:00:00 β β 3 β 2001-01-01 03:00:00 β β 4 β 2001-01-01 04:00:00 β β β¦ β β¦ β β 20 β 2001-01-01 20:00:00 β β 21 β 2001-01-01 21:00:00 β β 22 β 2001-01-01 22:00:00 β β 23 β 2001-01-01 23:00:00 β β 24 β 2001-01-02 00:00:00 β βββββββββ΄ββββββββββββββββββββββ
Compute the rolling sum with the temporal windows closed on the right (default)
>>> df_temporal.with_columns(
...     rolling_row_sum=pl.col("index").rolling_sum_by("date", window_size="2h")
... )
shape: (25, 3)
┌───────┬─────────────────────┬─────────────────┐
│ index ┆ date                ┆ rolling_row_sum │
│ ---   ┆ ---                 ┆ ---             │
│ u32   ┆ datetime[μs]        ┆ u32             │
╞═══════╪═════════════════════╪═════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 ┆ 0               │
│ 1     ┆ 2001-01-01 01:00:00 ┆ 1               │
│ 2     ┆ 2001-01-01 02:00:00 ┆ 3               │
│ 3     ┆ 2001-01-01 03:00:00 ┆ 5               │
│ 4     ┆ 2001-01-01 04:00:00 ┆ 7               │
│ …     ┆ …                   ┆ …               │
│ 20    ┆ 2001-01-01 20:00:00 ┆ 39              │
│ 21    ┆ 2001-01-01 21:00:00 ┆ 41              │
│ 22    ┆ 2001-01-01 22:00:00 ┆ 43              │
│ 23    ┆ 2001-01-01 23:00:00 ┆ 45              │
│ 24    ┆ 2001-01-02 00:00:00 ┆ 47              │
└───────┴─────────────────────┴─────────────────┘
Compute the rolling sum with the closure of windows on both sides
>>> df_temporal.with_columns(
...     rolling_row_sum=pl.col("index").rolling_sum_by(
...         "date", window_size="2h", closed="both"
...     )
... )
shape: (25, 3)
┌───────┬─────────────────────┬─────────────────┐
│ index ┆ date                ┆ rolling_row_sum │
│ ---   ┆ ---                 ┆ ---             │
│ u32   ┆ datetime[μs]        ┆ u32             │
╞═══════╪═════════════════════╪═════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 ┆ 0               │
│ 1     ┆ 2001-01-01 01:00:00 ┆ 1               │
│ 2     ┆ 2001-01-01 02:00:00 ┆ 3               │
│ 3     ┆ 2001-01-01 03:00:00 ┆ 6               │
│ 4     ┆ 2001-01-01 04:00:00 ┆ 9               │
│ …     ┆ …                   ┆ …               │
│ 20    ┆ 2001-01-01 20:00:00 ┆ 57              │
│ 21    ┆ 2001-01-01 21:00:00 ┆ 60              │
│ 22    ┆ 2001-01-01 22:00:00 ┆ 63              │
│ 23    ┆ 2001-01-01 23:00:00 ┆ 66              │
│ 24    ┆ 2001-01-02 00:00:00 ┆ 69              │
└───────┴─────────────────────┴─────────────────┘
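The effect of closed on which rows fall inside each window can be reproduced with a plain-Python sketch. The helper name rolling_sum_by below is only illustrative; it is a quadratic scan that ignores nulls and min_samples, and it is not how Polars computes the result.

```python
from datetime import datetime, timedelta


def rolling_sum_by(values, times, window, closed="right"):
    """Sum the values whose timestamp lies in the window ending at each row.

    Illustrative only: ignores nulls and min_samples.
    """
    out = []
    for t in times:
        lo = t - window
        total = 0
        for v, s in zip(values, times):
            # The lower bound is inclusive only for "left" and "both".
            above = s >= lo if closed in ("left", "both") else s > lo
            # The upper bound is inclusive only for "right" and "both".
            below = s <= t if closed in ("right", "both") else s < t
            if above and below:
                total += v
        out.append(total)
    return out


start = datetime(2001, 1, 1)
times = [start + timedelta(hours=h) for h in range(25)]
index = list(range(25))

right = rolling_sum_by(index, times, timedelta(hours=2))
both = rolling_sum_by(index, times, timedelta(hours=2), closed="both")
print(right[:5], right[-1])  # [0, 1, 3, 5, 7] 47
print(both[:5], both[-1])  # [0, 1, 3, 6, 9] 69
```

With closed="both" the row exactly window_size earlier is included, which is why each sum gains one extra term compared with the default.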
- rolling_var(
- window_size: int,
- weights: list_[float] | None = None,
- *,
- min_samples: int | None = None,
- center: bool = False,
- ddof: int = 1,
Compute a rolling variance.
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated to their variance. Weights are normalized to sum to 1.
The window at a given row will include the row itself, and the window_size - 1 elements before it.
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- window_size
The length of the window in number of elements.
- weights
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window after being normalized to sum to 1.
- min_samples
The number of values in the window that should be non-null before computing a result. If set to None (default), it will be set equal to window_size.
- center
Set the labels at the center of the window.
- ddof
"Delta Degrees of Freedom": The divisor for a length N window is N - ddof
Notes
If you want to compute multiple aggregation statistics over the same dynamic window, consider using rolling - this method can cache the window size computation.
Examples
>>> df = pl.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
>>> df.with_columns(
...     rolling_var=pl.col("A").rolling_var(window_size=2),
... )
shape: (6, 2)
┌─────┬─────────────┐
│ A   ┆ rolling_var │
│ --- ┆ ---         │
│ f64 ┆ f64         │
╞═════╪═════════════╡
│ 1.0 ┆ null        │
│ 2.0 ┆ 0.5         │
│ 3.0 ┆ 0.5         │
│ 4.0 ┆ 0.5         │
│ 5.0 ┆ 0.5         │
│ 6.0 ┆ 0.5         │
└─────┴─────────────┘
Specify weights to multiply the values in the window with:
>>> df.with_columns(
...     rolling_var=pl.col("A").rolling_var(
...         window_size=2, weights=[0.25, 0.75]
...     ),
... )
shape: (6, 2)
┌─────┬─────────────┐
│ A   ┆ rolling_var │
│ --- ┆ ---         │
│ f64 ┆ f64         │
╞═════╪═════════════╡
│ 1.0 ┆ null        │
│ 2.0 ┆ 0.1875      │
│ 3.0 ┆ 0.1875      │
│ 4.0 ┆ 0.1875      │
│ 5.0 ┆ 0.1875      │
│ 6.0 ┆ 0.1875      │
└─────┴─────────────┘
Center the values in the window
>>> df.with_columns(
...     rolling_var=pl.col("A").rolling_var(window_size=3, center=True),
... )
shape: (6, 2)
┌─────┬─────────────┐
│ A   ┆ rolling_var │
│ --- ┆ ---         │
│ f64 ┆ f64         │
╞═════╪═════════════╡
│ 1.0 ┆ null        │
│ 2.0 ┆ 1.0         │
│ 3.0 ┆ 1.0         │
│ 4.0 ┆ 1.0         │
│ 5.0 ┆ 1.0         │
│ 6.0 ┆ null        │
└─────┴─────────────┘
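For intuition, the unweighted, uncentered case with ddof=1 can be sketched in plain Python. The function below is a hypothetical helper, not Polars code: it assumes there are no nulls, does not support weights or center, and fixes ddof at 1 by relying on statistics.variance.

```python
from statistics import variance


def rolling_var(values, window_size, min_samples=None):
    """Trailing-window sample variance (ddof=1); no weights, centering, or nulls."""
    if min_samples is None:
        min_samples = window_size  # mirrors the documented default
    out = []
    for i in range(len(values)):
        # The window holds the current row and the window_size - 1 rows before it.
        window = values[max(0, i - window_size + 1) : i + 1]
        # A sample variance needs at least two observations.
        if len(window) < max(min_samples, 2):
            out.append(None)
        else:
            out.append(variance(window))
    return out


print(rolling_var([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], window_size=2))
# [None, 0.5, 0.5, 0.5, 0.5, 0.5]
```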
- rolling_var_by(
- by: IntoExpr,
- window_size: timedelta | str_,
- *,
- min_samples: int = 1,
- closed: ClosedInterval = 'right',
- ddof: int = 1,
Compute a rolling variance based on another column.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Given a by column <t_0, t_1, ..., t_n>, then closed="right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
Changed in version 1.21.0: The min_periods parameter was renamed min_samples.
- Parameters:
- by
Should be DateTime, Date, UInt64, UInt32, Int64, or Int32 data type (note that the integral ones require using 'i' in window size).
- window_size
The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings - in cases of ambiguity, we follow RFC-5545 and preserve the DST fold of the original datetime). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
- min_samples
The number of values in the window that should be non-null before computing a result.
- closed{'left', 'right', 'both', 'none'}
Define which sides of the temporal interval are closed (inclusive), defaults to 'right'.
- ddof
"Delta Degrees of Freedom": The divisor for a length N window is N - ddof
Notes
If you want to compute multiple aggregation statistics over the same dynamic window, consider using rolling - this method can cache the window size computation.
Examples
Create a DataFrame with a datetime column and a row number column
>>> from datetime import timedelta, datetime
>>> start = datetime(2001, 1, 1)
>>> stop = datetime(2001, 1, 2)
>>> df_temporal = pl.DataFrame(
...     {"date": pl.datetime_range(start, stop, "1h", eager=True)}
... ).with_row_index()
>>> df_temporal
shape: (25, 2)
┌───────┬─────────────────────┐
│ index ┆ date                │
│ ---   ┆ ---                 │
│ u32   ┆ datetime[μs]        │
╞═══════╪═════════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 │
│ 1     ┆ 2001-01-01 01:00:00 │
│ 2     ┆ 2001-01-01 02:00:00 │
│ 3     ┆ 2001-01-01 03:00:00 │
│ 4     ┆ 2001-01-01 04:00:00 │
│ …     ┆ …                   │
│ 20    ┆ 2001-01-01 20:00:00 │
│ 21    ┆ 2001-01-01 21:00:00 │
│ 22    ┆ 2001-01-01 22:00:00 │
│ 23    ┆ 2001-01-01 23:00:00 │
│ 24    ┆ 2001-01-02 00:00:00 │
└───────┴─────────────────────┘
Compute the rolling var with the temporal windows closed on the right (default)
>>> df_temporal.with_columns(
...     rolling_row_var=pl.col("index").rolling_var_by("date", window_size="2h")
... )
shape: (25, 3)
┌───────┬─────────────────────┬─────────────────┐
│ index ┆ date                ┆ rolling_row_var │
│ ---   ┆ ---                 ┆ ---             │
│ u32   ┆ datetime[μs]        ┆ f64             │
╞═══════╪═════════════════════╪═════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 ┆ null            │
│ 1     ┆ 2001-01-01 01:00:00 ┆ 0.5             │
│ 2     ┆ 2001-01-01 02:00:00 ┆ 0.5             │
│ 3     ┆ 2001-01-01 03:00:00 ┆ 0.5             │
│ 4     ┆ 2001-01-01 04:00:00 ┆ 0.5             │
│ …     ┆ …                   ┆ …               │
│ 20    ┆ 2001-01-01 20:00:00 ┆ 0.5             │
│ 21    ┆ 2001-01-01 21:00:00 ┆ 0.5             │
│ 22    ┆ 2001-01-01 22:00:00 ┆ 0.5             │
│ 23    ┆ 2001-01-01 23:00:00 ┆ 0.5             │
│ 24    ┆ 2001-01-02 00:00:00 ┆ 0.5             │
└───────┴─────────────────────┴─────────────────┘
Compute the rolling var with the closure of windows on both sides
>>> df_temporal.with_columns(
...     rolling_row_var=pl.col("index").rolling_var_by(
...         "date", window_size="2h", closed="both"
...     )
... )
shape: (25, 3)
┌───────┬─────────────────────┬─────────────────┐
│ index ┆ date                ┆ rolling_row_var │
│ ---   ┆ ---                 ┆ ---             │
│ u32   ┆ datetime[μs]        ┆ f64             │
╞═══════╪═════════════════════╪═════════════════╡
│ 0     ┆ 2001-01-01 00:00:00 ┆ null            │
│ 1     ┆ 2001-01-01 01:00:00 ┆ 0.5             │
│ 2     ┆ 2001-01-01 02:00:00 ┆ 1.0             │
│ 3     ┆ 2001-01-01 03:00:00 ┆ 1.0             │
│ 4     ┆ 2001-01-01 04:00:00 ┆ 1.0             │
│ …     ┆ …                   ┆ …               │
│ 20    ┆ 2001-01-01 20:00:00 ┆ 1.0             │
│ 21    ┆ 2001-01-01 21:00:00 ┆ 1.0             │
│ 22    ┆ 2001-01-01 22:00:00 ┆ 1.0             │
│ 23    ┆ 2001-01-01 23:00:00 ┆ 1.0             │
│ 24    ┆ 2001-01-02 00:00:00 ┆ 1.0             │
└───────┴─────────────────────┴─────────────────┘
- round(decimals: int = 0, mode: RoundMode = 'half_to_even') Expr[source]
Round underlying floating point data by decimals digits.
- Parameters:
- decimals
Number of decimals to round by.
- mode{'half_to_even', 'half_away_from_zero', 'to_zero'}
The rounding strategy used. A "rounded value" is a value with at most decimals decimal places (e.g. integers when decimals=0, multiples of 0.1 when decimals=1, 0.01 when decimals=2, and so on).
Strategies that start with half_ round all values to the nearest rounded value, only using the strategy to break ties when a value falls exactly between two rounded values (e.g. 0.5 when decimals=0, 0.05 when decimals=1). Other rounding strategies specify explicitly which rounded value is chosen and always apply (not just for tiebreaks).
- half_to_even (default)
Round to the nearest value; break ties by choosing the nearest even value. For example, 0.5 rounds to 0, 1.5 rounds to 2, 2.5 rounds to 2. Also known as "banker's rounding"; this is the default because it tends to minimise cumulative rounding bias.
- half_away_from_zero
Round to the nearest value; break ties by rounding away from zero. For example, 0.5 rounds to 1, -0.5 rounds to -1, 2.5 rounds to 3. Also known as "commercial rounding".
- to_zero
Always round (truncate) towards zero, discarding the fractional part beyond decimals. For example, 0.9 rounds to 0, -0.9 rounds to 0, 1.29 rounds to 1.2 (with decimals=1). Equivalent to the truncate() method.
See also
ceil: Round up to the nearest integer.
floor: Round down to the nearest integer.
round_sig_figs: Round to a given number of significant figures.
truncate: Truncate to a given number of decimals.
Examples
>>> df = pl.DataFrame({"a": [0.33, 0.52, 1.02, 1.17]})
>>> df.select(pl.col("a").round(1))
shape: (4, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 0.3 │
│ 0.5 │
│ 1.0 │
│ 1.2 │
└─────┘
>>> df = pl.DataFrame(
...     {
...         "f64": [-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5],
...         "d": ["-3.5", "-2.5", "-1.5", "-0.5", "0.5", "1.5", "2.5", "3.5"],
...     },
...     schema_overrides={"d": pl.Decimal(scale=1)},
... )
>>> df.with_columns(
...     pl.all().round(mode="half_away_from_zero").name.suffix("_away"),
...     pl.all().round(mode="half_to_even").name.suffix("_to_even"),
... )
shape: (8, 6)
┌──────┬───────────────┬──────────┬───────────────┬─────────────┬───────────────┐
│ f64  ┆ d             ┆ f64_away ┆ d_away        ┆ f64_to_even ┆ d_to_even     │
│ ---  ┆ ---           ┆ ---      ┆ ---           ┆ ---         ┆ ---           │
│ f64  ┆ decimal[38,1] ┆ f64      ┆ decimal[38,1] ┆ f64         ┆ decimal[38,1] │
╞══════╪═══════════════╪══════════╪═══════════════╪═════════════╪═══════════════╡
│ -3.5 ┆ -3.5          ┆ -4.0     ┆ -4.0          ┆ -4.0        ┆ -4.0          │
│ -2.5 ┆ -2.5          ┆ -3.0     ┆ -3.0          ┆ -2.0        ┆ -2.0          │
│ -1.5 ┆ -1.5          ┆ -2.0     ┆ -2.0          ┆ -2.0        ┆ -2.0          │
│ -0.5 ┆ -0.5          ┆ -1.0     ┆ -1.0          ┆ -0.0        ┆ 0.0           │
│ 0.5  ┆ 0.5           ┆ 1.0      ┆ 1.0           ┆ 0.0         ┆ 0.0           │
│ 1.5  ┆ 1.5           ┆ 2.0      ┆ 2.0           ┆ 2.0         ┆ 2.0           │
│ 2.5  ┆ 2.5           ┆ 3.0      ┆ 3.0           ┆ 2.0         ┆ 2.0           │
│ 3.5  ┆ 3.5           ┆ 4.0      ┆ 4.0           ┆ 4.0         ┆ 4.0           │
└──────┴───────────────┴──────────┴───────────────┴─────────────┴───────────────┘
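The three strategies have close analogues in Python's standard decimal module, which can be handy for checking expectations by hand. The mapping and the helper round_decimal below are assumptions made for illustration; they are not part of the Polars API.

```python
from decimal import ROUND_DOWN, ROUND_HALF_EVEN, ROUND_HALF_UP, Decimal

# Assumed correspondence between the mode names above and decimal's constants.
MODES = {
    "half_to_even": ROUND_HALF_EVEN,
    "half_away_from_zero": ROUND_HALF_UP,  # decimal's HALF_UP breaks ties away from zero
    "to_zero": ROUND_DOWN,  # decimal's ROUND_DOWN always rounds towards zero
}


def round_decimal(text, decimals=0, mode="half_to_even"):
    """Round a decimal string to `decimals` places with the named strategy."""
    quantum = Decimal(1).scaleb(-decimals)  # 1, 0.1, 0.01, ...
    return Decimal(text).quantize(quantum, rounding=MODES[mode])


for text in ["-2.5", "-0.5", "0.5", "1.5", "2.5"]:
    print(text, [str(round_decimal(text, mode=m)) for m in MODES])
```

Working from decimal strings avoids binary floating point representation effects, so the tie cases are exact.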
- round_sig_figs(digits: int) Expr[source]
Round to a number of significant figures.
- Parameters:
- digits
Number of significant figures to round to.
See also
Examples
>>> df = pl.DataFrame({"a": [0.01234, 3.333, 1234.0]})
>>> df.with_columns(pl.col("a").round_sig_figs(2).alias("round_sig_figs"))
shape: (3, 2)
┌─────────┬────────────────┐
│ a       ┆ round_sig_figs │
│ ---     ┆ ---            │
│ f64     ┆ f64            │
╞═════════╪════════════════╡
│ 0.01234 ┆ 0.012          │
│ 3.333   ┆ 3.3            │
│ 1234.0  ┆ 1200.0         │
└─────────┴────────────────┘
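Rounding to significant figures amounts to rounding at a decimal position that depends on each value's magnitude. A plain-Python sketch of that idea follows; the helper is hypothetical, and its tie-breaking comes from Python's built-in round, which may differ from Polars.

```python
import math


def round_sig_figs(x, digits):
    """Round a float to `digits` significant figures (zero, inf and NaN pass through)."""
    if x == 0 or not math.isfinite(x):
        return x
    # Exponent of the leading digit: 0 for 3.333, 3 for 1234.0, -2 for 0.01234.
    magnitude = math.floor(math.log10(abs(x)))
    return round(x, digits - 1 - magnitude)


print([round_sig_figs(v, 2) for v in [0.01234, 3.333, 1234.0]])
# [0.012, 3.3, 1200.0]
```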
- sample(
- n: int | IntoExprColumn | None = None,
- *,
- fraction: float | IntoExprColumn | None = None,
- with_replacement: bool = False,
- shuffle: bool = False,
- seed: int | None = None,
Sample from this expression.
- Parameters:
- n
Number of items to return. Cannot be used with fraction. Defaults to 1 if fraction is None.
- fraction
Fraction of items to return. Cannot be used with n.
- with_replacement
Allow values to be sampled more than once.
- shuffle
Shuffle the order of sampled data points.
- seed
Seed for the random number generator. If set to None (default), a random seed is generated for each sample operation.
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3]})
>>> df.select(pl.col("a").sample(fraction=1.0, with_replacement=True, seed=1))
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 3   │
│ 3   │
│ 1   │
└─────┘
- search_sorted(
- element: IntoExpr | np.ndarray[Any, Any],
- side: SearchSortedSide = 'any',
- *,
- descending: bool = False,
Find indices where elements should be inserted to maintain order.
\[a[i-1] < v <= a[i]\]
- Parameters:
- element
Expression or scalar value.
- side{'any', 'left', 'right'}
If 'any', the index of the first suitable location found is given. If 'left', the index of the leftmost suitable location found is given. If 'right', the index of the rightmost suitable location found is given.
- descending
Boolean indicating whether the values are descending or not (they are required to be sorted either way).
Examples
>>> df = pl.DataFrame(
...     {
...         "values": [1, 2, 3, 5],
...     }
... )
>>> df.select(
...     [
...         pl.col("values").search_sorted(0).alias("zero"),
...         pl.col("values").search_sorted(3).alias("three"),
...         pl.col("values").search_sorted(6).alias("six"),
...     ]
... )
shape: (1, 3)
┌──────┬───────┬─────┐
│ zero ┆ three ┆ six │
│ ---  ┆ ---   ┆ --- │
│ u32  ┆ u32   ┆ u32 │
╞══════╪═══════╪═════╡
│ 0    ┆ 2     ┆ 4   │
└──────┴───────┴─────┘
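The 'left' and 'right' sides behave like the standard library's bisect functions on an ascending list. The wrapper below is a hypothetical stand-in used only to show the difference when the searched value is already present; it does not cover descending input.

```python
from bisect import bisect_left, bisect_right


def search_sorted(sorted_values, element, side="left"):
    """Insertion index that keeps an ascending list sorted."""
    if side == "right":
        return bisect_right(sorted_values, element)
    # Treat "any" like "left" here; any index in between would also be valid.
    return bisect_left(sorted_values, element)


values = [1, 2, 3, 5]
print([search_sorted(values, v) for v in (0, 3, 6)])  # [0, 2, 4]
print([search_sorted(values, v, side="right") for v in (0, 3, 6)])  # [0, 3, 4]
```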
- set_sorted( ) Expr[source]
Flags the expression as "sorted".
Enables downstream code to use fast paths for sorted arrays. It is recommended to also set whether nulls_last is True or False, as this enables many internal optimizations.
- Parameters:
- descending
Whether the Series order is descending.
- nulls_last
Whether the nulls are at the end.
Warning
This can lead to incorrect results if the data is NOT sorted!! Use with care!
Examples
>>> df = pl.DataFrame({"values": [1, 2, 3]})
>>> df.select(pl.col("values").set_sorted().max())
shape: (1, 1)
┌────────┐
│ values │
│ ---    │
│ i64    │
╞════════╡
│ 3      │
└────────┘
- shift(n: int | IntoExprColumn = 1, *, fill_value: IntoExpr | None = None) Expr[source]
Shift values by the given number of indices.
- Parameters:
- n
Number of indices to shift forward. If a negative value is passed, values are shifted in the opposite direction instead.
- fill_value
Fill the resulting null values with this scalar value.
See also
Notes
This method is similar to the LAG operation in SQL when the value for n is positive. With a negative value for n, it is similar to LEAD.
Examples
By default, values are shifted forward by one index.
>>> df = pl.DataFrame({"a": [1, 2, 3, 4]})
>>> df.with_columns(shift=pl.col("a").shift())
shape: (4, 2)
┌─────┬───────┐
│ a   ┆ shift │
│ --- ┆ ---   │
│ i64 ┆ i64   │
╞═════╪═══════╡
│ 1   ┆ null  │
│ 2   ┆ 1     │
│ 3   ┆ 2     │
│ 4   ┆ 3     │
└─────┴───────┘
Pass a negative value to shift in the opposite direction instead.
>>> df.with_columns(shift=pl.col("a").shift(-2))
shape: (4, 2)
┌─────┬───────┐
│ a   ┆ shift │
│ --- ┆ ---   │
│ i64 ┆ i64   │
╞═════╪═══════╡
│ 1   ┆ 3     │
│ 2   ┆ 4     │
│ 3   ┆ null  │
│ 4   ┆ null  │
└─────┴───────┘
Specify fill_value to fill the resulting null values.
>>> df.with_columns(shift=pl.col("a").shift(-2, fill_value=100))
shape: (4, 2)
┌─────┬───────┐
│ a   ┆ shift │
│ --- ┆ ---   │
│ i64 ┆ i64   │
╞═════╪═══════╡
│ 1   ┆ 3     │
│ 2   ┆ 4     │
│ 3   ┆ 100   │
│ 4   ┆ 100   │
└─────┴───────┘
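The LAG/LEAD analogy can be made concrete with a small list-based sketch. The helper below is illustrative only (None stands in for null) and is not the Polars implementation.

```python
def shift(values, n=1, fill_value=None):
    """Shift a list by n positions: positive n acts like LAG, negative n like LEAD."""
    size = len(values)
    if n >= 0:
        # Vacated leading slots are filled; trailing values fall off the end.
        return [fill_value] * min(n, size) + values[: max(size - n, 0)]
    # Vacated trailing slots are filled; leading values fall off the front.
    return values[-n:] + [fill_value] * min(-n, size)


a = [1, 2, 3, 4]
print(shift(a))  # [None, 1, 2, 3]
print(shift(a, -2))  # [3, 4, None, None]
print(shift(a, -2, fill_value=100))  # [3, 4, 100, 100]
```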
- shrink_dtype() Expr[source]
Shrink numeric columns to the minimal required datatype.
Shrink to the dtype needed to fit the extrema of this [Series]. This can be used to reduce memory pressure.
Changed in version 1.33.0: Deprecated and turned into a no-op. The operation does not match the Polars data-model during lazy execution since the output datatype cannot be known without inspecting the data.
Use Series.shrink_dtype instead.
Examples
>>> pl.DataFrame(
...     {
...         "a": [1, 2, 3],
...         "b": [1, 2, 2 << 32],
...         "c": [-1, 2, 1 << 30],
...         "d": [-112, 2, 112],
...         "e": [-112, 2, 129],
...         "f": ["a", "b", "c"],
...         "g": [0.1, 1.32, 0.12],
...         "h": [True, None, False],
...     }
... ).select(pl.all().shrink_dtype())
shape: (3, 8)
┌─────┬────────────┬────────────┬──────┬──────┬─────┬──────┬───────┐
│ a   ┆ b          ┆ c          ┆ d    ┆ e    ┆ f   ┆ g    ┆ h     │
│ --- ┆ ---        ┆ ---        ┆ ---  ┆ ---  ┆ --- ┆ ---  ┆ ---   │
│ i8  ┆ i64        ┆ i32        ┆ i8   ┆ i16  ┆ str ┆ f32  ┆ bool  │
╞═════╪════════════╪════════════╪══════╪══════╪═════╪══════╪═══════╡
│ 1   ┆ 1          ┆ -1         ┆ -112 ┆ -112 ┆ a   ┆ 0.1  ┆ true  │
│ 2   ┆ 2          ┆ 2          ┆ 2    ┆ 2    ┆ b   ┆ 1.32 ┆ null  │
│ 3   ┆ 8589934592 ┆ 1073741824 ┆ 112  ┆ 129  ┆ c   ┆ 0.12 ┆ false │
└─────┴────────────┴────────────┴──────┴──────┴─────┴──────┴───────┘
- shuffle(seed: int | None = None) Expr[source]
Shuffle the contents of this expression.
Note this is shuffled independently of any other column or Expression. If you want each row to stay the same use df.sample(shuffle=True)
- Parameters:
- seed
Seed for the random number generator. If set to None (default), a random seed is generated each time the shuffle is called.
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3]})
>>> df.select(pl.col("a").shuffle(seed=1))
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 2   │
│ 3   │
│ 1   │
└─────┘
- sign() Expr[source]
Compute the element-wise sign function on numeric types.
The returned value is computed as follows:
-1 if x < 0.
1 if x > 0.
x otherwise (typically 0, but could be NaN if the input is).
Null values are preserved as-is, and the dtype of the input is preserved.
Examples
>>> df = pl.DataFrame({"a": [-9.0, -0.0, 0.0, 4.0, float("nan"), None]})
>>> df.select(pl.col.a.sign())
shape: (6, 1)
┌──────┐
│ a    │
│ ---  │
│ f64  │
╞══════╡
│ -1.0 │
│ -0.0 │
│ 0.0  │
│ 1.0  │
│ NaN  │
│ null │
└──────┘
- sin() Expr[source]
Compute the element-wise value for the sine.
- Returns:
- Expr
Expression of data type Float64.
Examples
>>> df = pl.DataFrame({"a": [0.0]})
>>> df.select(pl.col("a").sin())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 0.0 │
└─────┘
- sinh() Expr[source]
Compute the element-wise value for the hyperbolic sine.
- Returns:
- Expr
Expression of data type Float64.
Examples
>>> df = pl.DataFrame({"a": [1.0]})
>>> df.select(pl.col("a").sinh())
shape: (1, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ 1.175201 │
└──────────┘
- skew(*, bias: bool = True) Expr[source]
Compute the sample skewness of a data set.
For normally distributed data, the skewness should be about zero. For unimodal continuous distributions, a skewness value greater than zero means that there is more weight in the right tail of the distribution. The function skewtest can be used to determine if the skewness value is close enough to zero, statistically speaking.
See scipy.stats for more information.
- Parameters:
- bias : bool, optional
If False, the calculations are corrected for statistical bias.
Notes
The sample skewness is computed as the Fisher-Pearson coefficient of skewness, i.e.
\[g_1=\frac{m_3}{m_2^{3/2}}\]
where
\[m_i=\frac{1}{N}\sum_{n=1}^N(x[n]-\bar{x})^i\]
is the biased sample \(i\texttt{th}\) central moment, and \(\bar{x}\) is the sample mean. If bias is False, the calculations are corrected for bias and the value computed is the adjusted Fisher-Pearson standardized moment coefficient, i.e.
\[G_1 = \frac{k_3}{k_2^{3/2}} = \frac{\sqrt{N(N-1)}}{N-2}\frac{m_3}{m_2^{3/2}}\]
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3, 2, 1]})
>>> df.select(pl.col("a").skew())
shape: (1, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ 0.343622 │
└──────────┘
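The formulas in the Notes can be evaluated directly. The following plain-Python transcription (a hypothetical helper that assumes no nulls and at least three values when bias=False) reproduces the value shown above.

```python
import math


def skew(xs, bias=True):
    """Fisher-Pearson skewness g_1, or the adjusted G_1 when bias is False."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # biased second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n  # biased third central moment
    g1 = m3 / m2**1.5
    if bias:
        return g1
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


print(round(skew([1, 2, 3, 2, 1]), 6))  # 0.343622
```

For this data m_2 = 0.56 and m_3 = 0.144, so g_1 = 0.144 / 0.56^{3/2}, which rounds to 0.343622.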
- slice( ) Expr[source]
Get a slice of this expression.
- Parameters:
- offset
Start index. Negative indexing is supported.
- length
Length of the slice. If set to None, all rows starting at the offset will be selected.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [8, 9, 10, 11],
...         "b": [None, 4, 4, 4],
...     }
... )
>>> df.select(pl.all().slice(1, 2))
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 9   ┆ 4   │
│ 10  ┆ 4   │
└─────┴─────┘
- sort( ) Expr[source]
Sort this column.
When used in a projection/selection context, the whole column is sorted. When used in a group by context, the groups are sorted.
- Parameters:
- descending
Sort in descending order.
- nulls_last
Place null values last.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, None, 3, 2],
...     }
... )
>>> df.select(pl.col("a").sort())
shape: (4, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ null │
│ 1    │
│ 2    │
│ 3    │
└──────┘
>>> df.select(pl.col("a").sort(descending=True))
shape: (4, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ null │
│ 3    │
│ 2    │
│ 1    │
└──────┘
>>> df.select(pl.col("a").sort(nulls_last=True))
shape: (4, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 3    │
│ null │
└──────┘
When sorting in a group by context, the groups are sorted.
>>> df = pl.DataFrame(
...     {
...         "group": ["one", "one", "one", "two", "two", "two"],
...         "value": [1, 98, 2, 3, 99, 4],
...     }
... )
>>> df.group_by("group").agg(pl.col("value").sort())
shape: (2, 2)
┌───────┬────────────┐
│ group ┆ value      │
│ ---   ┆ ---        │
│ str   ┆ list[i64]  │
╞═══════╪════════════╡
│ two   ┆ [3, 4, 99] │
│ one   ┆ [1, 2, 98] │
└───────┴────────────┘
- sort_by(
- by: IntoExpr | Iterable[IntoExpr],
- *more_by: IntoExpr,
- descending: bool | Sequence[bool] = False,
- nulls_last: bool | Sequence[bool] = False,
- multithreaded: bool = True,
- maintain_order: bool = False,
Sort this column by the ordering of other columns.
When used in a projection/selection context, the whole column is sorted. When used in a group by context, the groups are sorted.
- Parameters:
- by
Column(s) to sort by. Accepts expression input. Strings are parsed as column names.
- *more_by
Additional columns to sort by, specified as positional arguments.
- descending
Sort in descending order. When sorting by multiple columns, can be specified per column by passing a sequence of booleans.
- nulls_last
Place null values last; can specify a single boolean applying to all columns or a sequence of booleans for per-column control.
- multithreaded
Sort using multiple threads.
- maintain_order
Whether the order should be maintained if elements are equal.
Examples
Pass a single column name to sort by that column.
>>> df = pl.DataFrame(
...     {
...         "group": ["a", "a", "b", "b"],
...         "value1": [1, 3, 4, 2],
...         "value2": [8, 7, 6, 5],
...     }
... )
>>> df.select(pl.col("group").sort_by("value1"))
shape: (4, 1)
┌───────┐
│ group │
│ ---   │
│ str   │
╞═══════╡
│ a     │
│ b     │
│ a     │
│ b     │
└───────┘
Sorting by expressions is also supported.
>>> df.select(pl.col("group").sort_by(pl.col("value1") + pl.col("value2")))
shape: (4, 1)
┌───────┐
│ group │
│ ---   │
│ str   │
╞═══════╡
│ b     │
│ a     │
│ a     │
│ b     │
└───────┘
Sort by multiple columns by passing a list of columns.
>>> df.select(pl.col("group").sort_by(["value1", "value2"], descending=True))
shape: (4, 1)
┌───────┐
│ group │
│ ---   │
│ str   │
╞═══════╡
│ b     │
│ a     │
│ b     │
│ a     │
└───────┘
Or use positional arguments to sort by multiple columns in the same way.
>>> df.select(pl.col("group").sort_by("value1", "value2"))
shape: (4, 1)
┌───────┐
│ group │
│ ---   │
│ str   │
╞═══════╡
│ a     │
│ b     │
│ a     │
│ b     │
└───────┘
When sorting in a group by context, the groups are sorted.
>>> df.group_by("group").agg(
...     pl.col("value1").sort_by("value2")
... )
shape: (2, 2)
┌───────┬───────────┐
│ group ┆ value1    │
│ ---   ┆ ---       │
│ str   ┆ list[i64] │
╞═══════╪═══════════╡
│ a     ┆ [3, 1]    │
│ b     ┆ [2, 4]    │
└───────┴───────────┘
Take a single row from each group where a column attains its minimal value within that group.
>>> df.group_by("group").agg(
...     pl.all().sort_by("value2").first()
... )
shape: (2, 3)
┌───────┬────────┬────────┐
│ group ┆ value1 ┆ value2 │
│ ---   ┆ ---    ┆ ---    │
│ str   ┆ i64    ┆ i64    │
╞═══════╪════════╪════════╡
│ a     ┆ 3      ┆ 7      │
│ b     ┆ 2      ┆ 5      │
└───────┴────────┴────────┘
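Conceptually, sort_by computes the ordering of the key column(s) and applies that ordering to the target column. A minimal list-based sketch, using a hypothetical helper with a single descending flag for all keys and no null handling:

```python
def sort_by(values, *keys, descending=False):
    """Reorder `values` by the row-wise tuples taken from `keys` (ties keep input order)."""
    order = sorted(
        range(len(values)),
        key=lambda i: tuple(k[i] for k in keys),
        reverse=descending,
    )
    return [values[i] for i in order]


group = ["a", "a", "b", "b"]
value1 = [1, 3, 4, 2]
value2 = [8, 7, 6, 5]

print(sort_by(group, value1))  # ['a', 'b', 'a', 'b']
print(sort_by(group, [x + y for x, y in zip(value1, value2)]))  # ['b', 'a', 'a', 'b']
print(sort_by(group, value1, value2, descending=True))  # ['b', 'a', 'b', 'a']
```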
- sqrt() Expr[source]
Compute the square root of the elements.
Examples
>>> df = pl.DataFrame({"values": [1.0, 2.0, 4.0]})
>>> df.select(pl.col("values").sqrt())
shape: (3, 1)
┌──────────┐
│ values   │
│ ---      │
│ f64      │
╞══════════╡
│ 1.0      │
│ 1.414214 │
│ 2.0      │
└──────────┘
- std(ddof: int = 1) Expr[source]
Get standard deviation.
- Parameters:
- ddof
"Delta Degrees of Freedom": the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.
Examples
>>> df = pl.DataFrame({"a": [-1, 0, 1]})
>>> df.select(pl.col("a").std())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 1.0 │
└─────┘
- sub(other: Any) Expr[source]
Method equivalent of subtraction operator expr - other.
- Parameters:
- other
Numeric literal or expression value.
Examples
>>> df = pl.DataFrame({"x": [0, 1, 2, 3, 4]})
>>> df.with_columns(
...     pl.col("x").sub(2).alias("x-2"),
...     pl.col("x").sub(pl.col("x").cum_sum()).alias("x-expr"),
... )
shape: (5, 3)
┌─────┬─────┬────────┐
│ x   ┆ x-2 ┆ x-expr │
│ --- ┆ --- ┆ ---    │
│ i64 ┆ i64 ┆ i64    │
╞═════╪═════╪════════╡
│ 0   ┆ -2  ┆ 0      │
│ 1   ┆ -1  ┆ 0      │
│ 2   ┆ 0   ┆ -1     │
│ 3   ┆ 1   ┆ -3     │
│ 4   ┆ 2   ┆ -6     │
└─────┴─────┴────────┘
- sum() Expr[source]
Get sum value.
Notes
Dtypes in {Int8, UInt8, Int16, UInt16} are cast to Int64 before summing to prevent overflow issues.
If there are no non-null values, then the output is 0. If you would prefer empty sums to return None, you can use pl.when(expr.count()>0).then(expr.sum()) instead of expr.sum().
Examples
>>> df = pl.DataFrame({"a": [-1, 0, 1]})
>>> df.select(pl.col("a").sum())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
└─────┘
- tail(n: int | Expr = 10) Expr[source]
Get the last n rows.
- Parameters:
- n
Number of rows to return.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3, 4, 5, 6, 7]})
>>> df.select(pl.col("foo").tail(3))
shape: (3, 1)
┌─────┐
│ foo │
│ --- │
│ i64 │
╞═════╡
│ 5   │
│ 6   │
│ 7   │
└─────┘
- tan() Expr[source]
Compute the element-wise value for the tangent.
- Returns:
- Expr
Expression of data type Float64.
Examples
>>> df = pl.DataFrame({"a": [1.0]})
>>> df.select(pl.col("a").tan().round(2))
shape: (1, 1)
┌──────┐
│ a    │
│ ---  │
│ f64  │
╞══════╡
│ 1.56 │
└──────┘
- tanh() Expr[source]
Compute the element-wise value for the hyperbolic tangent.
- Returns:
- Expr
Expression of data type Float64.
Examples
>>> df = pl.DataFrame({"a": [1.0]})
>>> df.select(pl.col("a").tanh())
shape: (1, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ 0.761594 │
└──────────┘
- to_physical() Expr[source]
Cast to physical representation of the logical dtype.
List(inner) -> List(physical of inner)
Array(inner) -> Array(physical of inner)
Struct(fields) -> Struct(physical of fields)
Other data types will be left unchanged.
Warning
The physical representations are an implementation detail and not guaranteed to be stable.
Examples
Replicating the pandas pd.factorize function.
>>> pl.DataFrame({"vals": ["a", "x", None, "a"]}).with_columns(
...     pl.col("vals").cast(pl.Categorical),
...     pl.col("vals")
...     .cast(pl.Categorical)
...     .to_physical()
...     .alias("vals_physical"),
... )
shape: (4, 2)
┌──────┬───────────────┐
│ vals ┆ vals_physical │
│ ---  ┆ ---           │
│ cat  ┆ u32           │
╞══════╪═══════════════╡
│ a    ┆ 0             │
│ x    ┆ 1             │
│ null ┆ null          │
│ a    ┆ 0             │
└──────┴───────────────┘
- top_k(k: int | IntoExprColumn = 5) Expr[source]
Return the k largest elements.
Non-null elements are always preferred over null elements. The output is not guaranteed to be in any particular order, call sort() after this function if you wish the output to be sorted.
This has time complexity:
\[O(n)\]
- Parameters:
- k
Number of elements to return.
See also
Examples
Get the 5 largest values in series.
>>> df = pl.DataFrame({"value": [1, 98, 2, 3, 99, 4]})
>>> df.select(
...     pl.col("value").top_k().alias("top_k"),
...     pl.col("value").bottom_k().alias("bottom_k"),
... )
shape: (5, 2)
┌───────┬──────────┐
│ top_k ┆ bottom_k │
│ ---   ┆ ---      │
│ i64   ┆ i64      │
╞═══════╪══════════╡
│ 4     ┆ 1        │
│ 98    ┆ 98       │
│ 2     ┆ 2        │
│ 3     ┆ 3        │
│ 99    ┆ 4        │
└───────┴──────────┘
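Because the output order is unspecified, comparisons against another implementation are easiest after sorting. As a cross-check, the standard library's heapq gives the same sets of values; note that heapq uses a heap-based selection and is shown only for comparison, not as the algorithm Polars uses.

```python
import heapq

value = [1, 98, 2, 3, 99, 4]

# Sort both results so the comparison does not depend on output order.
top_k = sorted(heapq.nlargest(5, value))
bottom_k = sorted(heapq.nsmallest(5, value))
print(top_k)  # [2, 3, 4, 98, 99]
print(bottom_k)  # [1, 2, 3, 4, 98]
```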
- top_k_by(
- by: IntoExpr | Iterable[IntoExpr],
- k: int | IntoExprColumn = 5,
- *,
- reverse: bool | Sequence[bool] = False,
Return the elements corresponding to the k largest elements of the by column(s).
Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order, call sort() after this function if you wish the output to be sorted.
This has time complexity:
\[O(n \log{n})\]
Changed in version 1.0.0: The descending parameter was renamed to reverse.
- Parameters:
- by
Column(s) used to determine the largest elements. Accepts expression input. Strings are parsed as column names.
- k
Number of elements to return.
- reverse
Consider the k smallest elements of the by column(s) (instead of the k largest). This can be specified per column by passing a sequence of booleans.
See also
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [6, 5, 4, 3, 2, 1],
...         "c": ["Apple", "Orange", "Apple", "Apple", "Banana", "Banana"],
...     }
... )
>>> df
shape: (6, 3)
┌─────┬─────┬────────┐
│ a   ┆ b   ┆ c      │
│ --- ┆ --- ┆ ---    │
│ i64 ┆ i64 ┆ str    │
╞═════╪═════╪════════╡
│ 1   ┆ 6   ┆ Apple  │
│ 2   ┆ 5   ┆ Orange │
│ 3   ┆ 4   ┆ Apple  │
│ 4   ┆ 3   ┆ Apple  │
│ 5   ┆ 2   ┆ Banana │
│ 6   ┆ 1   ┆ Banana │
└─────┴─────┴────────┘
Get the top 2 rows by column a or b.
>>> df.select(
...     pl.all().top_k_by("a", 2).name.suffix("_top_by_a"),
...     pl.all().top_k_by("b", 2).name.suffix("_top_by_b"),
... )
shape: (2, 6)
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│ a_top_by_a ┆ b_top_by_a ┆ c_top_by_a ┆ a_top_by_b ┆ b_top_by_b ┆ c_top_by_b │
│ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
│ i64        ┆ i64        ┆ str        ┆ i64        ┆ i64        ┆ str        │
╞════════════╪════════════╪════════════╪════════════╪════════════╪════════════╡
│ 6          ┆ 1          ┆ Banana     ┆ 1          ┆ 6          ┆ Apple      │
│ 5          ┆ 2          ┆ Banana     ┆ 2          ┆ 5          ┆ Orange     │
└────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
Get the top 2 rows by multiple columns with given order.
>>> df.select(
...     pl.all()
...     .top_k_by(["c", "a"], 2, reverse=[False, True])
...     .name.suffix("_by_ca"),
...     pl.all()
...     .top_k_by(["c", "b"], 2, reverse=[False, True])
...     .name.suffix("_by_cb"),
... )
shape: (2, 6)
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ a_by_ca ┆ b_by_ca ┆ c_by_ca ┆ a_by_cb ┆ b_by_cb ┆ c_by_cb │
│ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ i64     ┆ i64     ┆ str     ┆ i64     ┆ i64     ┆ str     │
╞═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ 2       ┆ 5       ┆ Orange  ┆ 2       ┆ 5       ┆ Orange  │
│ 5       ┆ 2       ┆ Banana  ┆ 6       ┆ 1       ┆ Banana  │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
Get the top 2 rows by column a in each group.
>>> (
...     df.group_by("c", maintain_order=True)
...     .agg(pl.all().top_k_by("a", 2))
...     .explode(pl.all().exclude("c"))
... )
shape: (5, 3)
┌────────┬─────┬─────┐
│ c      ┆ a   ┆ b   │
│ ---    ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 │
╞════════╪═════╪═════╡
│ Apple  ┆ 4   ┆ 3   │
│ Apple  ┆ 3   ┆ 4   │
│ Orange ┆ 2   ┆ 5   │
│ Banana ┆ 6   ┆ 1   │
│ Banana ┆ 5   ┆ 2   │
└────────┴─────┴─────┘
- truediv(other: Any) Expr[source]
Method equivalent of float division operator expr / other.
- Parameters:
- other
Numeric literal or expression value.
See also
Notes
Zero-division behaviour follows IEEE-754:
0/0: Invalid operation - mathematically undefined, returns NaN.
n/0: On finite operands gives an exact infinite result, e.g. ±infinity.
Examples
>>> df = pl.DataFrame(
...     data={"x": [-2, -1, 0, 1, 2], "y": [0.5, 0.0, 0.0, -4.0, -0.5]}
... )
>>> df.with_columns(
...     pl.col("x").truediv(2).alias("x/2"),
...     pl.col("x").truediv(pl.col("y")).alias("x/y"),
... )
shape: (5, 4)
┌─────┬──────┬──────┬───────┐
│ x   ┆ y    ┆ x/2  ┆ x/y   │
│ --- ┆ ---  ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ f64  ┆ f64   │
╞═════╪══════╪══════╪═══════╡
│ -2  ┆ 0.5  ┆ -1.0 ┆ -4.0  │
│ -1  ┆ 0.0  ┆ -0.5 ┆ -inf  │
│ 0   ┆ 0.0  ┆ 0.0  ┆ NaN   │
│ 1   ┆ -4.0 ┆ 0.5  ┆ -0.25 │
│ 2   ┆ -0.5 ┆ 1.0  ┆ -4.0  │
└─────┴──────┴──────┴───────┘
- truncate(decimals: int = 0) Expr[source]
Truncate numeric data toward zero to decimals number of decimal places.
- Parameters:
- decimals
Number of decimal places to truncate to.
See also
ceil: Round up to the nearest integer.
floor: Round down to the nearest integer.
round: Round to a given number of decimals.
round_sig_figs: Round to a given number of significant figures.
Notes
Truncation discards the fractional part beyond the given number of decimals. For example, when truncating to 0 decimals, 0.25, -0.25, 0.99, and -0.99 all become 0. When truncating to 1 decimal, 1.9999 becomes 1.9 and -1.9999 becomes -1.9. Unlike round(), there is no tiebreak behaviour at midpoint values, so 0.5 and -0.5 also truncate to 0 when decimals=0.
This method performs numeric truncation. For truncating temporal data (dates/datetimes), use Expr.dt.truncate() instead.
Examples
>>> df = pl.DataFrame({"n": [-9.9999, 0.12345, 1.0251, 8.8765]})
>>> df.with_columns(
...     t0=pl.col("n").truncate(0),
...     t1=pl.col("n").truncate(1),
...     t2=pl.col("n").truncate(2),
...     t3=pl.col("n").truncate(3),
...     t4=pl.col("n").truncate(4),
... )
shape: (4, 6)
┌─────────┬──────┬──────┬───────┬────────┬─────────┐
│ n       ┆ t0   ┆ t1   ┆ t2    ┆ t3     ┆ t4      │
│ ---     ┆ ---  ┆ ---  ┆ ---   ┆ ---    ┆ ---     │
│ f64     ┆ f64  ┆ f64  ┆ f64   ┆ f64    ┆ f64     │
╞═════════╪══════╪══════╪═══════╪════════╪═════════╡
│ -9.9999 ┆ -9.0 ┆ -9.9 ┆ -9.99 ┆ -9.999 ┆ -9.9999 │
│ 0.12345 ┆ 0.0  ┆ 0.1  ┆ 0.12  ┆ 0.123  ┆ 0.1234  │
│ 1.0251  ┆ 1.0  ┆ 1.0  ┆ 1.02  ┆ 1.025  ┆ 1.025   │
│ 8.8765  ┆ 8.0  ┆ 8.8  ┆ 8.87  ┆ 8.876  ┆ 8.8765  │
└─────────┴──────┴──────┴───────┴────────┴─────────┘
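The truncation rule from the Notes can be sketched in plain Python. The helper `truncate_toward_zero` is hypothetical and is not how Polars implements the method. It scales by a power of ten, so a few inputs may be affected by floating-point error.

```python
import math


def truncate_toward_zero(x: float, decimals: int = 0) -> float:
    """Hypothetical analogue: drop digits beyond `decimals`, toward zero."""
    factor = 10 ** decimals
    # math.trunc discards the fractional part regardless of sign.
    return math.trunc(x * factor) / factor


# Truncation moves toward zero, whereas floor moves toward negative infinity.
toward_zero = truncate_toward_zero(-0.25)
toward_neg_inf = math.floor(-0.25)

one_decimal = [truncate_toward_zero(v, 1) for v in (1.9999, -1.9999)]
print(toward_zero, toward_neg_inf, one_decimal)
```

The contrast with `math.floor` is the point of the sketch: for negative inputs the two differ, which is why -9.9999 becomes -9.0 in the t0 column above rather than -10.0.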
- unique(*, maintain_order: bool = False) Expr[source]
Get unique values of this expression.
null is considered to be a unique value for the purposes of this operation.
- Parameters:
- maintain_order
Maintain order of data. This requires more work.
Examples
>>> df = pl.DataFrame({"a": [1, 1, 2]})
>>> df.select(pl.col("a").unique())
shape: (2, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 2   │
│ 1   │
└─────┘
>>> df.select(pl.col("a").unique(maintain_order=True))
shape: (2, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
└─────┘
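As a rough analogy only (plain Python, not Polars internals), the difference between ordered and unordered deduplication can be sketched with built-in containers. The input list is made up for illustration, and `None` stands in for null.

```python
values = [3, None, 1, 3, None, 1]

# Order-preserving: dict keys keep first-seen order (Python 3.7+).
ordered_unique = list(dict.fromkeys(values))

# Unordered: a set gives the same members with no guaranteed order.
unordered_unique = set(values)

print(ordered_unique)
```

Here `None` is kept as one of the unique values, mirroring how null counts as a unique value in `unique`.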
- unique_counts() Expr[source]
Return a count of the unique values in the order of appearance.
This method differs from value_counts in that it does not return the values, only the counts, and might be faster.
Examples
>>> df = pl.DataFrame(
...     {
...         "id": ["a", "b", "b", "c", "c", "c"],
...     }
... )
>>> df.select(pl.col("id").unique_counts())
shape: (3, 1)
┌─────┐
│ id  │
│ --- │
│ u32 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘
Note that group_by can be used to generate counts.
>>> df.group_by("id", maintain_order=True).len().select("len")
shape: (3, 1)
┌─────┐
│ len │
│ --- │
│ u32 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘
To add counts as a new column pl.len() can be used as a window function.
>>> df.with_columns(pl.len().over("id"))
shape: (6, 2)
┌─────┬─────┐
│ id  ┆ len │
│ --- ┆ --- │
│ str ┆ u32 │
╞═════╪═════╡
│ a   ┆ 1   │
│ b   ┆ 2   │
│ b   ┆ 2   │
│ c   ┆ 3   │
│ c   ┆ 3   │
│ c   ┆ 3   │
└─────┴─────┘
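The two shapes of result shown above (one count per unique value, versus one count per row) can be sketched with the standard library. This is a plain-Python analogy, not the Polars implementation.

```python
from collections import Counter

ids = ["a", "b", "b", "c", "c", "c"]

# Counter keeps keys in order of first appearance (Python 3.7+).
counts = Counter(ids)

# One count per unique value, in order of appearance (like unique_counts).
unique_counts = list(counts.values())

# One count per input row (like pl.len().over("id")).
per_row_counts = [counts[i] for i in ids]

print(unique_counts, per_row_counts)
```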
- upper_bound() Expr[source]
Calculate the upper bound.
Returns a unit Series with the highest value possible for the dtype of this expression.
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3, 2, 1]})
>>> df.select(pl.col("a").upper_bound())
shape: (1, 1)
┌─────────────────────┐
│ a                   │
│ ---                 │
│ i64                 │
╞═════════════════════╡
│ 9223372036854775807 │
└─────────────────────┘
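The value in the example is the largest signed 64-bit integer. A small sketch of where it comes from, assuming two's-complement integers. Both helper names are hypothetical and not part of Polars.

```python
def signed_upper_bound(bits: int) -> int:
    """Largest value of a two's-complement signed integer with `bits` bits."""
    return 2 ** (bits - 1) - 1


def unsigned_upper_bound(bits: int) -> int:
    """Largest value of an unsigned integer with `bits` bits."""
    return 2 ** bits - 1


i64_max = signed_upper_bound(64)
u8_max = unsigned_upper_bound(8)
print(i64_max, u8_max)
```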
- value_counts( ) Expr[source]
Count the occurrence of unique values.
- Parameters:
- sort
Sort the output by count, in descending order. If set to False (default), the order is non-deterministic.
- parallel
Execute the computation in parallel.
Note
This option should likely not be enabled in a group_by context, as the computation will already be parallelized per group.
- name
Give the resulting count column a specific name; if normalize is True this defaults to "proportion", otherwise defaults to "count".
- normalize
If True, the count is returned as the relative frequency of unique values normalized to 1.0.
- Returns:
- Expr
Expression of type Struct, mapping unique values to their count (or proportion).
Examples
>>> df = pl.DataFrame(
...     {"color": ["red", "blue", "red", "green", "blue", "blue"]}
... )
>>> df_count = df.select(pl.col("color").value_counts())
>>> df_count
shape: (3, 1)
┌─────────────┐
│ color       │
│ ---         │
│ struct[2]   │
╞═════════════╡
│ {"green",1} │
│ {"blue",3}  │
│ {"red",2}   │
└─────────────┘
>>> df_count.unnest("color")
shape: (3, 2)
┌───────┬───────┐
│ color ┆ count │
│ ---   ┆ ---   │
│ str   ┆ u32   │
╞═══════╪═══════╡
│ green ┆ 1     │
│ blue  ┆ 3     │
│ red   ┆ 2     │
└───────┴───────┘
Sort the output by (descending) count, customize the field name, and normalize the count to its relative proportion (of 1.0).
>>> df_count = df.select(
...     pl.col("color").value_counts(
...         name="fraction",
...         normalize=True,
...         sort=True,
...     )
... )
>>> df_count
shape: (3, 1)
┌────────────────────┐
│ color              │
│ ---                │
│ struct[2]          │
╞════════════════════╡
│ {"blue",0.5}       │
│ {"red",0.333333}   │
│ {"green",0.166667} │
└────────────────────┘
>>> df_count.unnest("color")
shape: (3, 2)
┌───────┬──────────┐
│ color ┆ fraction │
│ ---   ┆ ---      │
│ str   ┆ f64      │
╞═══════╪══════════╡
│ blue  ┆ 0.5      │
│ red   ┆ 0.333333 │
│ green ┆ 0.166667 │
└───────┴──────────┘
Note that group_by can be used to generate counts.
>>> df.group_by("color").len()
shape: (3, 2)
┌───────┬─────┐
│ color ┆ len │
│ ---   ┆ --- │
│ str   ┆ u32 │
╞═══════╪═════╡
│ red   ┆ 2   │
│ green ┆ 1   │
│ blue  ┆ 3   │
└───────┴─────┘
To add counts as a new column pl.len() can be used as a window function.
>>> df.with_columns(pl.len().over("color"))
shape: (6, 2)
┌───────┬─────┐
│ color ┆ len │
│ ---   ┆ --- │
│ str   ┆ u32 │
╞═══════╪═════╡
│ red   ┆ 2   │
│ blue  ┆ 3   │
│ red   ┆ 2   │
│ green ┆ 1   │
│ blue  ┆ 3   │
│ blue  ┆ 3   │
└───────┴─────┘
>>> df.with_columns((pl.len().over("color") / pl.len()).alias("fraction"))
shape: (6, 2)
┌───────┬──────────┐
│ color ┆ fraction │
│ ---   ┆ ---      │
│ str   ┆ f64      │
╞═══════╪══════════╡
│ red   ┆ 0.333333 │
│ blue  ┆ 0.5      │
│ red   ┆ 0.333333 │
│ green ┆ 0.166667 │
│ blue  ┆ 0.5      │
│ blue  ┆ 0.5      │
└───────┴──────────┘
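The arithmetic behind the sorted, normalized output can be sketched with the standard library. This is a plain-Python analogy under the same input data, not the Polars implementation.

```python
from collections import Counter

colors = ["red", "blue", "red", "green", "blue", "blue"]

counts = Counter(colors)
total = sum(counts.values())

# Descending by count, comparable to sort=True.
sorted_counts = counts.most_common()

# Relative frequency of each unique value, comparable to normalize=True.
fractions = {color: n / total for color, n in sorted_counts}

print(sorted_counts)
print(fractions)
```

The fractions sum to 1.0, which is what "normalized to 1.0" refers to in the parameter description.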
- var(ddof: int = 1) Expr[source]
Get variance.
- Parameters:
- ddof
"Delta Degrees of Freedom": the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.
Examples
>>> df = pl.DataFrame({"a": [-1, 0, 1]})
>>> df.select(pl.col("a").var())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 1.0 │
└─────┘
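The role of ddof in the divisor N - ddof can be sketched as follows. The `variance` function here is a hypothetical plain-Python illustration, not the Polars implementation, and it is cross-checked against the standard library's `statistics` module.

```python
import statistics


def variance(values: list[float], ddof: int = 1) -> float:
    """Sum of squared deviations from the mean, divided by N - ddof."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return squared_deviations / (n - ddof)


data = [-1.0, 0.0, 1.0]

sample_var = variance(data)               # ddof=1: divisor is N - 1
population_var = variance(data, ddof=0)   # ddof=0: divisor is N

print(sample_var, population_var)
print(statistics.variance(data), statistics.pvariance(data))
```

With the default ddof=1 the result for [-1, 0, 1] is 1.0, as in the example above. With ddof=0 the same data gives 2/3.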
- where(predicate: Expr) Expr[source]
Filter a single column.
Deprecated since version 0.20.4: Use the filter() method instead.
Alias for filter().
- Parameters:
- predicate
Boolean expression.
Examples
>>> df = pl.DataFrame(
...     {
...         "group_col": ["g1", "g1", "g2"],
...         "b": [1, 2, 3],
...     }
... )
>>> df.group_by("group_col").agg(
...     [
...         pl.col("b").where(pl.col("b") < 2).sum().alias("lt"),
...         pl.col("b").where(pl.col("b") >= 2).sum().alias("gte"),
...     ]
... ).sort("group_col")
shape: (2, 3)
┌───────────┬─────┬─────┐
│ group_col ┆ lt  ┆ gte │
│ ---       ┆ --- ┆ --- │
│ str       ┆ i64 ┆ i64 │
╞═══════════╪═════╪═════╡
│ g1        ┆ 1   ┆ 2   │
│ g2        ┆ 0   ┆ 3   │
└───────────┴─────┴─────┘
- xor(other: Any) Expr[source]
Method equivalent of bitwise exclusive-or operator expr ^ other.
- Parameters:
- other
Integer or boolean value; accepts expression input.
Examples
>>> df = pl.DataFrame(
...     {"x": [True, False, True, False], "y": [True, True, False, False]}
... )
>>> df.with_columns(pl.col("x").xor(pl.col("y")).alias("x ^ y"))
shape: (4, 3)
┌───────┬───────┬───────┐
│ x     ┆ y     ┆ x ^ y │
│ ---   ┆ ---   ┆ ---   │
│ bool  ┆ bool  ┆ bool  │
╞═══════╪═══════╪═══════╡
│ true  ┆ true  ┆ false │
│ false ┆ true  ┆ true  │
│ true  ┆ false ┆ true  │
│ false ┆ false ┆ false │
└───────┴───────┴───────┘
>>> def binary_string(n: int) -> str:
...     return bin(n)[2:].zfill(8)
>>>
>>> df = pl.DataFrame(
...     data={"x": [10, 8, 250, 66], "y": [1, 2, 3, 4]},
...     schema={"x": pl.UInt8, "y": pl.UInt8},
... )
>>> df.with_columns(
...     pl.col("x")
...     .map_elements(binary_string, return_dtype=pl.String)
...     .alias("bin_x"),
...     pl.col("y")
...     .map_elements(binary_string, return_dtype=pl.String)
...     .alias("bin_y"),
...     pl.col("x").xor(pl.col("y")).alias("xor_xy"),
...     pl.col("x")
...     .xor(pl.col("y"))
...     .map_elements(binary_string, return_dtype=pl.String)
...     .alias("bin_xor_xy"),
... )
shape: (4, 6)
┌─────┬─────┬──────────┬──────────┬────────┬────────────┐
│ x   ┆ y   ┆ bin_x    ┆ bin_y    ┆ xor_xy ┆ bin_xor_xy │
│ --- ┆ --- ┆ ---      ┆ ---      ┆ ---    ┆ ---        │
│ u8  ┆ u8  ┆ str      ┆ str      ┆ u8     ┆ str        │
╞═════╪═════╪══════════╪══════════╪════════╪════════════╡
│ 10  ┆ 1   ┆ 00001010 ┆ 00000001 ┆ 11     ┆ 00001011   │
│ 8   ┆ 2   ┆ 00001000 ┆ 00000010 ┆ 10     ┆ 00001010   │
│ 250 ┆ 3   ┆ 11111010 ┆ 00000011 ┆ 249    ┆ 11111001   │
│ 66  ┆ 4   ┆ 01000010 ┆ 00000100 ┆ 70     ┆ 01000110   │
└─────┴─────┴──────────┴──────────┴────────┴────────────┘
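The same element-wise results can be reproduced with Python's built-in `^` operator, which may help when checking the bit patterns by hand. This sketch reuses the `binary_string` helper from the example and works on plain lists rather than Polars columns.

```python
def binary_string(n: int) -> str:
    """Render an integer as an 8-character, zero-padded binary string."""
    return bin(n)[2:].zfill(8)


xs = [10, 8, 250, 66]
ys = [1, 2, 3, 4]

# Integer XOR: a result bit is set where exactly one input bit is set.
xor_xy = [x ^ y for x, y in zip(xs, ys)]
bin_xor_xy = [binary_string(v) for v in xor_xy]

# Boolean XOR: True where exactly one operand is True.
bool_x = [True, False, True, False]
bool_y = [True, True, False, False]
bool_xor = [a ^ b for a, b in zip(bool_x, bool_y)]

print(xor_xy, bin_xor_xy, bool_xor)
```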