DataFrame#
This page gives an overview of all public DataFrame methods.
- class polars.DataFrame(
- data: FrameInitTypes | None = None,
- schema: SchemaDefinition | None = None,
- *,
- schema_overrides: SchemaDict | None = None,
- orient: Orientation | None = None,
- infer_schema_length: int | None = 100,
- nan_to_null: bool = False,
- Two-dimensional data structure representing data as a table with rows and columns. - Parameters:
- data : dict, Sequence, ndarray, Series, or pandas.DataFrame
- Two-dimensional data in various forms; dict input must contain Sequences, Generators, or a range. Sequence input may contain Series or other Sequences.
- schema : Sequence of str, (str, DataType) pairs, or a {str: DataType} dict
- The DataFrame schema may be declared in several ways:
- As a dict of {name: type} pairs; if type is None, it will be auto-inferred. 
- As a list of column names; in this case types are automatically inferred. 
- As a list of (name,type) pairs; this is equivalent to the dictionary form. 
 - If you supply a list of column names that does not match the names in the underlying data, the names given here will overwrite them. The number of names given in the schema should match the underlying data dimensions. 
- schema_overrides : dict, default None
- Support type specification or override of one or more columns; note that any dtypes inferred from the schema param will be overridden. The number of entries in the schema should match the underlying data dimensions, unless a sequence of dictionaries is being passed, in which case a partial schema can be declared to prevent specific fields from being loaded. 
- orient : {'col', 'row'}, default None
- Whether to interpret two-dimensional data as columns or as rows. If None, the orientation is inferred by matching the columns and data dimensions. If this does not yield conclusive results, column orientation is used. 
- infer_schema_length : int, default 100
- Maximum number of rows to read for schema inference; only applies if the input data is a sequence or generator of rows; other input is read as-is. 
- nan_to_null : bool, default False
- If the data comes from one or more numpy arrays, optionally convert np.nan values in the input data to null instead. This is a no-op for all other input data. 
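- A minimal sketch of nan_to_null (assuming numpy is installed; the column name "x" is illustrative):
>>> import numpy as np
>>> arr = np.array([1.0, np.nan, 3.0])
>>> # with nan_to_null=True the NaN value is stored as a proper null
>>> pl.DataFrame({"x": arr}, nan_to_null=True)["x"].null_count()
1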
 
 - Notes - Some methods internally convert the DataFrame into a LazyFrame before collecting the results back into a DataFrame. This can lead to unexpected behavior when using a subclassed DataFrame. For example, - >>> class MyDataFrame(pl.DataFrame): ... pass ... >>> isinstance(MyDataFrame().lazy().collect(), MyDataFrame) False - Examples - Constructing a DataFrame from a dictionary: - >>> data = {"a": [1, 2], "b": [3, 4]} >>> df = pl.DataFrame(data) >>> df shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 3 │ │ 2 ┆ 4 │ └─────┴─────┘ - Notice that the dtypes are automatically inferred as polars Int64: - >>> df.dtypes [Int64, Int64] - To specify a more detailed/specific frame schema you can supply the - schemaparameter with a dictionary of (name,dtype) pairs…- >>> data = {"col1": [0, 2], "col2": [3, 7]} >>> df2 = pl.DataFrame(data, schema={"col1": pl.Float32, "col2": pl.Int64}) >>> df2 shape: (2, 2) ┌──────┬──────┐ │ col1 ┆ col2 │ │ --- ┆ --- │ │ f32 ┆ i64 │ ╞══════╪══════╡ │ 0.0 ┆ 3 │ │ 2.0 ┆ 7 │ └──────┴──────┘ - …a sequence of (name,dtype) pairs… - >>> data = {"col1": [1, 2], "col2": [3, 4]} >>> df3 = pl.DataFrame(data, schema=[("col1", pl.Float32), ("col2", pl.Int64)]) >>> df3 shape: (2, 2) ┌──────┬──────┐ │ col1 ┆ col2 │ │ --- ┆ --- │ │ f32 ┆ i64 │ ╞══════╪══════╡ │ 1.0 ┆ 3 │ │ 2.0 ┆ 4 │ └──────┴──────┘ - …or a list of typed Series. - >>> data = [ ... pl.Series("col1", [1, 2], dtype=pl.Float32), ... pl.Series("col2", [3, 4], dtype=pl.Int64), ... ] >>> df4 = pl.DataFrame(data) >>> df4 shape: (2, 2) ┌──────┬──────┐ │ col1 ┆ col2 │ │ --- ┆ --- │ │ f32 ┆ i64 │ ╞══════╪══════╡ │ 1.0 ┆ 3 │ │ 2.0 ┆ 4 │ └──────┴──────┘ - Constructing a DataFrame from a numpy ndarray, specifying column names: - >>> import numpy as np >>> data = np.array([(1, 2), (3, 4)], dtype=np.int64) >>> df5 = pl.DataFrame(data, schema=["a", "b"], orient="col") >>> df5 shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 3 │ │ 2 ┆ 4 │ └─────┴─────┘ - Constructing a DataFrame from a list of lists, row orientation inferred: - >>> data = [[1, 2, 3], [4, 5, 6]] >>> df6 = pl.DataFrame(data, schema=["a", "b", "c"]) >>> df6 shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ 1 ┆ 2 ┆ 3 │ │ 4 ┆ 5 ┆ 6 │ └─────┴─────┴─────┘ - Methods: - Apply a custom/user-defined function (UDF) over the rows of the DataFrame. - Approximate count of unique values. - Approximate count of unique values. - Return the - ksmallest elements.- Cast DataFrame column(s) to the specified dtype(s). - Create an empty (n=0) or - n-row null-filled (n>0) copy of the DataFrame.- Create a copy of this DataFrame. - Return pairwise Pearson product-moment correlation coefficients between columns. - Summary statistics for a DataFrame. - Remove columns from the dataframe. - Drop a single column in-place and return the dropped column. - Drop all rows that contain null values. - Check whether the DataFrame is equal to another DataFrame. - Return an estimation of the total (heap) allocated size of the - DataFrame.- Explode the dataframe to long format by exploding the given columns. - Extend the memory backed by this - DataFramewith the values from- other.- Fill floating point NaN values by an Expression evaluation. - Fill null values using the specified value or strategy. - Filter the rows in the DataFrame based on a predicate expression. - Find the index of a column by name. - Apply a horizontal reduction on a DataFrame. 
- Check whether the DataFrame is equal to another DataFrame. - Take every nth row in the DataFrame and return as a new DataFrame. - Get a single column by name. - Find the index of a column by name. - Get the DataFrame as a List of Series. - Return a dense preview of the DataFrame. - Start a group by operation. - Group based on a time value (or index value of type Int32, Int64). - Create rolling groups based on a time, Int32, or Int64 column. - Start a group by operation. - Group based on a time value (or index value of type Int32, Int64). - Create rolling groups based on a time, Int32, or Int64 column. - Hash and combine the rows in this DataFrame. - Get the first - nrows.- Return a new DataFrame grown horizontally by stacking multiple Series to it. - Insert a Series at a certain column index. - Insert a Series at a certain column index. - Interpolate intermediate values. - Get a mask of all duplicated rows in this DataFrame. - Check if the dataframe is empty. - Get a mask of all unique rows in this DataFrame. - Return the DataFrame as a scalar, or return the element at the given row/column. - Returns an iterator over the DataFrame's columns. - Returns an iterator over the DataFrame of rows of python-native values. - Returns a non-copying iterator of slices over the underlying DataFrame. - Join in SQL-like fashion. - Perform an asof join. - Start a lazy query from this point. - Get the first - nrows.- Apply a custom/user-defined function (UDF) over the rows of the DataFrame. - Aggregate the columns of this DataFrame to their maximum value. - Get the maximum value horizontally across columns. - Aggregate the columns of this DataFrame to their mean value. - Take the mean of all values horizontally across columns. - Aggregate the columns of this DataFrame to their median value. - Unpivot a DataFrame from wide to long format. - Take two sorted DataFrames and merge them by the sorted key. - Aggregate the columns of this DataFrame to their minimum value. - Get the minimum value horizontally across columns. - Get number of chunks used by the ChunkedArrays of this DataFrame. - Return the number of unique rows, or the number of unique row-subsets. - Create a new DataFrame that shows the null counts per column. - Group by the given columns and return the groups as separate dataframes. - Offers a structured way to apply a sequence of user-defined functions (UDFs). - Create a spreadsheet-style pivot table as a DataFrame. - Aggregate the columns of this DataFrame to their product values. - Aggregate the columns of this DataFrame to their quantile value. - Rechunk the data in this DataFrame to a contiguous allocation. - Rename column names. - Replace a column by a new Series. - Replace a column at an index location. - Replace a column at an index location. - Reverse the DataFrame. - Create rolling groups based on a time, Int32, or Int64 column. - Get the values of a single row, either by index or by predicate. - Returns all data in the DataFrame as a list of rows of python-native values. - Returns DataFrame data as a keyed dictionary of python-native values. - Sample from this DataFrame. - Select columns from this DataFrame. - Select columns from this LazyFrame. - Indicate that one or multiple columns are sorted. - Shift values by the given number of indices. - Shift values by the given number of places and fill the resulting null values. - Shrink DataFrame memory usage. - Get a slice of this DataFrame. - Sort the dataframe by the given columns. 
- Aggregate the columns of this DataFrame to their standard deviation value. - Aggregate the columns of this DataFrame to their sum value. - Sum all values horizontally across columns. - Get the last - nrows.- Take every nth row in the DataFrame and return as a new DataFrame. - Collect the underlying arrow arrays in an Arrow Table. - Convert DataFrame to a dictionary mapping column name to values. - Convert every row to a dictionary of Python-native values. - Convert categorical variables into dummy/indicator variables. - Convert DataFrame to instantiatable string representation. - Convert DataFrame to a 2D NumPy array. - Cast to a pandas DataFrame. - Select column as Series at index location. - Convert a - DataFrameto a- Seriesof type- Struct.- Return the - klargest elements.- Transpose a DataFrame over the diagonal. - Drop duplicate rows from this dataframe. - Decompose struct columns into separate columns for each of their fields. - Unstack a long table to a wide form without doing an aggregation. - Update the values in this - DataFramewith the values in- other.- Upsample a DataFrame at a regular frequency. - Aggregate the columns of this DataFrame to their variance value. - Grow this DataFrame vertically by stacking a DataFrame to it. - Add columns to this DataFrame. - Add columns to this DataFrame. - Add a column at index 0 that counts the rows. - Write to Apache Avro file. - Write to comma-separated values (CSV) file. - Write a polars frame to a database. - Write DataFrame as delta table. - Write frame data to a table in an Excel workbook/worksheet. - Write to Arrow IPC binary stream or Feather file. - Write to Arrow IPC record batch stream. - Serialize to JSON representation. - Serialize to newline delimited JSON representation. - Write to Apache Parquet file. - Attributes: - Get or set column names. - Get the datatypes of the columns of this DataFrame. - Get flags that are set on the columns of this DataFrame. - Get the height of the DataFrame. - Get a dict[column name, DataType]. - Get the shape of the DataFrame. - Get the width of the DataFrame. - apply(
- function: Callable[[tuple[Any, ...]], Any],
- return_dtype: PolarsDataType | None = None,
- *,
- inference_size: int = 256,
- Apply a custom/user-defined function (UDF) over the rows of the DataFrame. - Deprecated since version 0.19.0: This method has been renamed to DataFrame.map_rows() (a short sketch of map_rows follows below). - Parameters:
- function
- Custom function or lambda. 
- return_dtype
- Output type of the operation. If none given, Polars tries to infer the type. 
- inference_size
- Only used when the custom function returns rows. This uses the first n rows to determine the output schema. 
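- A minimal sketch of the renamed method, map_rows (the data and lambda are illustrative):
>>> df = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
>>> # each row is passed to the UDF as a tuple of its values;
>>> # returning a tuple produces a new multi-column DataFrame
>>> df.map_rows(lambda row: (row[0] * 2, row[1] + 1))
Here the result is a two-column frame containing the rows (2, 4) and (4, 5).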
 
 
 - approx_n_unique() DataFrame[source]
- Approximate count of unique values. - This is done using the HyperLogLog++ algorithm for cardinality estimation. - Examples - >>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [1, 2, 1, 1], ... } ... ) >>> df.approx_n_unique() shape: (1, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ u32 ┆ u32 │ ╞═════╪═════╡ │ 4 ┆ 2 │ └─────┴─────┘ 
 - approx_unique() DataFrame[source]
- Approximate count of unique values. - Deprecated since version 0.18.12: This method has been renamed to - DataFrame.approx_n_unique().
 - bottom_k(
- k: int,
- *,
- by: IntoExpr | Iterable[IntoExpr],
- descending: bool | Sequence[bool] = False,
- nulls_last: bool = False,
- maintain_order: bool = False,
- Return the k smallest elements. If `descending=True`, the largest elements will be given. - Parameters:
- k
- Number of rows to return. 
- by
- Column(s) included in sort order. Accepts expression input. Strings are parsed as column names. 
- descending
- Return the 'k' largest instead of the smallest. The sort direction can be specified per column by passing a sequence of booleans. 
- nulls_last
- Place null values last. 
- maintain_order
- Whether the order should be maintained if elements are equal. Note that if true, streaming is not possible and performance might be worse since this requires a stable search.
 
 - See also - Examples - >>> df = pl.DataFrame( ... { ... "a": ["a", "b", "a", "b", "b", "c"], ... "b": [2, 1, 1, 3, 2, 1], ... } ... ) - Get the rows which contain the 4 smallest values in column b. - >>> df.bottom_k(4, by="b") shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ b ┆ 1 │ │ a ┆ 1 │ │ c ┆ 1 │ │ a ┆ 2 │ └─────┴─────┘ - Get the rows which contain the 4 smallest values when sorting on column a and b. - >>> df.bottom_k(4, by=["a", "b"]) shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ a ┆ 1 │ │ a ┆ 2 │ │ b ┆ 1 │ │ b ┆ 2 │ └─────┴─────┘ 
 - cast(
- dtypes: Mapping[ColumnNameOrSelector, PolarsDataType] | PolarsDataType,
- *,
- strict: bool = True,
- Cast DataFrame column(s) to the specified dtype(s). - Parameters:
- dtypes
- Mapping of column names (or selector) to dtypes, or a single dtype to which all columns will be cast. 
- strict
- Throw an error if a cast could not be done (for instance, due to an overflow). 
 
 - Examples - >>> from datetime import date >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": [date(2020, 1, 2), date(2021, 3, 4), date(2022, 5, 6)], ... } ... ) - Cast specific frame columns to the specified dtypes: - >>> df.cast({"foo": pl.Float32, "bar": pl.UInt8}) shape: (3, 3) ┌─────┬─────┬────────────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f32 ┆ u8 ┆ date │ ╞═════╪═════╪════════════╡ │ 1.0 ┆ 6 ┆ 2020-01-02 │ │ 2.0 ┆ 7 ┆ 2021-03-04 │ │ 3.0 ┆ 8 ┆ 2022-05-06 │ └─────┴─────┴────────────┘ - Cast all frame columns to the specified dtype: - >>> df.cast(pl.Utf8).to_dict(as_series=False) {'foo': ['1', '2', '3'], 'bar': ['6.0', '7.0', '8.0'], 'ham': ['2020-01-02', '2021-03-04', '2022-05-06']} - Use selectors to define the columns being cast: - >>> import polars.selectors as cs >>> df.cast({cs.numeric(): pl.UInt32, cs.temporal(): pl.Utf8}) shape: (3, 3) ┌─────┬─────┬────────────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ u32 ┆ u32 ┆ str │ ╞═════╪═════╪════════════╡ │ 1 ┆ 6 ┆ 2020-01-02 │ │ 2 ┆ 7 ┆ 2021-03-04 │ │ 3 ┆ 8 ┆ 2022-05-06 │ └─────┴─────┴────────────┘ 
 - clear(n: int = 0) Self[source]
- Create an empty (n=0) or n-row null-filled (n>0) copy of the DataFrame. - Returns an n-row null-filled DataFrame with an identical schema. - n can be greater than the current number of rows in the DataFrame. - Parameters:
- n
- Number of (null-filled) rows to return in the cleared frame. 
 
 - See also - clone
- Cheap deepcopy/clone. 
 - Examples - >>> df = pl.DataFrame( ... { ... "a": [None, 2, 3, 4], ... "b": [0.5, None, 2.5, 13], ... "c": [True, True, False, None], ... } ... ) >>> df.clear() shape: (0, 3) ┌─────┬─────┬──────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool │ ╞═════╪═════╪══════╡ └─────┴─────┴──────┘ - >>> df.clear(n=2) shape: (2, 3) ┌──────┬──────┬──────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool │ ╞══════╪══════╪══════╡ │ null ┆ null ┆ null │ │ null ┆ null ┆ null │ └──────┴──────┴──────┘ 
 - clone() Self[source]
- Create a copy of this DataFrame. - This is a cheap operation that does not copy data. - See also - clear
- Create an empty copy of the current DataFrame, with identical schema but no data. 
 - Examples - >>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [0.5, 4, 10, 13], ... "c": [True, True, False, True], ... } ... ) >>> df.clone() shape: (4, 3) ┌─────┬──────┬───────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool │ ╞═════╪══════╪═══════╡ │ 1 ┆ 0.5 ┆ true │ │ 2 ┆ 4.0 ┆ true │ │ 3 ┆ 10.0 ┆ false │ │ 4 ┆ 13.0 ┆ true │ └─────┴──────┴───────┘ 
 - property columns: list[str][source]
- Get or set column names. - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.columns ['foo', 'bar', 'ham'] - Set column names: - >>> df.columns = ["apple", "banana", "orange"] >>> df shape: (3, 3) ┌───────┬────────┬────────┐ │ apple ┆ banana ┆ orange │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═══════╪════════╪════════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ └───────┴────────┴────────┘ 
 - corr(**kwargs: Any) DataFrame[source]
- Return pairwise Pearson product-moment correlation coefficients between columns. - See numpy corrcoef for more information: https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html - Parameters:
- **kwargs
- Keyword arguments are passed to numpy corrcoef.
 
 - Notes - This functionality requires numpy to be installed. - Examples - >>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [3, 2, 1], "ham": [7, 8, 9]}) >>> df.corr() shape: (3, 3) ┌──────┬──────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ f64 │ ╞══════╪══════╪══════╡ │ 1.0 ┆ -1.0 ┆ 1.0 │ │ -1.0 ┆ 1.0 ┆ -1.0 │ │ 1.0 ┆ -1.0 ┆ 1.0 │ └──────┴──────┴──────┘ 
 - describe( ) Self[source]
- Summary statistics for a DataFrame. - Parameters:
- percentiles
- One or more percentiles to include in the summary statistics. All values must be in the range [0, 1] (see the short sketch after the example below).
 
 - See also - Notes - The median is included by default as the 50% percentile. - Examples - >>> from datetime import date >>> df = pl.DataFrame( ... { ... "a": [1.0, 2.8, 3.0], ... "b": [4, 5, None], ... "c": [True, False, True], ... "d": [None, "b", "c"], ... "e": ["usd", "eur", None], ... "f": [date(2020, 1, 1), date(2021, 1, 1), date(2022, 1, 1)], ... } ... ) >>> df.describe() shape: (9, 7) ┌────────────┬──────────┬──────────┬──────────┬──────┬──────┬────────────┐ │ describe ┆ a ┆ b ┆ c ┆ d ┆ e ┆ f │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ f64 ┆ f64 ┆ f64 ┆ str ┆ str ┆ str │ ╞════════════╪══════════╪══════════╪══════════╪══════╪══════╪════════════╡ │ count ┆ 3.0 ┆ 3.0 ┆ 3.0 ┆ 3 ┆ 3 ┆ 3 │ │ null_count ┆ 0.0 ┆ 1.0 ┆ 0.0 ┆ 1 ┆ 1 ┆ 0 │ │ mean ┆ 2.266667 ┆ 4.5 ┆ 0.666667 ┆ null ┆ null ┆ null │ │ std ┆ 1.101514 ┆ 0.707107 ┆ 0.57735 ┆ null ┆ null ┆ null │ │ min ┆ 1.0 ┆ 4.0 ┆ 0.0 ┆ b ┆ eur ┆ 2020-01-01 │ │ 25% ┆ 1.0 ┆ 4.0 ┆ null ┆ null ┆ null ┆ null │ │ 50% ┆ 2.8 ┆ 5.0 ┆ null ┆ null ┆ null ┆ null │ │ 75% ┆ 3.0 ┆ 5.0 ┆ null ┆ null ┆ null ┆ null │ │ max ┆ 3.0 ┆ 5.0 ┆ 1.0 ┆ c ┆ usd ┆ 2022-01-01 │ └────────────┴──────────┴──────────┴──────────┴──────┴──────┴────────────┘ 
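- A short sketch of requesting custom percentiles (values must lie in [0, 1]; the exact set of summary rows may vary slightly between versions):
>>> df.describe(percentiles=[0.1, 0.9])
This reports 10% and 90% rows in the summary instead of the default quartiles.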
 - drop(
- columns: ColumnNameOrSelector | Collection[ColumnNameOrSelector],
- *more_columns: ColumnNameOrSelector,
- Remove columns from the dataframe. - Parameters:
- columns
- Names of the columns that should be removed from the dataframe, or a selector that determines the columns to drop. 
- *more_columns
- Additional columns to drop, specified as positional arguments. 
 
 - Examples - Drop a single column by passing the name of that column. - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.drop("ham") shape: (3, 2) ┌─────┬─────┐ │ foo ┆ bar │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪═════╡ │ 1 ┆ 6.0 │ │ 2 ┆ 7.0 │ │ 3 ┆ 8.0 │ └─────┴─────┘ - Drop multiple columns by passing a list of column names. - >>> df.drop(["bar", "ham"]) shape: (3, 1) ┌─────┐ │ foo │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ │ 2 │ │ 3 │ └─────┘ - Drop multiple columns by passing a selector. - >>> import polars.selectors as cs >>> df.drop(cs.numeric()) shape: (3, 1) ┌─────┐ │ ham │ │ --- │ │ str │ ╞═════╡ │ a │ │ b │ │ c │ └─────┘ - Use positional arguments to drop multiple columns. - >>> df.drop("foo", "ham") shape: (3, 1) ┌─────┐ │ bar │ │ --- │ │ f64 │ ╞═════╡ │ 6.0 │ │ 7.0 │ │ 8.0 │ └─────┘ 
 - drop_in_place(name: str) Series[source]
- Drop a single column in-place and return the dropped column. - Parameters:
- name
- Name of the column to drop. 
 
- Returns:
- Series
- The dropped column. 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.drop_in_place("ham") shape: (3,) Series: 'ham' [str] [ "a" "b" "c" ] 
 - drop_nulls(
- subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None,
- Drop all rows that contain null values. - The original order of the remaining rows is preserved. - Parameters:
- subset
- Column name(s) for which null values are considered. If set to None (default), use all columns.
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, None, 8], ... "ham": ["a", "b", None], ... } ... ) - The default behavior of this method is to drop rows where any single value of the row is null. - >>> df.drop_nulls() shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ └─────┴─────┴─────┘ - This behaviour can be constrained to consider only a subset of columns, as defined by name or with a selector. For example, dropping rows if there is a null in any of the integer columns: - >>> import polars.selectors as cs >>> df.drop_nulls(subset=cs.integer()) shape: (2, 3) ┌─────┬─────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪══════╡ │ 1 ┆ 6 ┆ a │ │ 3 ┆ 8 ┆ null │ └─────┴─────┴──────┘ - Below are some additional examples that show how to drop null values based on other conditions. - >>> df = pl.DataFrame( ... { ... "a": [None, None, None, None], ... "b": [1, 2, None, 1], ... "c": [1, None, None, 1], ... } ... ) >>> df shape: (4, 3) ┌──────┬──────┬──────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ f32 ┆ i64 ┆ i64 │ ╞══════╪══════╪══════╡ │ null ┆ 1 ┆ 1 │ │ null ┆ 2 ┆ null │ │ null ┆ null ┆ null │ │ null ┆ 1 ┆ 1 │ └──────┴──────┴──────┘ - Drop a row only if all values are null: - >>> df.filter(~pl.all_horizontal(pl.all().is_null())) shape: (3, 3) ┌──────┬─────┬──────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ f32 ┆ i64 ┆ i64 │ ╞══════╪═════╪══════╡ │ null ┆ 1 ┆ 1 │ │ null ┆ 2 ┆ null │ │ null ┆ 1 ┆ 1 │ └──────┴─────┴──────┘ - Drop a column if all values are null: - >>> df[[s.name for s in df if not (s.null_count() == df.height)]] shape: (4, 2) ┌──────┬──────┐ │ b ┆ c │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════╪══════╡ │ 1 ┆ 1 │ │ 2 ┆ null │ │ null ┆ null │ │ 1 ┆ 1 │ └──────┴──────┘ 
 - property dtypes: list[PolarsDataType][source]
- Get the datatypes of the columns of this DataFrame. - The datatypes can also be found in column headers when printing the DataFrame. - See also - schema
- Returns a {colname:dtype} mapping. 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.dtypes [Int64, Float64, Utf8] >>> df shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6.0 ┆ a │ │ 2 ┆ 7.0 ┆ b │ │ 3 ┆ 8.0 ┆ c │ └─────┴─────┴─────┘ 
 - equals(
- other: DataFrame,
- *,
- null_equal: bool = True,
- Check whether the DataFrame is equal to another DataFrame. - Parameters:
- other
- DataFrame to compare with. 
- null_equal
- Consider null values as equal. 
 
 - See also - assert_frame_equal
 - Examples - >>> df1 = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> df2 = pl.DataFrame( ... { ... "foo": [3, 2, 1], ... "bar": [8.0, 7.0, 6.0], ... "ham": ["c", "b", "a"], ... } ... ) >>> df1.equals(df1) True >>> df1.equals(df2) False 
 - estimated_size(unit: SizeUnit = 'b') int | float[source]
- Return an estimation of the total (heap) allocated size of the DataFrame. - Estimated size is given in the specified unit (bytes by default). - This estimation is the sum of the size of its buffers and validity bitmaps, including nested arrays. Multiple arrays may share buffers and bitmaps, so the size of 2 arrays is not the sum of the sizes computed from this function. In particular, a StructArray's size is an upper bound. - When an array is sliced, its allocated size remains constant because the buffer is unchanged; however, this function will yield a smaller number, because it returns the visible size of the buffer, not its total capacity. - FFI buffers are included in this estimation. - Parameters:
- unit : {'b', 'kb', 'mb', 'gb', 'tb'}
- Scale the returned size to the given unit. 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "x": list(reversed(range(1_000_000))), ... "y": [v / 1000 for v in range(1_000_000)], ... "z": [str(v) for v in range(1_000_000)], ... }, ... schema=[("x", pl.UInt32), ("y", pl.Float64), ("z", pl.Utf8)], ... ) >>> df.estimated_size() 25888898 >>> df.estimated_size("mb") 24.689577102661133 
 - explode( ) DataFrame[source]
- Explode the dataframe to long format by exploding the given columns. - Parameters:
- columns
- Column names, expressions, or a selector defining them. The underlying columns being exploded must be of List or Utf8 datatype. 
- *more_columns
- Additional names of columns to explode, specified as positional arguments. 
 
- Returns:
- DataFrame
 
 - Examples - >>> df = pl.DataFrame( ... { ... "letters": ["a", "a", "b", "c"], ... "numbers": [[1], [2, 3], [4, 5], [6, 7, 8]], ... } ... ) >>> df shape: (4, 2) ┌─────────┬───────────┐ │ letters ┆ numbers │ │ --- ┆ --- │ │ str ┆ list[i64] │ ╞═════════╪═══════════╡ │ a ┆ [1] │ │ a ┆ [2, 3] │ │ b ┆ [4, 5] │ │ c ┆ [6, 7, 8] │ └─────────┴───────────┘ >>> df.explode("numbers") shape: (8, 2) ┌─────────┬─────────┐ │ letters ┆ numbers │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════════╪═════════╡ │ a ┆ 1 │ │ a ┆ 2 │ │ a ┆ 3 │ │ b ┆ 4 │ │ b ┆ 5 │ │ c ┆ 6 │ │ c ┆ 7 │ │ c ┆ 8 │ └─────────┴─────────┘ 
 - extend(other: DataFrame) Self[source]
- Extend the memory backed by this DataFrame with the values from other. - Different from vstack, which adds the chunks from other to the chunks of this DataFrame, extend appends the data from other to the underlying memory locations and thus may cause a reallocation. - If this does not cause a reallocation, the resulting data structure will not have any extra chunks and thus will yield faster queries. - Prefer extend over vstack when you want to do a query after a single append, for instance during online operations where you add n rows and rerun a query. - Prefer vstack over extend when you want to append many times before doing a query, for instance when you read in multiple files and want to store them in a single DataFrame. In the latter case, finish the sequence of vstack operations with a rechunk (see the sketch after the example below). - Parameters:
- other
- DataFrame to vertically add. 
 
 - Warning - This method modifies the dataframe in-place. The dataframe is returned for convenience only. - See also - Examples - >>> df1 = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> df2 = pl.DataFrame({"foo": [10, 20, 30], "bar": [40, 50, 60]}) >>> df1.extend(df2) shape: (6, 2) ┌─────┬─────┐ │ foo ┆ bar │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 4 │ │ 2 ┆ 5 │ │ 3 ┆ 6 │ │ 10 ┆ 40 │ │ 20 ┆ 50 │ │ 30 ┆ 60 │ └─────┴─────┘ 
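- A minimal sketch of the vstack-then-rechunk pattern recommended above for appending many frames before querying (the frame contents are illustrative):
>>> frames = [pl.DataFrame({"foo": [i], "bar": [i * 10]}) for i in range(3)]
>>> out = frames[0]
>>> for f in frames[1:]:
...     out = out.vstack(f)  # cheap: only appends chunks, no data copy
>>> out = out.rechunk()  # make the memory contiguous before running queries
>>> out.n_chunks()
1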
 - fill_nan(value: Expr | int | float | None) DataFrame[source]
- Fill floating point NaN values by an Expression evaluation. - Parameters:
- value
- Value with which to replace NaN values. 
 
- Returns:
- DataFrame
- DataFrame with NaN values replaced by the given value. 
 
 - Warning - Note that floating point NaNs (Not a Number) are not missing values! To replace missing values, use - fill_null().- See also - Examples - >>> df = pl.DataFrame( ... { ... "a": [1.5, 2, float("nan"), 4], ... "b": [0.5, 4, float("nan"), 13], ... } ... ) >>> df.fill_nan(99) shape: (4, 2) ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════╪══════╡ │ 1.5 ┆ 0.5 │ │ 2.0 ┆ 4.0 │ │ 99.0 ┆ 99.0 │ │ 4.0 ┆ 13.0 │ └──────┴──────┘ 
 - fill_null(
- value: Any | None = None,
- strategy: FillNullStrategy | None = None,
- limit: int | None = None,
- *,
- matches_supertype: bool = True,
- Fill null values using the specified value or strategy. - Parameters:
- value
- Value used to fill null values. 
- strategy : {None, 'forward', 'backward', 'min', 'max', 'mean', 'zero', 'one'}
- Strategy used to fill null values. 
- limit
- Number of consecutive null values to fill when using the 'forward' or 'backward' strategy (see the short sketch after the examples below). 
- matches_supertype
- Fill all matching supertypes of the fill value.
 
- Returns:
- DataFrame
- DataFrame with None values replaced by the filling strategy. 
 
 - See also - Examples - >>> df = pl.DataFrame( ... { ... "a": [1, 2, None, 4], ... "b": [0.5, 4, None, 13], ... } ... ) >>> df.fill_null(99) shape: (4, 2) ┌─────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪══════╡ │ 1 ┆ 0.5 │ │ 2 ┆ 4.0 │ │ 99 ┆ 99.0 │ │ 4 ┆ 13.0 │ └─────┴──────┘ >>> df.fill_null(strategy="forward") shape: (4, 2) ┌─────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪══════╡ │ 1 ┆ 0.5 │ │ 2 ┆ 4.0 │ │ 2 ┆ 4.0 │ │ 4 ┆ 13.0 │ └─────┴──────┘ - >>> df.fill_null(strategy="max") shape: (4, 2) ┌─────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪══════╡ │ 1 ┆ 0.5 │ │ 2 ┆ 4.0 │ │ 4 ┆ 13.0 │ │ 4 ┆ 13.0 │ └─────┴──────┘ - >>> df.fill_null(strategy="zero") shape: (4, 2) ┌─────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪══════╡ │ 1 ┆ 0.5 │ │ 2 ┆ 4.0 │ │ 0 ┆ 0.0 │ │ 4 ┆ 13.0 │ └─────┴──────┘ 
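- A short sketch of the limit parameter (the data is illustrative); with limit=1 only the first of the two consecutive nulls is forward-filled:
>>> df = pl.DataFrame({"a": [1, None, None, 4]})
>>> df.fill_null(strategy="forward", limit=1)["a"].to_list()
[1, 1, None, 4]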
 - filter(
- *predicates: IntoExprColumn | Iterable[IntoExprColumn] | bool | list[bool] | np.ndarray[Any, Any],
- **constraints: Any,
- Filter the rows in the DataFrame based on a predicate expression. - The original order of the remaining rows is preserved. - Parameters:
- predicates
- Expression that evaluates to a boolean Series. 
- constraints
- Column filters. Use name=value to filter column name by the supplied value. 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) - Filter on one condition: - >>> df.filter(pl.col("foo") > 1) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘ - Filter on multiple conditions, combined with and/or operators: - >>> df.filter((pl.col("foo") < 3) & (pl.col("ham") == "a")) shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ └─────┴─────┴─────┘ - >>> df.filter((pl.col("foo") == 1) | (pl.col("ham") == "c")) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘ - Provide multiple filters using - *argssyntax:- >>> df.filter( ... pl.col("foo") <= 2, ... ~pl.col("ham").is_in(["b", "c"]), ... ) shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ └─────┴─────┴─────┘ - Provide multiple filters using - **kwargssyntax:- >>> df.filter(foo=2, ham="b") shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 2 ┆ 7 ┆ b │ └─────┴─────┴─────┘ 
 - find_idx_by_name(name: str) int[source]
- Find the index of a column by name. - Deprecated since version 0.19.14: This method has been renamed to - get_column_index().- Parameters:
- name
- Name of the column to find. 
 
 
 - property flags: dict[str, dict[str, bool]][source]
- Get flags that are set on the columns of this DataFrame. - Returns:
- dict
- Mapping from column names to column flags. 
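- A minimal sketch of reading column flags after marking a column as sorted (the exact flag names may vary by version; SORTED_ASC/SORTED_DESC are the usual ones):
>>> df = pl.DataFrame({"a": [1, 2, 3]}).set_sorted("a")
>>> df.flags
{'a': {'SORTED_ASC': True, 'SORTED_DESC': False}}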
 
 
 - fold(operation: Callable[[Series, Series], Series]) Series[source]
- Apply a horizontal reduction on a DataFrame. - This can be used to effectively determine aggregations on a row level, and can be applied to any DataType that can be supercast (cast to a similar parent type). - Examples of the supercast rules when applying an arithmetic operation on two DataTypes are: - Int8 + Utf8 = Utf8 
- Float32 + Int64 = Float32 
- Float32 + Float64 = Float64 
 - Parameters:
- operation
- Function that takes two Series and returns a Series.
 
 - Examples - A horizontal sum operation: - >>> df = pl.DataFrame( ... { ... "a": [2, 1, 3], ... "b": [1, 2, 3], ... "c": [1.0, 2.0, 3.0], ... } ... ) >>> df.fold(lambda s1, s2: s1 + s2) shape: (3,) Series: 'a' [f64] [ 4.0 5.0 9.0 ] - A horizontal minimum operation: - >>> df = pl.DataFrame({"a": [2, 1, 3], "b": [1, 2, 3], "c": [1.0, 2.0, 3.0]}) >>> df.fold(lambda s1, s2: s1.zip_with(s1 < s2, s2)) shape: (3,) Series: 'a' [f64] [ 1.0 1.0 3.0 ] - A horizontal string concatenation: - >>> df = pl.DataFrame( ... { ... "a": ["foo", "bar", 2], ... "b": [1, 2, 3], ... "c": [1.0, 2.0, 3.0], ... } ... ) >>> df.fold(lambda s1, s2: s1 + s2) shape: (3,) Series: 'a' [str] [ "foo11.0" "bar22.0" null ] - A horizontal boolean or, similar to a row-wise .any(): - >>> df = pl.DataFrame( ... { ... "a": [False, False, True], ... "b": [False, True, False], ... } ... ) >>> df.fold(lambda s1, s2: s1 | s2) shape: (3,) Series: 'a' [bool] [ false true true ] 
 - frame_equal(
- other: DataFrame,
- *,
- null_equal: bool = True,
- Check whether the DataFrame is equal to another DataFrame. - Deprecated since version 0.19.16: This method has been renamed to - equals().- Parameters:
- other
- DataFrame to compare with. 
- null_equal
- Consider null values as equal. 
 
 
 - gather_every(n: int) DataFrame[source]
- Take every nth row in the DataFrame and return as a new DataFrame. - Parameters:
- n
- Gather every n-th row. 
 
 - Examples - >>> s = pl.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]}) >>> s.gather_every(2) shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 5 │ │ 3 ┆ 7 │ └─────┴─────┘ 
 - get_column(name: str) Series[source]
- Get a single column by name. - Parameters:
- name : str
- Name of the column to retrieve. 
 
- Returns:
- Series
 
 - See also - Examples - >>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> df.get_column("foo") shape: (3,) Series: 'foo' [i64] [ 1 2 3 ] 
 - get_column_index(name: str) int[source]
- Find the index of a column by name. - Parameters:
- name
- Name of the column to find. 
 
 - Examples - >>> df = pl.DataFrame( ... {"foo": [1, 2, 3], "bar": [6, 7, 8], "ham": ["a", "b", "c"]} ... ) >>> df.get_column_index("ham") 2 
 - get_columns() list[Series][source]
- Get the DataFrame as a List of Series. - Examples - >>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> df.get_columns() [shape: (3,) Series: 'foo' [i64] [ 1 2 3 ], shape: (3,) Series: 'bar' [i64] [ 4 5 6 ]] - >>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [0.5, 4, 10, 13], ... "c": [True, True, False, True], ... } ... ) >>> df.get_columns() [shape: (4,) Series: 'a' [i64] [ 1 2 3 4 ], shape: (4,) Series: 'b' [f64] [ 0.5 4.0 10.0 13.0 ], shape: (4,) Series: 'c' [bool] [ true true false true ]] 
 - glimpse( ) None[source]
- glimpse( ) str
- Return a dense preview of the DataFrame. - The formatting shows one line per column so that wide dataframes display cleanly. Each line shows the column name, the data type, and the first few values. - Parameters:
- max_items_per_column
- Maximum number of items to show per column. 
- max_colname_length
- Maximum length of the displayed column names; values that exceed this value are truncated with a trailing ellipsis. 
- return_as_string
- If True, return the preview as a string instead of printing to stdout. 
 
 - Examples - >>> from datetime import date >>> df = pl.DataFrame( ... { ... "a": [1.0, 2.8, 3.0], ... "b": [4, 5, None], ... "c": [True, False, True], ... "d": [None, "b", "c"], ... "e": ["usd", "eur", None], ... "f": [date(2020, 1, 1), date(2021, 1, 2), date(2022, 1, 1)], ... } ... ) >>> df.glimpse() Rows: 3 Columns: 6 $ a <f64> 1.0, 2.8, 3.0 $ b <i64> 4, 5, None $ c <bool> True, False, True $ d <str> None, 'b', 'c' $ e <str> 'usd', 'eur', None $ f <date> 2020-01-01, 2021-01-02, 2022-01-01 
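- The preview can also be captured rather than printed; a one-line sketch using the return_as_string parameter documented above:
>>> preview = df.glimpse(return_as_string=True)  # same text as above, returned as a str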
 - group_by(
- by: IntoExpr | Iterable[IntoExpr],
- *more_by: IntoExpr,
- maintain_order: bool = False,
- Start a group by operation. - Parameters:
- by
- Column(s) to group by. Accepts expression input. Strings are parsed as column names. 
- *more_by
- Additional columns to group by, specified as positional arguments. 
- maintain_order
- Ensure that the order of the groups is consistent with the input data. This is slower than a default group by. Setting this to True blocks the possibility to run on the streaming engine. - Note - Within each group, the order of rows is always preserved, regardless of this argument. 
 
- Returns:
- GroupBy
- Object which can be used to perform aggregations. 
 
 - Examples - Group by one column and call - aggto compute the grouped sum of another column.- >>> df = pl.DataFrame( ... { ... "a": ["a", "b", "a", "b", "c"], ... "b": [1, 2, 1, 3, 3], ... "c": [5, 4, 3, 2, 1], ... } ... ) >>> df.group_by("a").agg(pl.col("b").sum()) shape: (3, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ a ┆ 2 │ │ b ┆ 5 │ │ c ┆ 3 │ └─────┴─────┘ - Set - maintain_order=Trueto ensure the order of the groups is consistent with the input.- >>> df.group_by("a", maintain_order=True).agg(pl.col("c")) shape: (3, 2) ┌─────┬───────────┐ │ a ┆ c │ │ --- ┆ --- │ │ str ┆ list[i64] │ ╞═════╪═══════════╡ │ a ┆ [5, 3] │ │ b ┆ [4, 2] │ │ c ┆ [1] │ └─────┴───────────┘ - Group by multiple columns by passing a list of column names. - >>> df.group_by(["a", "b"]).agg(pl.max("c")) shape: (4, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 1 ┆ 5 │ │ b ┆ 2 ┆ 4 │ │ b ┆ 3 ┆ 2 │ │ c ┆ 3 ┆ 1 │ └─────┴─────┴─────┘ - Or use positional arguments to group by multiple columns in the same way. Expressions are also accepted. - >>> df.group_by("a", pl.col("b") // 2).agg(pl.col("c").mean()) shape: (3, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ f64 │ ╞═════╪═════╪═════╡ │ a ┆ 0 ┆ 4.0 │ │ b ┆ 1 ┆ 3.0 │ │ c ┆ 1 ┆ 1.0 │ └─────┴─────┴─────┘ - The - GroupByobject returned by this method is iterable, returning the name and data of each group.- >>> for name, data in df.group_by("a"): ... print(name) ... print(data) ... a shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 1 ┆ 5 │ │ a ┆ 1 ┆ 3 │ └─────┴─────┴─────┘ b shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ b ┆ 2 ┆ 4 │ │ b ┆ 3 ┆ 2 │ └─────┴─────┴─────┘ c shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ c ┆ 3 ┆ 1 │ └─────┴─────┴─────┘ 
 - group_by_dynamic(
- index_column: IntoExpr,
- *,
- every: str | timedelta,
- period: str | timedelta | None = None,
- offset: str | timedelta | None = None,
- truncate: bool | None = None,
- include_boundaries: bool = False,
- closed: ClosedInterval = 'left',
- label: Label = 'left',
- by: IntoExpr | Iterable[IntoExpr] | None = None,
- start_by: StartBy = 'window',
- check_sorted: bool = True,
- Group based on a time value (or index value of type Int32, Int64). - Time windows are calculated and rows are assigned to windows. Different from a normal group by is that a row can be member of multiple groups. By default, the windows look like: - [start, start + period) 
- [start + every, start + every + period) 
- [start + 2*every, start + 2*every + period) 
- … 
 - where - startis determined by- start_by,- offset, and- every(see parameter descriptions below).- Warning - The index column must be sorted in ascending order. If - byis passed, then the index column must be sorted in ascending order within each group.- Parameters:
- index_column
- Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if by is specified, then it must be sorted in ascending order within each group). - In case of a dynamic group by on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column. 
- every
- interval of the window 
- period
- length of the window, if None it will equal ‘every’ 
- offset
- offset of the window, only takes effect if start_by is 'window'. Defaults to negative every.
- truncate
- truncate the time value to the window lower bound. - Deprecated since version 0.19.4: Use label instead.
- include_boundaries
- Add the lower and upper bound of the window to the “_lower_boundary” and “_upper_boundary” columns. This will impact performance because it’s harder to parallelize 
- closed : {'left', 'right', 'both', 'none'}
- Define which sides of the temporal interval are closed (inclusive). 
- label : {'left', 'right', 'datapoint'}
- Define which label to use for the window: - ‘left’: lower boundary of the window 
- ‘right’: upper boundary of the window 
- ‘datapoint’: the first value of the index column in the given window. If you don’t need the label to be at one of the boundaries, choose this option for maximum performance 
 
- by
- Also group by this column/these columns 
- start_by : {'window', 'datapoint', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday'}
- The strategy to determine the start of the first window by. - 'window': Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.
- ‘datapoint’: Start from the first encountered data point. 
- a day of the week (only takes effect if every contains 'w'): - 'monday': Start the window on the Monday before the first data point. 
- ‘tuesday’: Start the window on the Tuesday before the first data point. 
- … 
- ‘sunday’: Start the window on the Sunday before the first data point. 
 
 
- check_sorted
- When the by argument is given, polars cannot check sortedness by the metadata and has to do a full scan on the index column to verify the data is sorted. This is expensive. If you are sure the data within the by groups is sorted, you can set this to False. Doing so incorrectly will lead to incorrect output.
 
- Returns:
- DynamicGroupBy
- Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if by columns are passed, it will only be sorted within each by group).
 
 - See also - Notes - If you’re coming from pandas, then - # polars df.group_by_dynamic("ts", every="1d").agg(pl.col("value").sum()) - is equivalent to - # pandas df.set_index("ts").resample("D")["value"].sum().reset_index() - though note that, unlike pandas, polars doesn’t add extra rows for empty windows. If you need - index_columnto be evenly spaced, then please combine with- DataFrame.upsample().
- The every, period and offset arguments are created with the following string language: - 1ns (1 nanosecond) 
- 1us (1 microsecond) 
- 1ms (1 millisecond) 
- 1s (1 second) 
- 1m (1 minute) 
- 1h (1 hour) 
- 1d (1 calendar day) 
- 1w (1 calendar week) 
- 1mo (1 calendar month) 
- 1q (1 calendar quarter) 
- 1y (1 calendar year) 
- 1i (1 index count) 
 - Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds - By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”. - In case of a group_by_dynamic on an integer column, the windows are defined by: - “1i” # length 1 
- “10i” # length 10 
 
 - Examples - >>> from datetime import datetime >>> df = pl.DataFrame( ... { ... "time": pl.datetime_range( ... start=datetime(2021, 12, 16), ... end=datetime(2021, 12, 16, 3), ... interval="30m", ... eager=True, ... ), ... "n": range(7), ... } ... ) >>> df shape: (7, 2) ┌─────────────────────┬─────┐ │ time ┆ n │ │ --- ┆ --- │ │ datetime[μs] ┆ i64 │ ╞═════════════════════╪═════╡ │ 2021-12-16 00:00:00 ┆ 0 │ │ 2021-12-16 00:30:00 ┆ 1 │ │ 2021-12-16 01:00:00 ┆ 2 │ │ 2021-12-16 01:30:00 ┆ 3 │ │ 2021-12-16 02:00:00 ┆ 4 │ │ 2021-12-16 02:30:00 ┆ 5 │ │ 2021-12-16 03:00:00 ┆ 6 │ └─────────────────────┴─────┘ - Group by windows of 1 hour starting at 2021-12-16 00:00:00. - >>> df.group_by_dynamic("time", every="1h", closed="right").agg(pl.col("n")) shape: (4, 2) ┌─────────────────────┬───────────┐ │ time ┆ n │ │ --- ┆ --- │ │ datetime[μs] ┆ list[i64] │ ╞═════════════════════╪═══════════╡ │ 2021-12-15 23:00:00 ┆ [0] │ │ 2021-12-16 00:00:00 ┆ [1, 2] │ │ 2021-12-16 01:00:00 ┆ [3, 4] │ │ 2021-12-16 02:00:00 ┆ [5, 6] │ └─────────────────────┴───────────┘ - The window boundaries can also be added to the aggregation result - >>> df.group_by_dynamic( ... "time", every="1h", include_boundaries=True, closed="right" ... ).agg(pl.col("n").mean()) shape: (4, 4) ┌─────────────────────┬─────────────────────┬─────────────────────┬─────┐ │ _lower_boundary ┆ _upper_boundary ┆ time ┆ n │ │ --- ┆ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ f64 │ ╞═════════════════════╪═════════════════════╪═════════════════════╪═════╡ │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 0.0 │ │ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 1.5 │ │ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 3.5 │ │ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 5.5 │ └─────────────────────┴─────────────────────┴─────────────────────┴─────┘ - When closed=”left”, the window excludes the right end of interval: [lower_bound, upper_bound) - >>> df.group_by_dynamic("time", every="1h", closed="left").agg(pl.col("n")) shape: (4, 2) ┌─────────────────────┬───────────┐ │ time ┆ n │ │ --- ┆ --- │ │ datetime[μs] ┆ list[i64] │ ╞═════════════════════╪═══════════╡ │ 2021-12-16 00:00:00 ┆ [0, 1] │ │ 2021-12-16 01:00:00 ┆ [2, 3] │ │ 2021-12-16 02:00:00 ┆ [4, 5] │ │ 2021-12-16 03:00:00 ┆ [6] │ └─────────────────────┴───────────┘ - When closed=”both” the time values at the window boundaries belong to 2 groups. - >>> df.group_by_dynamic("time", every="1h", closed="both").agg(pl.col("n")) shape: (5, 2) ┌─────────────────────┬───────────┐ │ time ┆ n │ │ --- ┆ --- │ │ datetime[μs] ┆ list[i64] │ ╞═════════════════════╪═══════════╡ │ 2021-12-15 23:00:00 ┆ [0] │ │ 2021-12-16 00:00:00 ┆ [0, 1, 2] │ │ 2021-12-16 01:00:00 ┆ [2, 3, 4] │ │ 2021-12-16 02:00:00 ┆ [4, 5, 6] │ │ 2021-12-16 03:00:00 ┆ [6] │ └─────────────────────┴───────────┘ - Dynamic group bys can also be combined with grouping on normal keys - >>> df = df.with_columns(groups=pl.Series(["a", "a", "a", "b", "b", "a", "a"])) >>> df shape: (7, 3) ┌─────────────────────┬─────┬────────┐ │ time ┆ n ┆ groups │ │ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ i64 ┆ str │ ╞═════════════════════╪═════╪════════╡ │ 2021-12-16 00:00:00 ┆ 0 ┆ a │ │ 2021-12-16 00:30:00 ┆ 1 ┆ a │ │ 2021-12-16 01:00:00 ┆ 2 ┆ a │ │ 2021-12-16 01:30:00 ┆ 3 ┆ b │ │ 2021-12-16 02:00:00 ┆ 4 ┆ b │ │ 2021-12-16 02:30:00 ┆ 5 ┆ a │ │ 2021-12-16 03:00:00 ┆ 6 ┆ a │ └─────────────────────┴─────┴────────┘ >>> df.group_by_dynamic( ... "time", ... every="1h", ... closed="both", ... 
by="groups", ... include_boundaries=True, ... ).agg(pl.col("n")) shape: (7, 5) ┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────────┐ │ groups ┆ _lower_boundary ┆ _upper_boundary ┆ time ┆ n │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ list[i64] │ ╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪═══════════╡ │ a ┆ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ [0] │ │ a ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ [0, 1, 2] │ │ a ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ [2] │ │ a ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ [5, 6] │ │ a ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ [6] │ │ b ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ [3, 4] │ │ b ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ [4] │ └────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────────┘ - Dynamic group by on an index column - >>> df = pl.DataFrame( ... { ... "idx": pl.int_range(0, 6, eager=True), ... "A": ["A", "A", "B", "B", "B", "C"], ... } ... ) >>> ( ... df.group_by_dynamic( ... "idx", ... every="2i", ... period="3i", ... include_boundaries=True, ... closed="right", ... ).agg(pl.col("A").alias("A_agg_list")) ... ) shape: (4, 4) ┌─────────────────┬─────────────────┬─────┬─────────────────┐ │ _lower_boundary ┆ _upper_boundary ┆ idx ┆ A_agg_list │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 ┆ list[str] │ ╞═════════════════╪═════════════════╪═════╪═════════════════╡ │ -2 ┆ 1 ┆ -2 ┆ ["A", "A"] │ │ 0 ┆ 3 ┆ 0 ┆ ["A", "B", "B"] │ │ 2 ┆ 5 ┆ 2 ┆ ["B", "B", "C"] │ │ 4 ┆ 7 ┆ 4 ┆ ["C"] │ └─────────────────┴─────────────────┴─────┴─────────────────┘ 
 - group_by_rolling(
- index_column: IntoExpr,
- *,
- period: str | timedelta,
- offset: str | timedelta | None = None,
- closed: ClosedInterval = 'right',
- by: IntoExpr | Iterable[IntoExpr] | None = None,
- check_sorted: bool = True,
- Create rolling groups based on a time, Int32, or Int64 column. - Deprecated since version 0.19.9: This method has been renamed to DataFrame.rolling() (a short sketch of rolling follows the parameter list below). - Parameters:
- index_column
- Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if by is specified, then it must be sorted in ascending order within each group). - In case of a rolling group by on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column. 
- period
- length of the window - must be non-negative 
- offset
- offset of the window. Default is -period 
- closed : {'right', 'left', 'both', 'none'}
- Define which sides of the temporal interval are closed (inclusive). 
- by
- Also group by this column/these columns 
- check_sorted
- When the by argument is given, polars cannot check sortedness by the metadata and has to do a full scan on the index column to verify the data is sorted. This is expensive. If you are sure the data within the by groups is sorted, you can set this to False. Doing so incorrectly will lead to incorrect output.
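- A short sketch of the replacement method, DataFrame.rolling, using an integer index column (data and window size are illustrative):
>>> df = pl.DataFrame({"idx": [0, 1, 2, 3, 4], "val": [10, 20, 30, 40, 50]})
>>> df.rolling(index_column="idx", period="3i").agg(pl.col("val").sum().alias("val_sum"))
With the default closed="right", each output row aggregates the values whose idx lies in the window (idx - 3, idx].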
 
 
 - groupby(
- by: IntoExpr | Iterable[IntoExpr],
- *more_by: IntoExpr,
- maintain_order: bool = False,
- Start a group by operation. - Deprecated since version 0.19.0: This method has been renamed to - DataFrame.group_by().- Parameters:
- by
- Column(s) to group by. Accepts expression input. Strings are parsed as column names. 
- *more_by
- Additional columns to group by, specified as positional arguments. 
- maintain_order
- Ensure that the order of the groups is consistent with the input data. This is slower than a default group by. Setting this to True blocks the possibility to run on the streaming engine. - Note - Within each group, the order of rows is always preserved, regardless of this argument. 
 
- Returns:
- GroupBy
- Object which can be used to perform aggregations. 
 
 
 - groupby_dynamic(
- index_column: IntoExpr,
- *,
- every: str | timedelta,
- period: str | timedelta | None = None,
- offset: str | timedelta | None = None,
- truncate: bool = True,
- include_boundaries: bool = False,
- closed: ClosedInterval = 'left',
- by: IntoExpr | Iterable[IntoExpr] | None = None,
- start_by: StartBy = 'window',
- check_sorted: bool = True,
- Group based on a time value (or index value of type Int32, Int64). - Deprecated since version 0.19.0: This method has been renamed to - DataFrame.group_by_dynamic().- Parameters:
- index_column
- Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if by is specified, then it must be sorted in ascending order within each group). - In case of a dynamic group by on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column. 
- every
- interval of the window 
- period
- length of the window, if None it will equal ‘every’ 
- offset
- offset of the window, only takes effect if start_by is 'window'. Defaults to negative every.
- truncate
- truncate the time value to the window lower bound 
- include_boundaries
- Add the lower and upper bound of the window to the “_lower_bound” and “_upper_bound” columns. This will impact performance because it’s harder to parallelize 
- closed : {'left', 'right', 'both', 'none'}
- Define which sides of the temporal interval are closed (inclusive). 
- by
- Also group by this column/these columns 
- start_by : {'window', 'datapoint', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday'}
- The strategy to determine the start of the first window by. - 'window': Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.
- ‘datapoint’: Start from the first encountered data point. 
- a day of the week (only takes effect if every contains 'w'): - 'monday': Start the window on the Monday before the first data point. 
- ‘tuesday’: Start the window on the Tuesday before the first data point. 
- … 
- ‘sunday’: Start the window on the Sunday before the first data point. 
 
 
- check_sorted
- When the by argument is given, polars cannot check sortedness by the metadata and has to do a full scan on the index column to verify the data is sorted. This is expensive. If you are sure the data within the by groups is sorted, you can set this to False. Doing so incorrectly will lead to incorrect output.
 
- Returns:
- DynamicGroupBy
- Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if by columns are passed, it will only be sorted within each by group).
 
 
 - groupby_rolling(
- index_column: IntoExpr,
- *,
- period: str | timedelta,
- offset: str | timedelta | None = None,
- closed: ClosedInterval = 'right',
- by: IntoExpr | Iterable[IntoExpr] | None = None,
- check_sorted: bool = True,
- Create rolling groups based on a time, Int32, or Int64 column. - Deprecated since version 0.19.0: This method has been renamed to - DataFrame.rolling().- Parameters:
- index_column
- Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if by is specified, then it must be sorted in ascending order within each group). - In case of a rolling group by on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column. 
- period
- length of the window - must be non-negative 
- offset
- offset of the window. Default is -period 
- closed : {'right', 'left', 'both', 'none'}
- Define which sides of the temporal interval are closed (inclusive). 
- by
- Also group by this column/these columns 
- check_sorted
- When the by argument is given, polars cannot check sortedness by the metadata and has to do a full scan on the index column to verify the data is sorted. This is expensive. If you are sure the data within the by groups is sorted, you can set this to False. Doing so incorrectly will lead to incorrect output.
 
 
 - hash_rows( ) Series[source]
- Hash and combine the rows in this DataFrame. - The hash value is of type - UInt64.- Parameters:
- seed
- Random seed parameter. Defaults to 0. 
- seed_1
- Random seed parameter. Defaults to seed if not set.
- seed_2
- Random seed parameter. Defaults to seed if not set.
- seed_3
- Random seed parameter. Defaults to seed if not set.
 
 - Notes - This implementation of - hash_rows()does not guarantee stable results across different Polars versions. Its stability is only guaranteed within a single version.- Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, None, 3, 4], ... "ham": ["a", "b", None, "d"], ... } ... ) >>> df.hash_rows(seed=42) shape: (4,) Series: '' [u64] [ 10783150408545073287 1438741209321515184 10047419486152048166 2047317070637311557 ] 
 - head(n: int = 5) Self[source]
- Get the first n rows. - Parameters:
- n
- Number of rows to return. If a negative value is passed, return all rows except the last - abs(n).
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 4, 5], ... "bar": [6, 7, 8, 9, 10], ... "ham": ["a", "b", "c", "d", "e"], ... } ... ) >>> df.head(3) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘ - Pass a negative value to get all rows - exceptthe last- abs(n).- >>> df.head(-3) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ └─────┴─────┴─────┘ 
 - property height: int[source]
- Get the height of the DataFrame. - Examples - >>> df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]}) >>> df.height 5 
 - hstack(columns: list[Series] | DataFrame, *, in_place: bool = False) Self[source]
- Return a new DataFrame grown horizontally by stacking multiple Series to it. - Parameters:
- columns
- Series to stack. 
- in_place
- Modify in place. 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> x = pl.Series("apple", [10, 20, 30]) >>> df.hstack([x]) shape: (3, 4) ┌─────┬─────┬─────┬───────┐ │ foo ┆ bar ┆ ham ┆ apple │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str ┆ i64 │ ╞═════╪═════╪═════╪═══════╡ │ 1 ┆ 6 ┆ a ┆ 10 │ │ 2 ┆ 7 ┆ b ┆ 20 │ │ 3 ┆ 8 ┆ c ┆ 30 │ └─────┴─────┴─────┴───────┘ 
 - insert_at_idx(index: int, column: Series) Self[source]
- Insert a Series at a certain column index. This operation is in place. - Deprecated since version 0.19.14: This method has been renamed to - insert_column().- Parameters:
- index
- Index at which to insert the new Series column. 
- column
- Series to insert. 
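- As this is a deprecated alias, a minimal migration sketch using the renamed insert_column() (hypothetical frame and series):

>>> df = pl.DataFrame({"foo": [1, 2], "bar": [3, 4]})
>>> df = df.insert_column(1, pl.Series("baz", [9, 8]))  # preferred over the deprecated insert_at_idx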
 
 
 - insert_column(index: int, column: Series) Self[source]
- Insert a Series at a certain column index. - This operation is in place. - Parameters:
- index
- Index at which to insert the new Series column. 
- column
- Series to insert. 
 
 - Examples - >>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> s = pl.Series("baz", [97, 98, 99]) >>> df.insert_column(1, s) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ baz ┆ bar │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ 1 ┆ 97 ┆ 4 │ │ 2 ┆ 98 ┆ 5 │ │ 3 ┆ 99 ┆ 6 │ └─────┴─────┴─────┘ - >>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [0.5, 4, 10, 13], ... "c": [True, True, False, True], ... } ... ) >>> s = pl.Series("d", [-2.5, 15, 20.5, 0]) >>> df.insert_column(3, s) shape: (4, 4) ┌─────┬──────┬───────┬──────┐ │ a ┆ b ┆ c ┆ d │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool ┆ f64 │ ╞═════╪══════╪═══════╪══════╡ │ 1 ┆ 0.5 ┆ true ┆ -2.5 │ │ 2 ┆ 4.0 ┆ true ┆ 15.0 │ │ 3 ┆ 10.0 ┆ false ┆ 20.5 │ │ 4 ┆ 13.0 ┆ true ┆ 0.0 │ └─────┴──────┴───────┴──────┘ 
 - interpolate() DataFrame[source]
- Interpolate intermediate values. The interpolation method is linear. - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, None, 9, 10], ... "bar": [6, 7, 9, None], ... "baz": [1, None, None, 9], ... } ... ) >>> df.interpolate() shape: (4, 3) ┌──────┬──────┬──────────┐ │ foo ┆ bar ┆ baz │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ f64 │ ╞══════╪══════╪══════════╡ │ 1.0 ┆ 6.0 ┆ 1.0 │ │ 5.0 ┆ 7.0 ┆ 3.666667 │ │ 9.0 ┆ 9.0 ┆ 6.333333 │ │ 10.0 ┆ null ┆ 9.0 │ └──────┴──────┴──────────┘ 
 - is_duplicated() Series[source]
- Get a mask of all duplicated rows in this DataFrame. - Examples - >>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 1], ... "b": ["x", "y", "z", "x"], ... } ... ) >>> df.is_duplicated() shape: (4,) Series: '' [bool] [ true false false true ] - This mask can be used to visualize the duplicated lines like this: - >>> df.filter(df.is_duplicated()) shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ str │ ╞═════╪═════╡ │ 1 ┆ x │ │ 1 ┆ x │ └─────┴─────┘ 
 - is_empty() bool[source]
- Check if the dataframe is empty. - Examples - >>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> df.is_empty() False >>> df.filter(pl.col("foo") > 99).is_empty() True 
 - is_unique() Series[source]
- Get a mask of all unique rows in this DataFrame. - Examples - >>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 1], ... "b": ["x", "y", "z", "x"], ... } ... ) >>> df.is_unique() shape: (4,) Series: '' [bool] [ false true true false ] - This mask can be used to visualize the unique lines like this: - >>> df.filter(df.is_unique()) shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ str │ ╞═════╪═════╡ │ 2 ┆ y │ │ 3 ┆ z │ └─────┴─────┘ 
 - item(row: int | None = None, column: int | str | None = None) Any[source]
- Return the DataFrame as a scalar, or return the element at the given row/column. - Parameters:
- row
- Optional row index. 
- column
- Optional column index or name. 
 
 - See also - row
- Get the values of a single row, either by index or by predicate. 
- Notes - If row/col not provided, this is equivalent to df[0,0], with a check that the shape is (1,1). With row/col, this is equivalent to df[row,col]. - Examples - >>> df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) >>> df.select((pl.col("a") * pl.col("b")).sum()).item() 32 >>> df.item(1, 1) 5 >>> df.item(2, "b") 6 
 - iter_columns() Iterator[Series][source]
- Returns an iterator over the DataFrame’s columns. - Returns:
- Iterator of Series.
 
 - Notes - Consider whether you can use - all()instead. If you can, it will be more efficient.- Examples - >>> df = pl.DataFrame( ... { ... "a": [1, 3, 5], ... "b": [2, 4, 6], ... } ... ) >>> [s.name for s in df.iter_columns()] ['a', 'b'] - If you’re using this to modify a dataframe’s columns, e.g. - >>> # Do NOT do this >>> pl.DataFrame(column * 2 for column in df.iter_columns()) shape: (3, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 2 ┆ 4 │ │ 6 ┆ 8 │ │ 10 ┆ 12 │ └─────┴─────┘ - then consider whether you can use - all()instead:- >>> df.select(pl.all() * 2) shape: (3, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 2 ┆ 4 │ │ 6 ┆ 8 │ │ 10 ┆ 12 │ └─────┴─────┘ 
 - iter_rows(
- *,
- named: Literal[False] = False,
- buffer_size: int = 500,
- iter_rows(
- *,
- named: Literal[True],
- buffer_size: int = 500,
- Returns an iterator over the DataFrame of rows of python-native values. - Parameters:
- named
- Return dictionaries instead of tuples. The dictionaries are a mapping of column name to row value. This is more expensive than returning a regular tuple, but allows for accessing values by column name. 
- buffer_size
- Determines the number of rows that are buffered internally while iterating over the data; you should only modify this in very specific cases where the default value is determined not to be a good fit to your access pattern, as the speedup from using the buffer is significant (~2-4x). Setting this value to zero disables row buffering (not recommended). 
 
- Returns:
- iterator of tuples (default) or dictionaries (if named) of python row values
 
 - Warning - Row iteration is not optimal as the underlying data is stored in columnar form; where possible, prefer export via one of the dedicated export/output methods that deals with columnar data. - See also - rows
- Materialises all frame data as a list of rows (potentially expensive). 
- rows_by_key
- Materialises frame data as a key-indexed dictionary. 
 - Notes - If you have - ns-precision temporal values you should be aware that Python natively only supports up to- μs-precision;- ns-precision values will be truncated to microseconds on conversion to Python. If this matters to your use-case you should export to a different format (such as Arrow or NumPy).- Examples - >>> df = pl.DataFrame( ... { ... "a": [1, 3, 5], ... "b": [2, 4, 6], ... } ... ) >>> [row[0] for row in df.iter_rows()] [1, 3, 5] >>> [row["b"] for row in df.iter_rows(named=True)] [2, 4, 6] 
 - iter_slices(
- n_rows: int = 10000,
- Returns a non-copying iterator of slices over the underlying DataFrame. - Parameters:
- n_rows
- Determines the number of rows contained in each DataFrame slice. 
 
 - See also - iter_rows
- Row iterator over frame data (does not materialise all rows). 
- partition_by
- Split into multiple DataFrames, partitioned by groups. 
 - Examples - >>> from datetime import date >>> df = pl.DataFrame( ... data={ ... "a": range(17_500), ... "b": date(2023, 1, 1), ... "c": "klmnoopqrstuvwxyz", ... }, ... schema_overrides={"a": pl.Int32}, ... ) >>> for idx, frame in enumerate(df.iter_slices()): ... print(f"{type(frame).__name__}:[{idx}]:{len(frame)}") ... DataFrame:[0]:10000 DataFrame:[1]:7500 - Using - iter_slicesis an efficient way to chunk-iterate over DataFrames and any supported frame export/conversion types; for example, as RecordBatches:- >>> for frame in df.iter_slices(n_rows=15_000): ... record_batch = frame.to_arrow().to_batches()[0] ... print(f"{record_batch.schema}\n<< {len(record_batch)}") ... a: int32 b: date32[day] c: large_string << 15000 a: int32 b: date32[day] c: large_string << 2500 
 - join(
- other: DataFrame,
- on: str | Expr | Sequence[str | Expr] | None = None,
- how: JoinStrategy = 'inner',
- *,
- left_on: str | Expr | Sequence[str | Expr] | None = None,
- right_on: str | Expr | Sequence[str | Expr] | None = None,
- suffix: str = '_right',
- validate: JoinValidation = 'm:m',
- Join in SQL-like fashion. - Parameters:
- other
- DataFrame to join with. 
- on
- Name(s) of the join columns in both DataFrames. 
- how{‘inner’, ‘left’, ‘outer’, ‘semi’, ‘anti’, ‘cross’}
- Join strategy. - Note - A left join preserves the row order of the left DataFrame. 
- left_on
- Name(s) of the left join column(s). 
- right_on
- Name(s) of the right join column(s). 
- suffix
- Suffix to append to columns with a duplicate name. 
- validate: {‘m:m’, ‘m:1’, ‘1:m’, ‘1:1’}
- Checks that the join is of the specified type. - many_to_many
- “m:m”: default, does not result in checks 
 
- one_to_one
- “1:1”: check if join keys are unique in both left and right datasets 
 
- one_to_many
- “1:m”: check if join keys are unique in left dataset 
 
- many_to_one
- “m:1”: check if join keys are unique in right dataset 
 
- Note - This is currently not supported by the streaming engine. 
- This is only supported when joining on single columns (see the sketch after the examples below). 
 
 
- Returns:
- DataFrame
 
 - See also - Notes - For joining on columns with categorical data, see - pl.StringCache().- Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> other_df = pl.DataFrame( ... { ... "apple": ["x", "y", "z"], ... "ham": ["a", "b", "d"], ... } ... ) >>> df.join(other_df, on="ham") shape: (2, 4) ┌─────┬─────┬─────┬───────┐ │ foo ┆ bar ┆ ham ┆ apple │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str ┆ str │ ╞═════╪═════╪═════╪═══════╡ │ 1 ┆ 6.0 ┆ a ┆ x │ │ 2 ┆ 7.0 ┆ b ┆ y │ └─────┴─────┴─────┴───────┘ - >>> df.join(other_df, on="ham", how="outer") shape: (4, 4) ┌──────┬──────┬─────┬───────┐ │ foo ┆ bar ┆ ham ┆ apple │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str ┆ str │ ╞══════╪══════╪═════╪═══════╡ │ 1 ┆ 6.0 ┆ a ┆ x │ │ 2 ┆ 7.0 ┆ b ┆ y │ │ null ┆ null ┆ d ┆ z │ │ 3 ┆ 8.0 ┆ c ┆ null │ └──────┴──────┴─────┴───────┘ - >>> df.join(other_df, on="ham", how="left") shape: (3, 4) ┌─────┬─────┬─────┬───────┐ │ foo ┆ bar ┆ ham ┆ apple │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str ┆ str │ ╞═════╪═════╪═════╪═══════╡ │ 1 ┆ 6.0 ┆ a ┆ x │ │ 2 ┆ 7.0 ┆ b ┆ y │ │ 3 ┆ 8.0 ┆ c ┆ null │ └─────┴─────┴─────┴───────┘ - >>> df.join(other_df, on="ham", how="semi") shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6.0 ┆ a │ │ 2 ┆ 7.0 ┆ b │ └─────┴─────┴─────┘ - >>> df.join(other_df, on="ham", how="anti") shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞═════╪═════╪═════╡ │ 3 ┆ 8.0 ┆ c │ └─────┴─────┴─────┘ 
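- A hedged sketch of the validate parameter, reusing the frames above; since the ham key is unique in both frames, a one-to-one check should pass (and would raise an error if it did not):

>>> checked = df.join(other_df, on="ham", how="inner", validate="1:1")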
 - join_asof(
- other: DataFrame,
- *,
- left_on: str | None | Expr = None,
- right_on: str | None | Expr = None,
- on: str | None | Expr = None,
- by_left: str | Sequence[str] | None = None,
- by_right: str | Sequence[str] | None = None,
- by: str | Sequence[str] | None = None,
- strategy: AsofJoinStrategy = 'backward',
- suffix: str = '_right',
- tolerance: str | int | float | timedelta | None = None,
- allow_parallel: bool = True,
- force_parallel: bool = False,
- Perform an asof join. - This is similar to a left-join except that we match on nearest key rather than equal keys. - Both DataFrames must be sorted by the asof_join key. - For each row in the left DataFrame: - A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key. 
- A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key. 
- A “nearest” search selects the last row in the right DataFrame whose value is nearest to the left’s key. String keys are not currently supported for a nearest search. 
 - The default is “backward”. - Parameters:
- other
- DataFrame to join with. 
- left_on
- Join column of the left DataFrame. 
- right_on
- Join column of the right DataFrame. 
- on
- Join column of both DataFrames. If set, left_on and right_on should be None. 
- by
- Join on these columns before doing the asof join. 
- by_left
- Join on these columns before doing the asof join. 
- by_right
- Join on these columns before doing the asof join. 
- strategy{‘backward’, ‘forward’, ‘nearest’}
- Join strategy. 
- suffix
- Suffix to append to columns with a duplicate name. 
- tolerance
- Numeric tolerance. By setting this the join will only be done if the near keys are within this distance. If an asof join is done on columns of dtype “Date”, “Datetime”, “Duration” or “Time”, use either a datetime.timedelta object or the following string language: - 1ns (1 nanosecond) 
- 1us (1 microsecond) 
- 1ms (1 millisecond) 
- 1s (1 second) 
- 1m (1 minute) 
- 1h (1 hour) 
- 1d (1 calendar day) 
- 1w (1 calendar week) 
- 1mo (1 calendar month) 
- 1q (1 calendar quarter) 
- 1y (1 calendar year) 
- 1i (1 index count) 
 - Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds - By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”. 
- allow_parallel
- Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel. 
- force_parallel
- Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel. 
 
 - Examples - >>> from datetime import datetime >>> gdp = pl.DataFrame( ... { ... "date": [ ... datetime(2016, 1, 1), ... datetime(2017, 1, 1), ... datetime(2018, 1, 1), ... datetime(2019, 1, 1), ... ], # note record date: Jan 1st (sorted!) ... "gdp": [4164, 4411, 4566, 4696], ... } ... ).set_sorted("date") >>> population = pl.DataFrame( ... { ... "date": [ ... datetime(2016, 5, 12), ... datetime(2017, 5, 12), ... datetime(2018, 5, 12), ... datetime(2019, 5, 12), ... ], # note record date: May 12th (sorted!) ... "population": [82.19, 82.66, 83.12, 83.52], ... } ... ).set_sorted("date") >>> population.join_asof(gdp, on="date", strategy="backward") shape: (4, 3) ┌─────────────────────┬────────────┬──────┐ │ date ┆ population ┆ gdp │ │ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ f64 ┆ i64 │ ╞═════════════════════╪════════════╪══════╡ │ 2016-05-12 00:00:00 ┆ 82.19 ┆ 4164 │ │ 2017-05-12 00:00:00 ┆ 82.66 ┆ 4411 │ │ 2018-05-12 00:00:00 ┆ 83.12 ┆ 4566 │ │ 2019-05-12 00:00:00 ┆ 83.52 ┆ 4696 │ └─────────────────────┴────────────┴──────┘ 
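- A follow-up sketch (assuming the same frames) that additionally restricts backward matches with tolerance; rows whose nearest match lies outside the window would get null values:

>>> within_a_year = population.join_asof(
...     gdp, on="date", strategy="backward", tolerance="365d"
... )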
 - lazy() LazyFrame[source]
- Start a lazy query from this point. This returns a - LazyFrameobject.- Operations on a - LazyFrameare not executed until this is requested by either calling:- .fetch()
- (run on a small number of rows) 
 
- .collect()
- (run on all data) 
 
- .describe_plan()
- (print unoptimized query plan) 
 
- .describe_optimized_plan()
- (print optimized query plan) 
 
- .show_graph()
- (show (un)optimized query plan as graphviz graph) 
 
 - Lazy operations are advised because they allow for query optimization and more parallelization. - Returns:
- LazyFrame
 
 - Examples - >>> df = pl.DataFrame( ... { ... "a": [None, 2, 3, 4], ... "b": [0.5, None, 2.5, 13], ... "c": [True, True, False, None], ... } ... ) >>> df.lazy() <LazyFrame [3 cols, {"a": Int64 … "c": Boolean}] at ...> 
 - limit(n: int = 5) Self[source]
- Get the first n rows. - Alias for DataFrame.head(). - Parameters:
- n
- Number of rows to return. If a negative value is passed, return all rows except the last - abs(n).
 
 - See also 
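- A minimal sketch; since limit is an alias for head, the first n rows are returned:

>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [6, 7, 8]})
>>> df.limit(2).rows()
[(1, 6), (2, 7)]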
 - map_rows(
- function: Callable[[tuple[Any, ...]], Any],
- return_dtype: PolarsDataType | None = None,
- *,
- inference_size: int = 256,
- Apply a custom/user-defined function (UDF) over the rows of the DataFrame. - Warning - This method is much slower than the native expressions API. Only use it if you cannot implement your logic otherwise. - The UDF will receive each row as a tuple of values: - udf(row).- Implementing logic using a Python function is almost always significantly slower and more memory intensive than implementing the same logic using the native expression API because: - The native expression engine runs in Rust; UDFs run in Python. 
- Use of Python UDFs forces the DataFrame to be materialized in memory. 
- Polars-native expressions can be parallelised (UDFs typically cannot). 
- Polars-native expressions can be logically optimised (UDFs cannot). 
 - Wherever possible you should strongly prefer the native expression API to achieve the best performance. - Parameters:
- function
- Custom function or lambda. 
- return_dtype
- Output type of the operation. If none given, Polars tries to infer the type. 
- inference_size
- Only used in the case when the custom function returns rows. This uses the first n rows to determine the output schema. 
 
 - Notes - The frame-level - applycannot track column names (as the UDF is a black-box that may arbitrarily drop, rearrange, transform, or add new columns); if you want to apply a UDF such that column names are preserved, you should use the expression-level- applysyntax instead.
- If your function is expensive and you don’t want it to be called more than once for a given input, consider applying an - @lru_cachedecorator to it. If your data is suitable you may achieve significant speedups.
 - Examples - >>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [-1, 5, 8]}) - Return a DataFrame by mapping each row to a tuple: - >>> df.map_rows(lambda t: (t[0] * 2, t[1] * 3)) shape: (3, 2) ┌──────────┬──────────┐ │ column_0 ┆ column_1 │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════════╪══════════╡ │ 2 ┆ -3 │ │ 4 ┆ 15 │ │ 6 ┆ 24 │ └──────────┴──────────┘ - However, it is much better to implement this with a native expression: - >>> df.select( ... pl.col("foo") * 2, ... pl.col("bar") * 3, ... ) - Return a DataFrame with a single column by mapping each row to a scalar: - >>> df.map_rows(lambda t: (t[0] * 2 + t[1])) shape: (3, 1) ┌───────┐ │ apply │ │ --- │ │ i64 │ ╞═══════╡ │ 1 │ │ 9 │ │ 14 │ └───────┘ - In this case it is better to use the following native expression: - >>> df.select(pl.col("foo") * 2 + pl.col("bar")) 
 - max(axis: Literal[0] = None) Self[source]
- max(axis: Literal[1]) Series
- max(axis: int = 0) Self | Series
- Aggregate the columns of this DataFrame to their maximum value. - Parameters:
- axis
- Either 0 (vertical) or 1 (horizontal). - Deprecated since version 0.19.14: This argument will be removed in a future version. This method will only support vertical aggregation, as if axis were set to 0. To perform horizontal aggregation, use max_horizontal(). 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.max() shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘ 
 - max_horizontal() Series[source]
- Get the maximum value horizontally across columns. - Returns:
- Series
- A Series named - "max".
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [4.0, 5.0, 6.0], ... } ... ) >>> df.max_horizontal() shape: (3,) Series: 'max' [f64] [ 4.0 5.0 6.0 ] 
 - mean(
- *,
- axis: Literal[0] = None,
- null_strategy: NullStrategy = 'ignore',
- mean(*, axis: Literal[1], null_strategy: NullStrategy = 'ignore') Series
- mean(*, axis: int, null_strategy: NullStrategy = 'ignore') Self | Series
- Aggregate the columns of this DataFrame to their mean value. - Parameters:
- axis
- Either 0 (vertical) or 1 (horizontal). - Deprecated since version 0.19.14: This argument will be removed in a future version. This method will only support vertical aggregation, as if axis were set to 0. To perform horizontal aggregation, use mean_horizontal(). 
- null_strategy{‘ignore’, ‘propagate’}
- This argument is only used if - axis == 1.- Deprecated since version 0.19.14: This argument will be removed in a future version. 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... "spam": [True, False, None], ... } ... ) >>> df.mean() shape: (1, 4) ┌─────┬─────┬──────┬──────┐ │ foo ┆ bar ┆ ham ┆ spam │ │ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str ┆ f64 │ ╞═════╪═════╪══════╪══════╡ │ 2.0 ┆ 7.0 ┆ null ┆ 0.5 │ └─────┴─────┴──────┴──────┘ 
 - mean_horizontal(*, ignore_nulls: bool = True) Series[source]
- Take the mean of all values horizontally across columns. - Parameters:
- ignore_nulls
- Ignore null values (default). If set to - False, any null value in the input will lead to a null output.
 
- Returns:
- Series
- A Series named - "mean".
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [4.0, 5.0, 6.0], ... } ... ) >>> df.mean_horizontal() shape: (3,) Series: 'mean' [f64] [ 2.5 3.5 4.5 ] 
 - median() Self[source]
- Aggregate the columns of this DataFrame to their median value. - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.median() shape: (1, 3) ┌─────┬─────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str │ ╞═════╪═════╪══════╡ │ 2.0 ┆ 7.0 ┆ null │ └─────┴─────┴──────┘ 
 - melt(
- id_vars: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None,
- value_vars: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None,
- variable_name: str | None = None,
- value_name: str | None = None,
- Unpivot a DataFrame from wide to long format. - Optionally leaves identifiers set. - This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars) while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis leaving just two non-identifier columns, ‘variable’ and ‘value’. - Parameters:
- id_vars
- Column(s) or selector(s) to use as identifier variables. 
- value_vars
- Column(s) or selector(s) to use as values variables; if value_vars is empty, all columns that are not in id_vars will be used. 
- variable_name
- Name to give to the variable column. Defaults to “variable”. 
- value_name
- Name to give to the value column. Defaults to “value”. 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "a": ["x", "y", "z"], ... "b": [1, 3, 5], ... "c": [2, 4, 6], ... } ... ) >>> import polars.selectors as cs >>> df.melt(id_vars="a", value_vars=cs.numeric()) shape: (6, 3) ┌─────┬──────────┬───────┐ │ a ┆ variable ┆ value │ │ --- ┆ --- ┆ --- │ │ str ┆ str ┆ i64 │ ╞═════╪══════════╪═══════╡ │ x ┆ b ┆ 1 │ │ y ┆ b ┆ 3 │ │ z ┆ b ┆ 5 │ │ x ┆ c ┆ 2 │ │ y ┆ c ┆ 4 │ │ z ┆ c ┆ 6 │ └─────┴──────────┴───────┘ 
 - merge_sorted(
- other: DataFrame,
- key: str,
- Take two sorted DataFrames and merge them by the sorted key. - The output of this operation will also be sorted. It is the caller's responsibility to ensure that the frames are sorted by that key, otherwise the output will not make sense. - The schemas of both DataFrames must be equal. - Parameters:
- other
- Other DataFrame that must be merged 
- key
- Key that is sorted. 
 
 - Examples - >>> df0 = pl.DataFrame( ... {"name": ["steve", "elise", "bob"], "age": [42, 44, 18]} ... ).sort("age") >>> df0 shape: (3, 2) ┌───────┬─────┐ │ name ┆ age │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═══════╪═════╡ │ bob ┆ 18 │ │ steve ┆ 42 │ │ elise ┆ 44 │ └───────┴─────┘ >>> df1 = pl.DataFrame( ... {"name": ["anna", "megan", "steve", "thomas"], "age": [21, 33, 42, 20]} ... ).sort("age") >>> df1 shape: (4, 2) ┌────────┬─────┐ │ name ┆ age │ │ --- ┆ --- │ │ str ┆ i64 │ ╞════════╪═════╡ │ thomas ┆ 20 │ │ anna ┆ 21 │ │ megan ┆ 33 │ │ steve ┆ 42 │ └────────┴─────┘ >>> df0.merge_sorted(df1, key="age") shape: (7, 2) ┌────────┬─────┐ │ name ┆ age │ │ --- ┆ --- │ │ str ┆ i64 │ ╞════════╪═════╡ │ bob ┆ 18 │ │ thomas ┆ 20 │ │ anna ┆ 21 │ │ megan ┆ 33 │ │ steve ┆ 42 │ │ steve ┆ 42 │ │ elise ┆ 44 │ └────────┴─────┘ 
 - min(axis: Literal[0] | None = None) Self[source]
- min(axis: Literal[1]) Series
- min(axis: int) Self | Series
- Aggregate the columns of this DataFrame to their minimum value. - Parameters:
- axis
- Either 0 (vertical) or 1 (horizontal). - Deprecated since version 0.19.14: This argument will be removed in a future version. This method will only support vertical aggregation, as if axis were set to 0. To perform horizontal aggregation, use min_horizontal(). 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.min() shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ └─────┴─────┴─────┘ 
 - min_horizontal() Series[source]
- Get the minimum value horizontally across columns. - Returns:
- Series
- A Series named - "min".
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [4.0, 5.0, 6.0], ... } ... ) >>> df.min_horizontal() shape: (3,) Series: 'min' [f64] [ 1.0 2.0 3.0 ] 
 - n_chunks(strategy: Literal['first'] = 'first') int[source]
- n_chunks(strategy: Literal['all']) list[int]
- Get number of chunks used by the ChunkedArrays of this DataFrame. - Parameters:
- strategy{‘first’, ‘all’}
- Return the number of chunks of the ‘first’ column, or ‘all’ columns in this DataFrame. 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [0.5, 4, 10, 13], ... "c": [True, True, False, True], ... } ... ) >>> df.n_chunks() 1 >>> df.n_chunks(strategy="all") [1, 1, 1] 
 - n_unique(subset: str | Expr | Sequence[str | Expr] | None = None) int[source]
- Return the number of unique rows, or the number of unique row-subsets. - Parameters:
- subset
- One or more columns/expressions that define what to count; omit to return the count of unique rows. 
 
 - Notes - This method operates at the - DataFramelevel; to operate on subsets at the expression level you can make use of struct-packing instead, for example:- >>> expr_unique_subset = pl.struct(["a", "b"]).n_unique() - If instead you want to count the number of unique values per-column, you can also use expression-level syntax to return a new frame containing that result: - >>> df = pl.DataFrame([[1, 2, 3], [1, 2, 4]], schema=["a", "b", "c"]) >>> df_nunique = df.select(pl.all().n_unique()) - In aggregate context there is also an equivalent method for returning the unique values per-group: - >>> df_agg_nunique = df.group_by(by=["a"]).n_unique() - Examples - >>> df = pl.DataFrame( ... { ... "a": [1, 1, 2, 3, 4, 5], ... "b": [0.5, 0.5, 1.0, 2.0, 3.0, 3.0], ... "c": [True, True, True, False, True, True], ... } ... ) >>> df.n_unique() 5 - Simple columns subset. - >>> df.n_unique(subset=["b", "c"]) 4 - Expression subset. - >>> df.n_unique( ... subset=[ ... (pl.col("a") // 2), ... (pl.col("c") | (pl.col("b") >= 2)), ... ], ... ) 3 
 - null_count() Self[source]
- Create a new DataFrame that shows the null counts per column. - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, None, 3], ... "bar": [6, 7, None], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.null_count() shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ u32 ┆ u32 ┆ u32 │ ╞═════╪═════╪═════╡ │ 1 ┆ 1 ┆ 0 │ └─────┴─────┴─────┘ 
 - partition_by(
- by: ColumnNameOrSelector | Sequence[ColumnNameOrSelector],
- *more_by: str,
- maintain_order: bool = True,
- include_key: bool = True,
- as_dict: Literal[False] = False,
- partition_by(
- by: ColumnNameOrSelector | Sequence[ColumnNameOrSelector],
- *more_by: str,
- maintain_order: bool = True,
- include_key: bool = True,
- as_dict: Literal[True],
- Group by the given columns and return the groups as separate dataframes. - Parameters:
- by
- Column name(s) or selector(s) to group by. 
- *more_by
- Additional names of columns to group by, specified as positional arguments. 
- maintain_order
- Ensure that the order of the groups is consistent with the input data. This is slower than a default partition by operation. 
- include_key
- Include the columns used to partition the DataFrame in the output. 
- as_dict
- Return a dictionary instead of a list. The dictionary keys are the distinct group values that identify that group. 
 
 - Examples - Pass a single column name to partition by that column. - >>> df = pl.DataFrame( ... { ... "a": ["a", "b", "a", "b", "c"], ... "b": [1, 2, 1, 3, 3], ... "c": [5, 4, 3, 2, 1], ... } ... ) >>> df.partition_by("a") [shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 1 ┆ 5 │ │ a ┆ 1 ┆ 3 │ └─────┴─────┴─────┘, shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ b ┆ 2 ┆ 4 │ │ b ┆ 3 ┆ 2 │ └─────┴─────┴─────┘, shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ c ┆ 3 ┆ 1 │ └─────┴─────┴─────┘] - Partition by multiple columns by either passing a list of column names, or by specifying each column name as a positional argument. - >>> df.partition_by("a", "b") [shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 1 ┆ 5 │ │ a ┆ 1 ┆ 3 │ └─────┴─────┴─────┘, shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ b ┆ 2 ┆ 4 │ └─────┴─────┴─────┘, shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ b ┆ 3 ┆ 2 │ └─────┴─────┴─────┘, shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ c ┆ 3 ┆ 1 │ └─────┴─────┴─────┘] - Return the partitions as a dictionary by specifying - as_dict=True.- >>> import polars.selectors as cs >>> df.partition_by(cs.string(), as_dict=True) {'a': shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 1 ┆ 5 │ │ a ┆ 1 ┆ 3 │ └─────┴─────┴─────┘, 'b': shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ b ┆ 2 ┆ 4 │ │ b ┆ 3 ┆ 2 │ └─────┴─────┴─────┘, 'c': shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ c ┆ 3 ┆ 1 │ └─────┴─────┴─────┘} 
 - pipe(
- function: Callable[Concatenate[DataFrame, P], T],
- *args: P.args,
- **kwargs: P.kwargs,
- Offers a structured way to apply a sequence of user-defined functions (UDFs). - Parameters:
- function
- Callable; will receive the frame as the first parameter, followed by any given args/kwargs. 
- *args
- Arguments to pass to the UDF. 
- **kwargs
- Keyword arguments to pass to the UDF. 
 
 - Notes - It is recommended to use LazyFrame when piping operations, in order to fully take advantage of query optimization and parallelization. See - df.lazy().- Examples - >>> def cast_str_to_int(data, col_name): ... return data.with_columns(pl.col(col_name).cast(pl.Int64)) ... >>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": ["10", "20", "30", "40"]}) >>> df.pipe(cast_str_to_int, col_name="b") shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 10 │ │ 2 ┆ 20 │ │ 3 ┆ 30 │ │ 4 ┆ 40 │ └─────┴─────┘ - >>> df = pl.DataFrame({"b": [1, 2], "a": [3, 4]}) >>> df shape: (2, 2) ┌─────┬─────┐ │ b ┆ a │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 3 │ │ 2 ┆ 4 │ └─────┴─────┘ >>> df.pipe(lambda tdf: tdf.select(sorted(tdf.columns))) shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 3 ┆ 1 │ │ 4 ┆ 2 │ └─────┴─────┘ 
 - pivot(
- values: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None,
- index: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None,
- columns: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None,
- aggregate_function: PivotAgg | Expr | None = None,
- *,
- maintain_order: bool = True,
- sort_columns: bool = False,
- separator: str = '_',
- Create a spreadsheet-style pivot table as a DataFrame. - Only available in eager mode. See “Examples” section below for how to do a “lazy pivot” if you know the unique column values in advance. - Parameters:
- values
- Column values to aggregate. Can be multiple columns if the columns argument also contains multiple columns. 
- index
- One or multiple keys to group by. 
- columns
- Name of the column(s) whose values will be used as the header of the output DataFrame. 
- aggregate_function
- Choose from: - None: no aggregation takes place, will raise error if multiple values are in group. 
- A predefined aggregate function string, one of {‘first’, ‘sum’, ‘max’, ‘min’, ‘mean’, ‘median’, ‘last’, ‘count’} 
- An expression to do the aggregation. 
 
- maintain_order
- Sort the grouped keys so that the output order is predictable. 
- sort_columns
- Sort the transposed columns by name. Default is by order of discovery. 
- separator
- Used as separator/delimiter in generated column names. 
 
- Returns:
- DataFrame
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": ["one", "one", "two", "two", "one", "two"], ... "bar": ["y", "y", "y", "x", "x", "x"], ... "baz": [1, 2, 3, 4, 5, 6], ... } ... ) >>> df.pivot(values="baz", index="foo", columns="bar", aggregate_function="sum") shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ y ┆ x │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ one ┆ 3 ┆ 5 │ │ two ┆ 3 ┆ 10 │ └─────┴─────┴─────┘ - Pivot using selectors to determine the index/values/columns: - >>> import polars.selectors as cs >>> df.pivot( ... values=cs.numeric(), ... index=cs.string(), ... columns=cs.string(), ... aggregate_function="sum", ... sort_columns=True, ... ).sort( ... by=cs.string(), ... ) shape: (4, 6) ┌─────┬─────┬──────┬──────┬──────┬──────┐ │ foo ┆ bar ┆ one ┆ two ┆ x ┆ y │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪══════╪══════╪══════╪══════╡ │ one ┆ x ┆ 5 ┆ null ┆ 5 ┆ null │ │ one ┆ y ┆ 3 ┆ null ┆ null ┆ 3 │ │ two ┆ x ┆ null ┆ 10 ┆ 10 ┆ null │ │ two ┆ y ┆ null ┆ 3 ┆ null ┆ 3 │ └─────┴─────┴──────┴──────┴──────┴──────┘ - Run an expression as aggregation function - >>> df = pl.DataFrame( ... { ... "col1": ["a", "a", "a", "b", "b", "b"], ... "col2": ["x", "x", "x", "x", "y", "y"], ... "col3": [6, 7, 3, 2, 5, 7], ... } ... ) >>> df.pivot( ... index="col1", ... columns="col2", ... values="col3", ... aggregate_function=pl.element().tanh().mean(), ... ) shape: (2, 3) ┌──────┬──────────┬──────────┐ │ col1 ┆ x ┆ y │ │ --- ┆ --- ┆ --- │ │ str ┆ f64 ┆ f64 │ ╞══════╪══════════╪══════════╡ │ a ┆ 0.998347 ┆ null │ │ b ┆ 0.964028 ┆ 0.999954 │ └──────┴──────────┴──────────┘ - Note that - pivotis only available in eager mode. If you know the unique column values in advance, you can use- polars.LazyFrame.groupby()to get the same result as above in lazy mode:- >>> index = pl.col("col1") >>> columns = pl.col("col2") >>> values = pl.col("col3") >>> unique_column_values = ["x", "y"] >>> aggregate_function = lambda col: col.tanh().mean() >>> ( ... df.lazy() ... .group_by(index) ... .agg( ... *[ ... aggregate_function(values.filter(columns == value)).alias(value) ... for value in unique_column_values ... ] ... ) ... .collect() ... ) shape: (2, 3) ┌──────┬──────────┬──────────┐ │ col1 ┆ x ┆ y │ │ --- ┆ --- ┆ --- │ │ str ┆ f64 ┆ f64 │ ╞══════╪══════════╪══════════╡ │ a ┆ 0.998347 ┆ null │ │ b ┆ 0.964028 ┆ 0.999954 │ └──────┴──────────┴──────────┘ 
 - product() DataFrame[source]
- Aggregate the columns of this DataFrame to their product values. - Examples - >>> df = pl.DataFrame( ... { ... "a": [1, 2, 3], ... "b": [0.5, 4, 10], ... "c": [True, True, False], ... } ... ) - >>> df.product() shape: (1, 3) ┌─────┬──────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ i64 │ ╞═════╪══════╪═════╡ │ 6 ┆ 20.0 ┆ 0 │ └─────┴──────┴─────┘ 
 - quantile(
- quantile: float,
- interpolation: RollingInterpolationMethod = 'nearest',
- Aggregate the columns of this DataFrame to their quantile value. - Parameters:
- quantile
- Quantile between 0.0 and 1.0. 
- interpolation{‘nearest’, ‘higher’, ‘lower’, ‘midpoint’, ‘linear’}
- Interpolation method. 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.quantile(0.5, "nearest") shape: (1, 3) ┌─────┬─────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str │ ╞═════╪═════╪══════╡ │ 2.0 ┆ 7.0 ┆ null │ └─────┴─────┴──────┘ 
 - rechunk() Self[source]
- Rechunk the data in this DataFrame to a contiguous allocation. - This will make sure all subsequent operations have optimal and predictable performance. 
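- A minimal sketch of the effect on chunk counts; vstack is assumed to append the other frame's chunks without copying, so the stacked frame starts with two chunks:

>>> df1 = pl.DataFrame({"a": [1, 2]})
>>> df2 = pl.DataFrame({"a": [3, 4]})
>>> stacked = df1.vstack(df2)
>>> stacked.n_chunks()
2
>>> stacked.rechunk().n_chunks()
1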
 - rename(mapping: dict[str, str]) DataFrame[source]
- Rename column names. - Parameters:
- mapping
- Key value pairs that map from old name to new name. 
 
 - Examples - >>> df = pl.DataFrame( ... {"foo": [1, 2, 3], "bar": [6, 7, 8], "ham": ["a", "b", "c"]} ... ) >>> df.rename({"foo": "apple"}) shape: (3, 3) ┌───────┬─────┬─────┐ │ apple ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═══════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ └───────┴─────┴─────┘ 
 - replace(column: str, new_column: Series) Self[source]
- Replace a column by a new Series. - Parameters:
- column
- Column to replace. 
- new_column
- New column to insert. 
 
 - Examples - >>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> s = pl.Series([10, 20, 30]) >>> df.replace("foo", s) # works in-place! shape: (3, 2) ┌─────┬─────┐ │ foo ┆ bar │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 10 ┆ 4 │ │ 20 ┆ 5 │ │ 30 ┆ 6 │ └─────┴─────┘ 
 - replace_at_idx(index: int, new_column: Series) Self[source]
- Replace a column at an index location. - Deprecated since version 0.19.14: This method has been renamed to - replace_column().- Parameters:
- index
- Column index. 
- new_column
- Series that will replace the column. 
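- As this is a deprecated alias, a minimal migration sketch using the renamed replace_column() (hypothetical frame and series):

>>> df = pl.DataFrame({"foo": [1, 2], "bar": [3, 4]})
>>> df = df.replace_column(0, pl.Series("apple", [10, 20]))  # preferred over the deprecated replace_at_idx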
 
 
 - replace_column(index: int, column: Series) Self[source]
- Replace a column at an index location. - This operation is in place. - Parameters:
- index
- Column index. 
- column
- Series that will replace the column. 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> s = pl.Series("apple", [10, 20, 30]) >>> df.replace_column(0, s) shape: (3, 3) ┌───────┬─────┬─────┐ │ apple ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═══════╪═════╪═════╡ │ 10 ┆ 6 ┆ a │ │ 20 ┆ 7 ┆ b │ │ 30 ┆ 8 ┆ c │ └───────┴─────┴─────┘ 
 - reverse() DataFrame[source]
- Reverse the DataFrame. - Examples - >>> df = pl.DataFrame( ... { ... "key": ["a", "b", "c"], ... "val": [1, 2, 3], ... } ... ) >>> df.reverse() shape: (3, 2) ┌─────┬─────┐ │ key ┆ val │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ c ┆ 3 │ │ b ┆ 2 │ │ a ┆ 1 │ └─────┴─────┘ 
 - rolling(
- index_column: IntoExpr,
- *,
- period: str | timedelta,
- offset: str | timedelta | None = None,
- closed: ClosedInterval = 'right',
- by: IntoExpr | Iterable[IntoExpr] | None = None,
- check_sorted: bool = True,
- Create rolling groups based on a time, Int32, or Int64 column. - Different from group_by_dynamic, the windows are determined by the individual values and are not of constant intervals. For constant intervals use DataFrame.group_by_dynamic(). - If you have a time series <t_0, t_1, ..., t_n>, then by default the windows created will be - (t_0 - period, t_0] 
- (t_1 - period, t_1] 
- … 
- (t_n - period, t_n] 
 - whereas if you pass a non-default - offset, then the windows will be- (t_0 + offset, t_0 + offset + period] 
- (t_1 + offset, t_1 + offset + period] 
- … 
- (t_n + offset, t_n + offset + period] 
 - The - periodand- offsetarguments are created either from a timedelta, or by using the following string language:- 1ns (1 nanosecond) 
- 1us (1 microsecond) 
- 1ms (1 millisecond) 
- 1s (1 second) 
- 1m (1 minute) 
- 1h (1 hour) 
- 1d (1 calendar day) 
- 1w (1 calendar week) 
- 1mo (1 calendar month) 
- 1q (1 calendar quarter) 
- 1y (1 calendar year) 
- 1i (1 index count) 
 - Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds - By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”. - In case of a rolling operation on an integer column, the windows are defined by: - “1i” # length 1 
- “10i” # length 10 
 - Parameters:
- index_column
- Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if by is specified, then it must be sorted in ascending order within each group). - In case of a rolling operation on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column. 
- period
- Length of the window; must be non-negative. 
- offset
- Offset of the window. Default is -period. 
- closed{‘right’, ‘left’, ‘both’, ‘none’}
- Define which sides of the temporal interval are closed (inclusive). 
- by
- Also group by this column/these columns 
- check_sorted
- When the by argument is given, polars cannot check sortedness by the metadata and has to do a full scan on the index column to verify that the data is sorted. This is expensive. If you are sure the data within the by groups is sorted, you can set this to False. Doing so incorrectly will lead to incorrect output. 
 
- Returns:
- RollingGroupBy
- Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if by columns are passed, it will only be sorted within each by group). 
 
 - See also - Examples - >>> dates = [ ... "2020-01-01 13:45:48", ... "2020-01-01 16:42:13", ... "2020-01-01 16:45:09", ... "2020-01-02 18:12:48", ... "2020-01-03 19:45:32", ... "2020-01-08 23:16:43", ... ] >>> df = pl.DataFrame({"dt": dates, "a": [3, 7, 5, 9, 2, 1]}).with_columns( ... pl.col("dt").str.strptime(pl.Datetime).set_sorted() ... ) >>> out = df.rolling(index_column="dt", period="2d").agg( ... [ ... pl.sum("a").alias("sum_a"), ... pl.min("a").alias("min_a"), ... pl.max("a").alias("max_a"), ... ] ... ) >>> assert out["sum_a"].to_list() == [3, 10, 15, 24, 11, 1] >>> assert out["max_a"].to_list() == [3, 7, 7, 9, 9, 1] >>> assert out["min_a"].to_list() == [3, 3, 3, 3, 2, 1] >>> out shape: (6, 4) ┌─────────────────────┬───────┬───────┬───────┐ │ dt ┆ sum_a ┆ min_a ┆ max_a │ │ --- ┆ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │ ╞═════════════════════╪═══════╪═══════╪═══════╡ │ 2020-01-01 13:45:48 ┆ 3 ┆ 3 ┆ 3 │ │ 2020-01-01 16:42:13 ┆ 10 ┆ 3 ┆ 7 │ │ 2020-01-01 16:45:09 ┆ 15 ┆ 3 ┆ 7 │ │ 2020-01-02 18:12:48 ┆ 24 ┆ 3 ┆ 9 │ │ 2020-01-03 19:45:32 ┆ 11 ┆ 2 ┆ 9 │ │ 2020-01-08 23:16:43 ┆ 1 ┆ 1 ┆ 1 │ └─────────────────────┴───────┴───────┴───────┘ 
 - row( ) tuple[Any, ...][source]
- row( ) dict[str, Any]
- Get the values of a single row, either by index or by predicate. - Parameters:
- index
- Row index. 
- by_predicate
- Select the row according to a given expression/predicate. 
- named
- Return a dictionary instead of a tuple. The dictionary is a mapping of column name to row value. This is more expensive than returning a regular tuple, but allows for accessing values by column name. 
 
- Returns:
- tuple (default) or dictionary of row values
 
 - Warning - You should NEVER use this method to iterate over a DataFrame; if you require row-iteration you should strongly prefer use of - iter_rows()instead.- See also - Notes - The - indexand- by_predicateparams are mutually exclusive. Additionally, to ensure clarity, the- by_predicateparameter must be supplied by keyword.- When using - by_predicateit is an error condition if anything other than one row is returned; more than one row raises- TooManyRowsReturnedError, and zero rows will raise- NoRowsReturnedError(both inherit from- RowsError).- Examples - Specify an index to return the row at the given index as a tuple. - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.row(2) (3, 8, 'c') - Specify - named=Trueto get a dictionary instead with a mapping of column names to row values.- >>> df.row(2, named=True) {'foo': 3, 'bar': 8, 'ham': 'c'} - Use - by_predicateto return the row that matches the given predicate.- >>> df.row(by_predicate=(pl.col("ham") == "b")) (2, 7, 'b') 
 - rows(*, named: Literal[False] = False) list[tuple[Any, ...]][source]
- rows(*, named: Literal[True]) list[dict[str, Any]]
- Returns all data in the DataFrame as a list of rows of python-native values. - Parameters:
- named
- Return dictionaries instead of tuples. The dictionaries are a mapping of column name to row value. This is more expensive than returning a regular tuple, but allows for accessing values by column name. 
 
- Returns:
- list of tuples (default) or dictionaries of row values
 
- Warning - Row-iteration is not optimal as the underlying data is stored in columnar form; where possible, prefer export via one of the dedicated export/output methods. You should also consider using iter_rows instead to avoid materialising all the data at once. - See also - iter_rows
- Row iterator over frame data (does not materialise all rows). 
- rows_by_key
- Materialises frame data as a key-indexed dictionary. 
 - Notes - If you have - ns-precision temporal values you should be aware that Python natively only supports up to- μs-precision;- ns-precision values will be truncated to microseconds on conversion to Python. If this matters to your use-case you should export to a different format (such as Arrow or NumPy).- Examples - >>> df = pl.DataFrame( ... { ... "x": ["a", "b", "b", "a"], ... "y": [1, 2, 3, 4], ... "z": [0, 3, 6, 9], ... } ... ) >>> df.rows() [('a', 1, 0), ('b', 2, 3), ('b', 3, 6), ('a', 4, 9)] >>> df.rows(named=True) [{'x': 'a', 'y': 1, 'z': 0}, {'x': 'b', 'y': 2, 'z': 3}, {'x': 'b', 'y': 3, 'z': 6}, {'x': 'a', 'y': 4, 'z': 9}] 
 - rows_by_key(
- key: ColumnNameOrSelector | Sequence[ColumnNameOrSelector],
- *,
- named: bool = False,
- include_key: bool = False,
- unique: bool = False,
- Returns DataFrame data as a keyed dictionary of python-native values. - Note that this method should not be used in place of native operations, due to the high cost of materialising all frame data out into a dictionary; it should be used only when you need to move the values out into a Python data structure or other object that cannot operate directly with Polars/Arrow. - Parameters:
- key
- The column(s) to use as the key for the returned dictionary. If multiple columns are specified, the key will be a tuple of those values, otherwise it will be a string. 
- named
- Return dictionary rows instead of tuples, mapping column name to row value. 
- include_key
- Include key values inline with the associated data (by default the key values are omitted as a memory/performance optimisation, as they can be reconstructed from the key). 
- unique
- Indicate that the key is unique; this will result in a 1:1 mapping from key to a single associated row. Note that if the key is not actually unique the last row with the given key will be returned. 
 
 - See also - Notes - If you have - ns-precision temporal values you should be aware that Python natively only supports up to- μs-precision;- ns-precision values will be truncated to microseconds on conversion to Python. If this matters to your use-case you should export to a different format (such as Arrow or NumPy).- Examples - >>> df = pl.DataFrame( ... { ... "w": ["a", "b", "b", "a"], ... "x": ["q", "q", "q", "k"], ... "y": [1.0, 2.5, 3.0, 4.5], ... "z": [9, 8, 7, 6], ... } ... ) - Group rows by the given key column(s): - >>> df.rows_by_key(key=["w"]) defaultdict(<class 'list'>, {'a': [('q', 1.0, 9), ('k', 4.5, 6)], 'b': [('q', 2.5, 8), ('q', 3.0, 7)]}) - Return the same row groupings as dictionaries: - >>> df.rows_by_key(key=["w"], named=True) defaultdict(<class 'list'>, {'a': [{'x': 'q', 'y': 1.0, 'z': 9}, {'x': 'k', 'y': 4.5, 'z': 6}], 'b': [{'x': 'q', 'y': 2.5, 'z': 8}, {'x': 'q', 'y': 3.0, 'z': 7}]}) - Return row groupings, assuming keys are unique: - >>> df.rows_by_key(key=["z"], unique=True) {9: ('a', 'q', 1.0), 8: ('b', 'q', 2.5), 7: ('b', 'q', 3.0), 6: ('a', 'k', 4.5)} - Return row groupings as dictionaries, assuming keys are unique: - >>> df.rows_by_key(key=["z"], named=True, unique=True) {9: {'w': 'a', 'x': 'q', 'y': 1.0}, 8: {'w': 'b', 'x': 'q', 'y': 2.5}, 7: {'w': 'b', 'x': 'q', 'y': 3.0}, 6: {'w': 'a', 'x': 'k', 'y': 4.5}} - Return dictionary rows grouped by a compound key, including key values: - >>> df.rows_by_key(key=["w", "x"], named=True, include_key=True) defaultdict(<class 'list'>, {('a', 'q'): [{'w': 'a', 'x': 'q', 'y': 1.0, 'z': 9}], ('b', 'q'): [{'w': 'b', 'x': 'q', 'y': 2.5, 'z': 8}, {'w': 'b', 'x': 'q', 'y': 3.0, 'z': 7}], ('a', 'k'): [{'w': 'a', 'x': 'k', 'y': 4.5, 'z': 6}]}) 
 - sample(
- n: int | Series | None = None,
- *,
- fraction: float | Series | None = None,
- with_replacement: bool = False,
- shuffle: bool = False,
- seed: int | None = None,
- Sample from this DataFrame. - Parameters:
- n
- Number of items to return. Cannot be used with fraction. Defaults to 1 if fraction is None. 
- fraction
- Fraction of items to return. Cannot be used with - n.
- with_replacement
- Allow values to be sampled more than once. 
- shuffle
- If set to True, the order of the sampled rows will be shuffled. If set to False (default), the order of the returned rows will be neither stable nor fully random. 
- seed
- Seed for the random number generator. If set to None (default), a random seed is generated for each sample operation. 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.sample(n=2, seed=0) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 3 ┆ 8 ┆ c │ │ 2 ┆ 7 ┆ b │ └─────┴─────┴─────┘ 
 - property schema: SchemaDict[source]
- Get a dict[column name, DataType]. - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.schema OrderedDict({'foo': Int64, 'bar': Float64, 'ham': Utf8}) 
 - select(
- *exprs: IntoExpr | Iterable[IntoExpr],
- **named_exprs: IntoExpr,
- Select columns from this DataFrame. - Parameters:
- *exprs
- Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals. 
- **named_exprs
- Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used. 
 
 - Examples - Pass the name of a column to select that column. - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.select("foo") shape: (3, 1) ┌─────┐ │ foo │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ │ 2 │ │ 3 │ └─────┘ - Multiple columns can be selected by passing a list of column names. - >>> df.select(["foo", "bar"]) shape: (3, 2) ┌─────┬─────┐ │ foo ┆ bar │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 6 │ │ 2 ┆ 7 │ │ 3 ┆ 8 │ └─────┴─────┘ - Multiple columns can also be selected using positional arguments instead of a list. Expressions are also accepted. - >>> df.select(pl.col("foo"), pl.col("bar") + 1) shape: (3, 2) ┌─────┬─────┐ │ foo ┆ bar │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 7 │ │ 2 ┆ 8 │ │ 3 ┆ 9 │ └─────┴─────┘ - Use keyword arguments to easily name your expression inputs. - >>> df.select(threshold=pl.when(pl.col("foo") > 2).then(10).otherwise(0)) shape: (3, 1) ┌───────────┐ │ threshold │ │ --- │ │ i32 │ ╞═══════════╡ │ 0 │ │ 0 │ │ 10 │ └───────────┘ - Expressions with multiple outputs can be automatically instantiated as Structs by enabling the setting - Config.set_auto_structify(True):- >>> with pl.Config(auto_structify=True): ... df.select( ... is_odd=(pl.col(pl.INTEGER_DTYPES) % 2).name.suffix("_is_odd"), ... ) ... shape: (3, 1) ┌───────────┐ │ is_odd │ │ --- │ │ struct[2] │ ╞═══════════╡ │ {1,0} │ │ {0,1} │ │ {1,0} │ └───────────┘ 
 - select_seq(
- *exprs: IntoExpr | Iterable[IntoExpr],
- **named_exprs: IntoExpr,
- Select columns from this DataFrame. - This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap. - Parameters:
- *exprs
- Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals. 
- **named_exprs
- Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used. 
 
 - See also 
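- A minimal sketch; apart from running the expressions sequentially, the behaviour matches select():

>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [6, 7, 8]})
>>> df.select_seq("foo", doubled=pl.col("bar") * 2).columns
['foo', 'doubled']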
 - set_sorted( ) DataFrame[source]
- Indicate that one or multiple columns are sorted. - Parameters:
- column
- Columns that are sorted 
- more_columns
- Additional columns that are sorted, specified as positional arguments. 
- descending
- Whether the columns are sorted in descending order. 
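- A minimal sketch; marking a column as sorted lets later operations (such as rolling) skip the sortedness check. It is assumed here that Series.flags exposes the sorted flag:

>>> df = pl.DataFrame({"dt": [1, 2, 5], "a": [3, 7, 5]}).set_sorted("dt")
>>> df["dt"].flags["SORTED_ASC"]
True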
 
 
 - property shape: tuple[int, int][source]
- Get the shape of the DataFrame. - Examples - >>> df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]}) >>> df.shape (5, 1) 
 - shift(n: int = 1, *, fill_value: IntoExpr | None = None) DataFrame[source]
- Shift values by the given number of indices. - Parameters:
- n
- Number of indices to shift forward. If a negative value is passed, values are shifted in the opposite direction instead. 
- fill_value
- Fill the resulting null values with this value. Accepts expression input. Non-expression inputs are parsed as literals. 
 
 - Notes - This method is similar to the - LAGoperation in SQL when the value for- nis positive. With a negative value for- n, it is similar to- LEAD.- Examples - By default, values are shifted forward by one index. - >>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [5, 6, 7, 8], ... } ... ) >>> df.shift() shape: (4, 2) ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════╪══════╡ │ null ┆ null │ │ 1 ┆ 5 │ │ 2 ┆ 6 │ │ 3 ┆ 7 │ └──────┴──────┘ - Pass a negative value to shift in the opposite direction instead. - >>> df.shift(-2) shape: (4, 2) ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════╪══════╡ │ 3 ┆ 7 │ │ 4 ┆ 8 │ │ null ┆ null │ │ null ┆ null │ └──────┴──────┘ - Specify - fill_valueto fill the resulting null values.- >>> df.shift(-2, fill_value=100) shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 3 ┆ 7 │ │ 4 ┆ 8 │ │ 100 ┆ 100 │ │ 100 ┆ 100 │ └─────┴─────┘ 
 - shift_and_fill( ) DataFrame[source]
- Shift values by the given number of places and fill the resulting null values. - Deprecated since version 0.19.12: Use shift() instead. - Parameters:
- fill_value
- Fill None values with this value. 
- n
- Number of places to shift (may be negative). 
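- As this method is deprecated, a minimal migration sketch; the same result should be obtained from shift() with its fill_value argument:

>>> df = pl.DataFrame({"a": [1, 2, 3, 4]})
>>> df.shift(2, fill_value=100).to_series().to_list()
[100, 100, 1, 2]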
 
 
 - shrink_to_fit(*, in_place: bool = False) Self[source]
- Shrink DataFrame memory usage. - Shrinks to fit the exact capacity needed to hold the data. 
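- A minimal usage sketch; after slicing a larger frame, shrinking may reduce the allocated capacity (actual sizes are allocation-dependent, so none are shown):

>>> big = pl.DataFrame({"a": list(range(1_000))})
>>> small = big.head(10).shrink_to_fit()
>>> small.shape
(10, 1)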
 - slice(offset: int, length: int | None = None) Self[source]
- Get a slice of this DataFrame. - Parameters:
- offset
- Start index. Negative indexing is supported. 
- length
- Length of the slice. If set to - None, all rows starting at the offset will be selected.
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.slice(1, 2) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞═════╪═════╪═════╡ │ 2 ┆ 7.0 ┆ b │ │ 3 ┆ 8.0 ┆ c │ └─────┴─────┴─────┘ 
 - sort(
- by: IntoExpr | Iterable[IntoExpr],
- *more_by: IntoExpr,
- descending: bool | Sequence[bool] = False,
- nulls_last: bool = False,
- Sort the dataframe by the given columns. - Parameters:
- by
- Column(s) to sort by. Accepts expression input. Strings are parsed as column names. 
- *more_by
- Additional columns to sort by, specified as positional arguments. 
- descending
- Sort in descending order. When sorting by multiple columns, can be specified per column by passing a sequence of booleans. 
- nulls_last
- Place null values last. 
 
 - Examples - Pass a single column name to sort by that column. - >>> df = pl.DataFrame( ... { ... "a": [1, 2, None], ... "b": [6.0, 5.0, 4.0], ... "c": ["a", "c", "b"], ... } ... ) >>> df.sort("a") shape: (3, 3) ┌──────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞══════╪═════╪═════╡ │ null ┆ 4.0 ┆ b │ │ 1 ┆ 6.0 ┆ a │ │ 2 ┆ 5.0 ┆ c │ └──────┴─────┴─────┘ - Sorting by expressions is also supported. - >>> df.sort(pl.col("a") + pl.col("b") * 2, nulls_last=True) shape: (3, 3) ┌──────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞══════╪═════╪═════╡ │ 2 ┆ 5.0 ┆ c │ │ 1 ┆ 6.0 ┆ a │ │ null ┆ 4.0 ┆ b │ └──────┴─────┴─────┘ - Sort by multiple columns by passing a list of columns. - >>> df.sort(["c", "a"], descending=True) shape: (3, 3) ┌──────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞══════╪═════╪═════╡ │ 2 ┆ 5.0 ┆ c │ │ null ┆ 4.0 ┆ b │ │ 1 ┆ 6.0 ┆ a │ └──────┴─────┴─────┘ - Or use positional arguments to sort by multiple columns in the same way. - >>> df.sort("c", "a", descending=[False, True]) shape: (3, 3) ┌──────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞══════╪═════╪═════╡ │ 1 ┆ 6.0 ┆ a │ │ null ┆ 4.0 ┆ b │ │ 2 ┆ 5.0 ┆ c │ └──────┴─────┴─────┘ 
 - std(ddof: int = 1) Self[source]
- Aggregate the columns of this DataFrame to their standard deviation value. - Parameters:
- ddof
- “Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1. 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.std() shape: (1, 3) ┌─────┬─────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str │ ╞═════╪═════╪══════╡ │ 1.0 ┆ 1.0 ┆ null │ └─────┴─────┴──────┘ >>> df.std(ddof=0) shape: (1, 3) ┌──────────┬──────────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str │ ╞══════════╪══════════╪══════╡ │ 0.816497 ┆ 0.816497 ┆ null │ └──────────┴──────────┴──────┘ 
 - sum(
- *,
- axis: Literal[0] = None,
- null_strategy: NullStrategy = 'ignore',
- sum(*, axis: Literal[1], null_strategy: NullStrategy = 'ignore') Series
- sum(*, axis: int, null_strategy: NullStrategy = 'ignore') Self | Series
- Aggregate the columns of this DataFrame to their sum value. - Parameters:
- axis
- Either 0 (vertical) or 1 (horizontal). - Deprecated since version 0.19.14: This argument will be removed in a future version. This method will only support vertical aggregation, as if - axis were set to - 0. To perform horizontal aggregation, use - sum_horizontal().
- null_strategy{‘ignore’, ‘propagate’}
- This argument is only used if - axis == 1.- Deprecated since version 0.19.14: This argument will be removed in a future version. 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.sum() shape: (1, 3) ┌─────┬─────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪══════╡ │ 6 ┆ 21 ┆ null │ └─────┴─────┴──────┘ 
 - sum_horizontal(*, ignore_nulls: bool = True) Series[source]
- Sum all values horizontally across columns. - Parameters:
- ignore_nulls
- Ignore null values (default). If set to - False, any null value in the input will lead to a null output.
 
- Returns:
- Series
- A Series named - "sum".
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [4.0, 5.0, 6.0], ... } ... ) >>> df.sum_horizontal() shape: (3,) Series: 'sum' [f64] [ 5.0 7.0 9.0 ] 
 - tail(n: int = 5) Self[source]
- Get the last - n rows. - Parameters:
- n
- Number of rows to return. If a negative value is passed, return all rows except the first - abs(n).
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 4, 5], ... "bar": [6, 7, 8, 9, 10], ... "ham": ["a", "b", "c", "d", "e"], ... } ... ) >>> df.tail(3) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 3 ┆ 8 ┆ c │ │ 4 ┆ 9 ┆ d │ │ 5 ┆ 10 ┆ e │ └─────┴─────┴─────┘ - Pass a negative value to get all rows - exceptthe first- abs(n).- >>> df.tail(-3) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 4 ┆ 9 ┆ d │ │ 5 ┆ 10 ┆ e │ └─────┴─────┴─────┘ 
 - take_every(n: int) DataFrame[source]
- Take every nth row in the DataFrame and return as a new DataFrame. - Deprecated since version 0.19.14: This method has been renamed to - gather_every().- Parameters:
- n
- Gather every n-th row. 
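 - Examples - Since this method is deprecated, the sketch below (with illustrative data) uses the gather_every() replacement; take_every() behaves identically. - >>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]}) >>> df.gather_every(2) shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 5 │ │ 3 ┆ 7 │ └─────┴─────┘ 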
 
 
 - to_arrow() Table[source]
- Collect the underlying arrow arrays in an Arrow Table. - This operation is mostly zero copy. - Data types that do copy:
- CategoricalType 
 
 - Examples - >>> df = pl.DataFrame( ... {"foo": [1, 2, 3, 4, 5, 6], "bar": ["a", "b", "c", "d", "e", "f"]} ... ) >>> df.to_arrow() pyarrow.Table foo: int64 bar: large_string ---- foo: [[1,2,3,4,5,6]] bar: [["a","b","c","d","e","f"]] 
 - to_dict(as_series: Literal[True] = True) dict[str, Series][source]
- to_dict(as_series: Literal[False]) dict[str, list[Any]]
- to_dict(as_series: bool) dict[str, Series] | dict[str, list[Any]]
- Convert DataFrame to a dictionary mapping column name to values. - Parameters:
- as_series
- True -> Values are Series; False -> Values are List[Any] 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "A": [1, 2, 3, 4, 5], ... "fruits": ["banana", "banana", "apple", "apple", "banana"], ... "B": [5, 4, 3, 2, 1], ... "cars": ["beetle", "audi", "beetle", "beetle", "beetle"], ... "optional": [28, 300, None, 2, -30], ... } ... ) >>> df shape: (5, 5) ┌─────┬────────┬─────┬────────┬──────────┐ │ A ┆ fruits ┆ B ┆ cars ┆ optional │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ i64 ┆ str ┆ i64 │ ╞═════╪════════╪═════╪════════╪══════════╡ │ 1 ┆ banana ┆ 5 ┆ beetle ┆ 28 │ │ 2 ┆ banana ┆ 4 ┆ audi ┆ 300 │ │ 3 ┆ apple ┆ 3 ┆ beetle ┆ null │ │ 4 ┆ apple ┆ 2 ┆ beetle ┆ 2 │ │ 5 ┆ banana ┆ 1 ┆ beetle ┆ -30 │ └─────┴────────┴─────┴────────┴──────────┘ >>> df.to_dict(as_series=False) {'A': [1, 2, 3, 4, 5], 'fruits': ['banana', 'banana', 'apple', 'apple', 'banana'], 'B': [5, 4, 3, 2, 1], 'cars': ['beetle', 'audi', 'beetle', 'beetle', 'beetle'], 'optional': [28, 300, None, 2, -30]} >>> df.to_dict(as_series=True) {'A': shape: (5,) Series: 'A' [i64] [ 1 2 3 4 5 ], 'fruits': shape: (5,) Series: 'fruits' [str] [ "banana" "banana" "apple" "apple" "banana" ], 'B': shape: (5,) Series: 'B' [i64] [ 5 4 3 2 1 ], 'cars': shape: (5,) Series: 'cars' [str] [ "beetle" "audi" "beetle" "beetle" "beetle" ], 'optional': shape: (5,) Series: 'optional' [i64] [ 28 300 null 2 -30 ]} 
 - to_dicts() list[dict[str, Any]][source]
- Convert every row to a dictionary of Python-native values. - Notes - If you have - ns-precision temporal values you should be aware that Python natively only supports up to- μs-precision;- ns-precision values will be truncated to microseconds on conversion to Python. If this matters to your use-case you should export to a different format (such as Arrow or NumPy).- Examples - >>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> df.to_dicts() [{'foo': 1, 'bar': 4}, {'foo': 2, 'bar': 5}, {'foo': 3, 'bar': 6}] 
 - to_dummies(
- columns: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None,
- *,
- separator: str = '_',
- drop_first: bool = False,
- Convert categorical variables into dummy/indicator variables. - Parameters:
- columns
- Column name(s) or selector(s) that should be converted to dummy variables. If set to - None (default), convert all columns.
- separator
- Separator/delimiter used when generating column names. 
- drop_first
- Remove the first category from the variables being encoded. 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2], ... "bar": [3, 4], ... "ham": ["a", "b"], ... } ... ) >>> df.to_dummies() shape: (2, 6) ┌───────┬───────┬───────┬───────┬───────┬───────┐ │ foo_1 ┆ foo_2 ┆ bar_3 ┆ bar_4 ┆ ham_a ┆ ham_b │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ u8 ┆ u8 ┆ u8 ┆ u8 ┆ u8 ┆ u8 │ ╞═══════╪═══════╪═══════╪═══════╪═══════╪═══════╡ │ 1 ┆ 0 ┆ 1 ┆ 0 ┆ 1 ┆ 0 │ │ 0 ┆ 1 ┆ 0 ┆ 1 ┆ 0 ┆ 1 │ └───────┴───────┴───────┴───────┴───────┴───────┘ - >>> df.to_dummies(drop_first=True) shape: (2, 3) ┌───────┬───────┬───────┐ │ foo_2 ┆ bar_4 ┆ ham_b │ │ --- ┆ --- ┆ --- │ │ u8 ┆ u8 ┆ u8 │ ╞═══════╪═══════╪═══════╡ │ 0 ┆ 0 ┆ 0 │ │ 1 ┆ 1 ┆ 1 │ └───────┴───────┴───────┘ - >>> import polars.selectors as cs >>> df.to_dummies(cs.integer(), separator=":") shape: (2, 5) ┌───────┬───────┬───────┬───────┬─────┐ │ foo:1 ┆ foo:2 ┆ bar:3 ┆ bar:4 ┆ ham │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ u8 ┆ u8 ┆ u8 ┆ u8 ┆ str │ ╞═══════╪═══════╪═══════╪═══════╪═════╡ │ 1 ┆ 0 ┆ 1 ┆ 0 ┆ a │ │ 0 ┆ 1 ┆ 0 ┆ 1 ┆ b │ └───────┴───────┴───────┴───────┴─────┘ - >>> df.to_dummies(cs.integer(), drop_first=True, separator=":") shape: (2, 3) ┌───────┬───────┬─────┐ │ foo:2 ┆ bar:4 ┆ ham │ │ --- ┆ --- ┆ --- │ │ u8 ┆ u8 ┆ str │ ╞═══════╪═══════╪═════╡ │ 0 ┆ 0 ┆ a │ │ 1 ┆ 1 ┆ b │ └───────┴───────┴─────┘ 
 - to_init_repr(n: int = 1000) str[source]
- Convert DataFrame to instantiatable string representation. - Parameters:
- n
- Only use first n rows. 
 
 - Examples - >>> df = pl.DataFrame( ... [ ... pl.Series("foo", [1, 2, 3], dtype=pl.UInt8), ... pl.Series("bar", [6.0, 7.0, 8.0], dtype=pl.Float32), ... pl.Series("ham", ["a", "b", "c"], dtype=pl.Categorical), ... ] ... ) >>> print(df.to_init_repr()) pl.DataFrame( [ pl.Series("foo", [1, 2, 3], dtype=pl.UInt8), pl.Series("bar", [6.0, 7.0, 8.0], dtype=pl.Float32), pl.Series("ham", ['a', 'b', 'c'], dtype=pl.Categorical), ] ) - >>> df_from_str_repr = eval(df.to_init_repr()) >>> df_from_str_repr shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ u8 ┆ f32 ┆ cat │ ╞═════╪═════╪═════╡ │ 1 ┆ 6.0 ┆ a │ │ 2 ┆ 7.0 ┆ b │ │ 3 ┆ 8.0 ┆ c │ └─────┴─────┴─────┘ 
 - to_numpy(
- *,
- structured: bool = False,
- order: IndexOrder = 'fortran',
- Convert DataFrame to a 2D NumPy array. - This operation clones data. - Parameters:
- structured
- Optionally return a structured array, with field names and dtypes that correspond to the DataFrame schema. 
- order
- The index order of the returned NumPy array, either C-like or Fortran-like. In general, using the Fortran-like index order is faster. However, the C-like order might be more appropriate to use for downstream applications to prevent cloning data, e.g. when reshaping into a one-dimensional array. Note that this option only takes effect if - structured is set to - False and the DataFrame dtypes allow for a global dtype for all columns.
 
 - Notes - If you’re attempting to convert Utf8 to an array you’ll need to install - pyarrow.- Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.5, 7.0, 8.5], ... "ham": ["a", "b", "c"], ... }, ... schema_overrides={"foo": pl.UInt8, "bar": pl.Float32}, ... ) - Export to a standard 2D numpy array. - >>> df.to_numpy() array([[1, 6.5, 'a'], [2, 7.0, 'b'], [3, 8.5, 'c']], dtype=object) - Export to a structured array, which can better-preserve individual column data, such as name and dtype… - >>> df.to_numpy(structured=True) array([(1, 6.5, 'a'), (2, 7. , 'b'), (3, 8.5, 'c')], dtype=[('foo', 'u1'), ('bar', '<f4'), ('ham', '<U1')]) - …optionally zero-copying as a record array view: - >>> import numpy as np >>> df.to_numpy(structured=True).view(np.recarray) rec.array([(1, 6.5, 'a'), (2, 7. , 'b'), (3, 8.5, 'c')], dtype=[('foo', 'u1'), ('bar', '<f4'), ('ham', '<U1')]) 
 - to_pandas( ) DataFrame[source]
- Cast to a pandas DataFrame. - This requires that - pandasand- pyarroware installed. This operation clones data, unless- use_pyarrow_extension_array=True.- Parameters:
- use_pyarrow_extension_array
- Use PyArrow-backed extension arrays instead of NumPy arrays for each column of the pandas DataFrame; this allows zero-copy operations and preservation of null values. Subsequent operations on the resulting pandas DataFrame may trigger conversion to NumPy arrays if that operation is not supported by pyarrow compute functions. 
- **kwargs
- Arguments will be sent to - pyarrow.Table.to_pandas().
 
- Returns:
- DataFrame
- A pandas DataFrame containing the same data.
 - Examples - >>> import pandas >>> df1 = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> pandas_df1 = df1.to_pandas() >>> type(pandas_df1) <class 'pandas.core.frame.DataFrame'> >>> pandas_df1.dtypes foo int64 bar int64 ham object dtype: object >>> df2 = pl.DataFrame( ... { ... "foo": [1, 2, None], ... "bar": [6, None, 8], ... "ham": [None, "b", "c"], ... } ... ) >>> pandas_df2 = df2.to_pandas() >>> pandas_df2 foo bar ham 0 1.0 6.0 None 1 2.0 NaN b 2 NaN 8.0 c >>> pandas_df2.dtypes foo float64 bar float64 ham object dtype: object >>> pandas_df2_pa = df2.to_pandas( ... use_pyarrow_extension_array=True ... ) >>> pandas_df2_pa foo bar ham 0 1 6 <NA> 1 2 <NA> b 2 <NA> 8 c >>> pandas_df2_pa.dtypes foo int64[pyarrow] bar int64[pyarrow] ham large_string[pyarrow] dtype: object 
 - to_series(index: int = 0) Series[source]
- Select column as Series at index location. - Parameters:
- index
- Location of selection. 
 
 - See also - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.to_series(1) shape: (3,) Series: 'bar' [i64] [ 6 7 8 ] 
 - to_struct(name: str) Series[source]
- Convert a - DataFrameto a- Seriesof type- Struct.- Parameters:
- name
- Name for the struct Series 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4, 5], ... "b": ["one", "two", "three", "four", "five"], ... } ... ) >>> df.to_struct("nums") shape: (5,) Series: 'nums' [struct[2]] [ {1,"one"} {2,"two"} {3,"three"} {4,"four"} {5,"five"} ] 
 - top_k(
- k: int,
- *,
- by: IntoExpr | Iterable[IntoExpr],
- descending: bool | Sequence[bool] = False,
- nulls_last: bool = False,
- maintain_order: bool = False,
- Return the - k largest elements. - If - descending=True, the smallest elements will be given. - Parameters:
- k
- Number of rows to return. 
- by
- Column(s) included in sort order. Accepts expression input. Strings are parsed as column names. 
- descending
- Return the ‘k’ smallest. Top-k by multiple columns can be specified per column by passing a sequence of booleans. 
- nulls_last
- Place null values last. 
- maintain_order
- Whether the order should be maintained if elements are equal. Note that if set to - True, streaming is not possible and performance might be worse since this requires a stable search.
 
 - See also - Examples - >>> df = pl.DataFrame( ... { ... "a": ["a", "b", "a", "b", "b", "c"], ... "b": [2, 1, 1, 3, 2, 1], ... } ... ) - Get the rows which contain the 4 largest values in column b. - >>> df.top_k(4, by="b") shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ b ┆ 3 │ │ a ┆ 2 │ │ b ┆ 2 │ │ b ┆ 1 │ └─────┴─────┘ - Get the rows which contain the 4 largest values when sorting on column b and a. - >>> df.top_k(4, by=["b", "a"]) shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ b ┆ 3 │ │ b ┆ 2 │ │ a ┆ 2 │ │ c ┆ 1 │ └─────┴─────┘ 
 - transpose(
- *,
- include_header: bool = False,
- header_name: str = 'column',
- column_names: str | Iterable[str] | None = None,
- Transpose a DataFrame over the diagonal. - Parameters:
- include_header
- If set, the column names will be added as first column. 
- header_name
- If - include_header is set, this determines the name of the column that will be inserted.
- column_names
- Optional iterable yielding strings or a string naming an existing column. These will name the value (non-header) columns in the transposed data. 
 
- Returns:
- DataFrame
 
 - Notes - This is a very expensive operation. Perhaps you can do it differently. - Examples - >>> df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]}) >>> df.transpose(include_header=True) shape: (2, 4) ┌────────┬──────────┬──────────┬──────────┐ │ column ┆ column_0 ┆ column_1 ┆ column_2 │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ i64 │ ╞════════╪══════════╪══════════╪══════════╡ │ a ┆ 1 ┆ 2 ┆ 3 │ │ b ┆ 1 ┆ 2 ┆ 3 │ └────────┴──────────┴──────────┴──────────┘ - Replace the auto-generated column names with a list - >>> df.transpose(include_header=False, column_names=["a", "b", "c"]) shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ 1 ┆ 2 ┆ 3 │ │ 1 ┆ 2 ┆ 3 │ └─────┴─────┴─────┘ - Include the header as a separate column - >>> df.transpose( ... include_header=True, header_name="foo", column_names=["a", "b", "c"] ... ) shape: (2, 4) ┌─────┬─────┬─────┬─────┐ │ foo ┆ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╪═════╡ │ a ┆ 1 ┆ 2 ┆ 3 │ │ b ┆ 1 ┆ 2 ┆ 3 │ └─────┴─────┴─────┴─────┘ - Replace the auto-generated column with column names from a generator function - >>> def name_generator(): ... base_name = "my_column_" ... count = 0 ... while True: ... yield f"{base_name}{count}" ... count += 1 ... >>> df.transpose(include_header=False, column_names=name_generator()) shape: (2, 3) ┌─────────────┬─────────────┬─────────────┐ │ my_column_0 ┆ my_column_1 ┆ my_column_2 │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════════════╪═════════════╪═════════════╡ │ 1 ┆ 2 ┆ 3 │ │ 1 ┆ 2 ┆ 3 │ └─────────────┴─────────────┴─────────────┘ - Use an existing column as the new column names - >>> df = pl.DataFrame(dict(id=["a", "b", "c"], col1=[1, 3, 2], col2=[3, 4, 6])) >>> df.transpose(column_names="id") shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ 1 ┆ 3 ┆ 2 │ │ 3 ┆ 4 ┆ 6 │ └─────┴─────┴─────┘ >>> df.transpose(include_header=True, header_name="new_id", column_names="id") shape: (2, 4) ┌────────┬─────┬─────┬─────┐ │ new_id ┆ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ i64 │ ╞════════╪═════╪═════╪═════╡ │ col1 ┆ 1 ┆ 3 ┆ 2 │ │ col2 ┆ 3 ┆ 4 ┆ 6 │ └────────┴─────┴─────┴─────┘ 
 - unique(
- subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None,
- *,
- keep: UniqueKeepStrategy = 'any',
- maintain_order: bool = False,
- Drop duplicate rows from this dataframe. - Parameters:
- subset
- Column name(s) or selector(s) to consider when identifying duplicate rows. If set to - None (default), use all columns.
- keep{‘first’, ‘last’, ‘any’, ‘none’}
- Which of the duplicate rows to keep.
- ‘any’: Does not give any guarantee of which row is kept. This allows more optimizations.
- ‘none’: Don’t keep duplicate rows. 
- ‘first’: Keep first unique row. 
- ‘last’: Keep last unique row. 
 
- maintain_order
- Keep the same order as the original DataFrame. This is more expensive to compute. Setting this to - True blocks the possibility of running on the streaming engine.
 
- Returns:
- DataFrame
- DataFrame with unique rows. 
 
 - Warning - This method will fail if there is a column of type - Listin the DataFrame or subset.- Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 1], ... "bar": ["a", "a", "a", "a"], ... "ham": ["b", "b", "b", "b"], ... } ... ) >>> df.unique(maintain_order=True) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ a ┆ b │ │ 2 ┆ a ┆ b │ │ 3 ┆ a ┆ b │ └─────┴─────┴─────┘ >>> df.unique(subset=["bar", "ham"], maintain_order=True) shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ a ┆ b │ └─────┴─────┴─────┘ >>> df.unique(keep="last", maintain_order=True) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ str │ ╞═════╪═════╪═════╡ │ 2 ┆ a ┆ b │ │ 3 ┆ a ┆ b │ │ 1 ┆ a ┆ b │ └─────┴─────┴─────┘ 
 - unnest(
- columns: ColumnNameOrSelector | Collection[ColumnNameOrSelector],
- *more_columns: ColumnNameOrSelector,
- Decompose struct columns into separate columns for each of their fields. - The new columns will be inserted into the dataframe at the location of the struct column. - Parameters:
- columns
- Name of the struct column(s) that should be unnested. 
- *more_columns
- Additional columns to unnest, specified as positional arguments. 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "before": ["foo", "bar"], ... "t_a": [1, 2], ... "t_b": ["a", "b"], ... "t_c": [True, None], ... "t_d": [[1, 2], [3]], ... "after": ["baz", "womp"], ... } ... ).select("before", pl.struct(pl.col("^t_.$")).alias("t_struct"), "after") >>> df shape: (2, 3) ┌────────┬─────────────────────┬───────┐ │ before ┆ t_struct ┆ after │ │ --- ┆ --- ┆ --- │ │ str ┆ struct[4] ┆ str │ ╞════════╪═════════════════════╪═══════╡ │ foo ┆ {1,"a",true,[1, 2]} ┆ baz │ │ bar ┆ {2,"b",null,[3]} ┆ womp │ └────────┴─────────────────────┴───────┘ >>> df.unnest("t_struct") shape: (2, 6) ┌────────┬─────┬─────┬──────┬───────────┬───────┐ │ before ┆ t_a ┆ t_b ┆ t_c ┆ t_d ┆ after │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ str ┆ bool ┆ list[i64] ┆ str │ ╞════════╪═════╪═════╪══════╪═══════════╪═══════╡ │ foo ┆ 1 ┆ a ┆ true ┆ [1, 2] ┆ baz │ │ bar ┆ 2 ┆ b ┆ null ┆ [3] ┆ womp │ └────────┴─────┴─────┴──────┴───────────┴───────┘ 
 - unstack(
- step: int,
- how: UnstackDirection = 'vertical',
- columns: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None,
- fill_values: list[Any] | None = None,
- Unstack a long table to a wide form without doing an aggregation. - This can be much faster than a pivot, because it can skip the grouping phase. - Parameters:
- step
- Number of rows in the unstacked frame. 
- how{ ‘vertical’, ‘horizontal’ }
- Direction of the unstack. 
- columns
- Column name(s) or selector(s) to include in the operation. If set to - None (default), use all columns.
- fill_values
- Fill values that don’t fit the new size with this value. 
 
 - Warning - This functionality is experimental and may be subject to changes without it being considered a breaking change. - Examples - >>> from string import ascii_uppercase >>> df = pl.DataFrame( ... { ... "x": list(ascii_uppercase[0:8]), ... "y": pl.int_range(1, 9, eager=True), ... } ... ).with_columns( ... z=pl.int_ranges(pl.col("y"), pl.col("y") + 2, dtype=pl.UInt8), ... ) >>> df shape: (8, 3) ┌─────┬─────┬──────────┐ │ x ┆ y ┆ z │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ list[u8] │ ╞═════╪═════╪══════════╡ │ A ┆ 1 ┆ [1, 2] │ │ B ┆ 2 ┆ [2, 3] │ │ C ┆ 3 ┆ [3, 4] │ │ D ┆ 4 ┆ [4, 5] │ │ E ┆ 5 ┆ [5, 6] │ │ F ┆ 6 ┆ [6, 7] │ │ G ┆ 7 ┆ [7, 8] │ │ H ┆ 8 ┆ [8, 9] │ └─────┴─────┴──────────┘ >>> df.unstack(step=4, how="vertical") shape: (4, 6) ┌─────┬─────┬─────┬─────┬──────────┬──────────┐ │ x_0 ┆ x_1 ┆ y_0 ┆ y_1 ┆ z_0 ┆ z_1 │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ i64 ┆ i64 ┆ list[u8] ┆ list[u8] │ ╞═════╪═════╪═════╪═════╪══════════╪══════════╡ │ A ┆ E ┆ 1 ┆ 5 ┆ [1, 2] ┆ [5, 6] │ │ B ┆ F ┆ 2 ┆ 6 ┆ [2, 3] ┆ [6, 7] │ │ C ┆ G ┆ 3 ┆ 7 ┆ [3, 4] ┆ [7, 8] │ │ D ┆ H ┆ 4 ┆ 8 ┆ [4, 5] ┆ [8, 9] │ └─────┴─────┴─────┴─────┴──────────┴──────────┘ >>> df.unstack(step=2, how="horizontal") shape: (4, 6) ┌─────┬─────┬─────┬─────┬──────────┬──────────┐ │ x_0 ┆ x_1 ┆ y_0 ┆ y_1 ┆ z_0 ┆ z_1 │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ i64 ┆ i64 ┆ list[u8] ┆ list[u8] │ ╞═════╪═════╪═════╪═════╪══════════╪══════════╡ │ A ┆ B ┆ 1 ┆ 2 ┆ [1, 2] ┆ [2, 3] │ │ C ┆ D ┆ 3 ┆ 4 ┆ [3, 4] ┆ [4, 5] │ │ E ┆ F ┆ 5 ┆ 6 ┆ [5, 6] ┆ [6, 7] │ │ G ┆ H ┆ 7 ┆ 8 ┆ [7, 8] ┆ [8, 9] │ └─────┴─────┴─────┴─────┴──────────┴──────────┘ >>> import polars.selectors as cs >>> df.unstack(step=5, columns=cs.numeric(), fill_values=0) shape: (5, 2) ┌─────┬─────┐ │ y_0 ┆ y_1 │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 6 │ │ 2 ┆ 7 │ │ 3 ┆ 8 │ │ 4 ┆ 0 │ │ 5 ┆ 0 │ └─────┴─────┘ 
 - update(
- other: DataFrame,
- on: str | Sequence[str] | None = None,
- left_on: str | Sequence[str] | None = None,
- right_on: str | Sequence[str] | None = None,
- how: Literal['left', 'inner', 'outer'] = 'left',
- include_nulls: bool | None = False,
- Update the values in this - DataFrame with the values in - other. - By default, null values in the right dataframe are ignored. Use - include_nulls=True to overwrite values in this frame with null values from the other frame. - Parameters:
- other
- DataFrame that will be used to update the values 
- on
- Column names that will be joined on. If none given the row count is used. 
- left_on
- Join column(s) of the left DataFrame. 
- right_on
- Join column(s) of the right DataFrame. 
- how{‘left’, ‘inner’, ‘outer’}
- ‘left’ will keep all rows from the left table; rows may be duplicated if multiple rows in the right frame match the left row’s key. 
- ‘inner’ keeps only those rows where the key exists in both frames. 
- ‘outer’ will update existing rows where the key matches while also adding any new rows contained in the given frame. 
 
- include_nulls
- If True, null values from the right dataframe will be used to update the left dataframe. 
 
 - Warning - This functionality is experimental and may change without it being considered a breaking change. - Notes - This is syntactic sugar for a left/inner join, with an optional coalesce when - include_nulls = False.- Examples - >>> df = pl.DataFrame( ... { ... "A": [1, 2, 3, 4], ... "B": [400, 500, 600, 700], ... } ... ) >>> df shape: (4, 2) ┌─────┬─────┐ │ A ┆ B │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 400 │ │ 2 ┆ 500 │ │ 3 ┆ 600 │ │ 4 ┆ 700 │ └─────┴─────┘ >>> new_df = pl.DataFrame( ... { ... "B": [-66, None, -99], ... "C": [5, 3, 1], ... } ... ) - Update - dfvalues with the non-null values in- new_df, by row index:- >>> df.update(new_df) shape: (4, 2) ┌─────┬─────┐ │ A ┆ B │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ -66 │ │ 2 ┆ 500 │ │ 3 ┆ -99 │ │ 4 ┆ 700 │ └─────┴─────┘ - Update - dfvalues with the non-null values in- new_df, by row index, but only keeping those rows that are common to both frames:- >>> df.update(new_df, how="inner") shape: (3, 2) ┌─────┬─────┐ │ A ┆ B │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ -66 │ │ 2 ┆ 500 │ │ 3 ┆ -99 │ └─────┴─────┘ - Update - dfvalues with the non-null values in- new_df, using an outer join strategy that defines explicit join columns in each frame:- >>> df.update(new_df, left_on=["A"], right_on=["C"], how="outer") shape: (5, 2) ┌─────┬─────┐ │ A ┆ B │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ -99 │ │ 2 ┆ 500 │ │ 3 ┆ 600 │ │ 4 ┆ 700 │ │ 5 ┆ -66 │ └─────┴─────┘ - Update - dfvalues including null values in- new_df, using an outer join strategy that defines explicit join columns in each frame:- >>> df.update( ... new_df, left_on="A", right_on="C", how="outer", include_nulls=True ... ) shape: (5, 2) ┌─────┬──────┐ │ A ┆ B │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪══════╡ │ 1 ┆ -99 │ │ 2 ┆ 500 │ │ 3 ┆ null │ │ 4 ┆ 700 │ │ 5 ┆ -66 │ └─────┴──────┘ 
 - upsample(
- time_column: str,
- *,
- every: str | timedelta,
- offset: str | timedelta | None = None,
- by: str | Sequence[str] | None = None,
- maintain_order: bool = False,
- Upsample a DataFrame at a regular frequency. - The - everyand- offsetarguments are created with the following string language:- 1ns (1 nanosecond) 
- 1us (1 microsecond) 
- 1ms (1 millisecond) 
- 1s (1 second) 
- 1m (1 minute) 
- 1h (1 hour) 
- 1d (1 calendar day) 
- 1w (1 calendar week) 
- 1mo (1 calendar month) 
- 1q (1 calendar quarter) 
- 1y (1 calendar year) 
- 1i (1 index count) 
 - Or combine them: - “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds 
 - By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”. - Parameters:
- time_column
- Time column that will be used to determine a date_range. Note that this column has to be sorted for the output to make sense. 
- every
- Interval will start ‘every’ duration. 
- offset
- Change the start of the date_range by this offset. 
- by
- First group by these columns and then upsample for every group 
- maintain_order
- Keep the ordering predictable. This is slower. 
 
- Returns:
- DataFrame
- Result will be sorted by - time_column (but note that if - by columns are passed, it will only be sorted within each - by group).
 
 - Examples - Upsample a DataFrame by a certain interval. - >>> from datetime import datetime >>> df = pl.DataFrame( ... { ... "time": [ ... datetime(2021, 2, 1), ... datetime(2021, 4, 1), ... datetime(2021, 5, 1), ... datetime(2021, 6, 1), ... ], ... "groups": ["A", "B", "A", "B"], ... "values": [0, 1, 2, 3], ... } ... ).set_sorted("time") >>> df.upsample( ... time_column="time", every="1mo", by="groups", maintain_order=True ... ).select(pl.all().forward_fill()) shape: (7, 3) ┌─────────────────────┬────────┬────────┐ │ time ┆ groups ┆ values │ │ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ str ┆ i64 │ ╞═════════════════════╪════════╪════════╡ │ 2021-02-01 00:00:00 ┆ A ┆ 0 │ │ 2021-03-01 00:00:00 ┆ A ┆ 0 │ │ 2021-04-01 00:00:00 ┆ A ┆ 0 │ │ 2021-05-01 00:00:00 ┆ A ┆ 2 │ │ 2021-04-01 00:00:00 ┆ B ┆ 1 │ │ 2021-05-01 00:00:00 ┆ B ┆ 1 │ │ 2021-06-01 00:00:00 ┆ B ┆ 3 │ └─────────────────────┴────────┴────────┘ 
 - var(ddof: int = 1) Self[source]
- Aggregate the columns of this DataFrame to their variance value. - Parameters:
- ddof
- “Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1. 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.var() shape: (1, 3) ┌─────┬─────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str │ ╞═════╪═════╪══════╡ │ 1.0 ┆ 1.0 ┆ null │ └─────┴─────┴──────┘ >>> df.var(ddof=0) shape: (1, 3) ┌──────────┬──────────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str │ ╞══════════╪══════════╪══════╡ │ 0.666667 ┆ 0.666667 ┆ null │ └──────────┴──────────┴──────┘ 
 - vstack(other: DataFrame, *, in_place: bool = False) Self[source]
- Grow this DataFrame vertically by stacking a DataFrame to it. - Parameters:
- other
- DataFrame to stack. 
- in_place
- Modify in place. 
 
 - See also - Examples - >>> df1 = pl.DataFrame( ... { ... "foo": [1, 2], ... "bar": [6, 7], ... "ham": ["a", "b"], ... } ... ) >>> df2 = pl.DataFrame( ... { ... "foo": [3, 4], ... "bar": [8, 9], ... "ham": ["c", "d"], ... } ... ) >>> df1.vstack(df2) shape: (4, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ │ 4 ┆ 9 ┆ d │ └─────┴─────┴─────┘ 
 - property width: int[source]
- Get the width of the DataFrame. - Examples - >>> df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]}) >>> df.width 1 
 - with_columns(
- *exprs: IntoExpr | Iterable[IntoExpr],
- **named_exprs: IntoExpr,
- Add columns to this DataFrame. - Added columns will replace existing columns with the same name. - Parameters:
- *exprs
- Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals. 
- **named_exprs
- Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used. 
 
- Returns:
- DataFrame
- A new DataFrame with the columns added. 
 
 - Notes - Creating a new DataFrame using this method does not create a new copy of existing data. - Examples - Pass an expression to add it as a new column. - >>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [0.5, 4, 10, 13], ... "c": [True, True, False, True], ... } ... ) >>> df.with_columns((pl.col("a") ** 2).alias("a^2")) shape: (4, 4) ┌─────┬──────┬───────┬──────┐ │ a ┆ b ┆ c ┆ a^2 │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool ┆ f64 │ ╞═════╪══════╪═══════╪══════╡ │ 1 ┆ 0.5 ┆ true ┆ 1.0 │ │ 2 ┆ 4.0 ┆ true ┆ 4.0 │ │ 3 ┆ 10.0 ┆ false ┆ 9.0 │ │ 4 ┆ 13.0 ┆ true ┆ 16.0 │ └─────┴──────┴───────┴──────┘ - Added columns will replace existing columns with the same name. - >>> df.with_columns(pl.col("a").cast(pl.Float64)) shape: (4, 3) ┌─────┬──────┬───────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ bool │ ╞═════╪══════╪═══════╡ │ 1.0 ┆ 0.5 ┆ true │ │ 2.0 ┆ 4.0 ┆ true │ │ 3.0 ┆ 10.0 ┆ false │ │ 4.0 ┆ 13.0 ┆ true │ └─────┴──────┴───────┘ - Multiple columns can be added by passing a list of expressions. - >>> df.with_columns( ... [ ... (pl.col("a") ** 2).alias("a^2"), ... (pl.col("b") / 2).alias("b/2"), ... (pl.col("c").not_()).alias("not c"), ... ] ... ) shape: (4, 6) ┌─────┬──────┬───────┬──────┬──────┬───────┐ │ a ┆ b ┆ c ┆ a^2 ┆ b/2 ┆ not c │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool ┆ f64 ┆ f64 ┆ bool │ ╞═════╪══════╪═══════╪══════╪══════╪═══════╡ │ 1 ┆ 0.5 ┆ true ┆ 1.0 ┆ 0.25 ┆ false │ │ 2 ┆ 4.0 ┆ true ┆ 4.0 ┆ 2.0 ┆ false │ │ 3 ┆ 10.0 ┆ false ┆ 9.0 ┆ 5.0 ┆ true │ │ 4 ┆ 13.0 ┆ true ┆ 16.0 ┆ 6.5 ┆ false │ └─────┴──────┴───────┴──────┴──────┴───────┘ - Multiple columns also can be added using positional arguments instead of a list. - >>> df.with_columns( ... (pl.col("a") ** 2).alias("a^2"), ... (pl.col("b") / 2).alias("b/2"), ... (pl.col("c").not_()).alias("not c"), ... ) shape: (4, 6) ┌─────┬──────┬───────┬──────┬──────┬───────┐ │ a ┆ b ┆ c ┆ a^2 ┆ b/2 ┆ not c │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool ┆ f64 ┆ f64 ┆ bool │ ╞═════╪══════╪═══════╪══════╪══════╪═══════╡ │ 1 ┆ 0.5 ┆ true ┆ 1.0 ┆ 0.25 ┆ false │ │ 2 ┆ 4.0 ┆ true ┆ 4.0 ┆ 2.0 ┆ false │ │ 3 ┆ 10.0 ┆ false ┆ 9.0 ┆ 5.0 ┆ true │ │ 4 ┆ 13.0 ┆ true ┆ 16.0 ┆ 6.5 ┆ false │ └─────┴──────┴───────┴──────┴──────┴───────┘ - Use keyword arguments to easily name your expression inputs. - >>> df.with_columns( ... ab=pl.col("a") * pl.col("b"), ... not_c=pl.col("c").not_(), ... ) shape: (4, 5) ┌─────┬──────┬───────┬──────┬───────┐ │ a ┆ b ┆ c ┆ ab ┆ not_c │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool ┆ f64 ┆ bool │ ╞═════╪══════╪═══════╪══════╪═══════╡ │ 1 ┆ 0.5 ┆ true ┆ 0.5 ┆ false │ │ 2 ┆ 4.0 ┆ true ┆ 8.0 ┆ false │ │ 3 ┆ 10.0 ┆ false ┆ 30.0 ┆ true │ │ 4 ┆ 13.0 ┆ true ┆ 52.0 ┆ false │ └─────┴──────┴───────┴──────┴───────┘ - Expressions with multiple outputs can be automatically instantiated as Structs by enabling the setting - Config.set_auto_structify(True):- >>> with pl.Config(auto_structify=True): ... df.drop("c").with_columns( ... diffs=pl.col(["a", "b"]).diff().name.suffix("_diff"), ... ) ... shape: (4, 3) ┌─────┬──────┬─────────────┐ │ a ┆ b ┆ diffs │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ struct[2] │ ╞═════╪══════╪═════════════╡ │ 1 ┆ 0.5 ┆ {null,null} │ │ 2 ┆ 4.0 ┆ {1,3.5} │ │ 3 ┆ 10.0 ┆ {1,6.0} │ │ 4 ┆ 13.0 ┆ {1,3.0} │ └─────┴──────┴─────────────┘ 
 - with_columns_seq(
- *exprs: IntoExpr | Iterable[IntoExpr],
- **named_exprs: IntoExpr,
- Add columns to this DataFrame. - Added columns will replace existing columns with the same name. - This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap. - Parameters:
- *exprs
- Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals. 
- **named_exprs
- Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used. 
 
- Returns:
- DataFrame
- A new DataFrame with the columns added. 
 
 - See also 
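 - Examples - Usage mirrors with_columns(); the expressions are simply evaluated one after another (illustrative frame and column names). - >>> df = pl.DataFrame({"a": [1, 2], "b": [3, 4]}) >>> df.with_columns_seq((pl.col("a") * 2).alias("a2"), b_plus_one=pl.col("b") + 1) shape: (2, 4) ┌─────┬─────┬─────┬────────────┐ │ a ┆ b ┆ a2 ┆ b_plus_one │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╪════════════╡ │ 1 ┆ 3 ┆ 2 ┆ 4 │ │ 2 ┆ 4 ┆ 4 ┆ 5 │ └─────┴─────┴─────┴────────────┘ 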
 - with_row_count(name: str = 'row_nr', offset: int = 0) Self[source]
- Add a column at index 0 that counts the rows. - Parameters:
- name
- Name of the column to add. 
- offset
- Start the row count at this offset. Default = 0 
 
 - Examples - >>> df = pl.DataFrame( ... { ... "a": [1, 3, 5], ... "b": [2, 4, 6], ... } ... ) >>> df.with_row_count() shape: (3, 3) ┌────────┬─────┬─────┐ │ row_nr ┆ a ┆ b │ │ --- ┆ --- ┆ --- │ │ u32 ┆ i64 ┆ i64 │ ╞════════╪═════╪═════╡ │ 0 ┆ 1 ┆ 2 │ │ 1 ┆ 3 ┆ 4 │ │ 2 ┆ 5 ┆ 6 │ └────────┴─────┴─────┘ 
 - write_avro(
- file: BinaryIO | BytesIO | str | Path,
- compression: AvroCompression = 'uncompressed',
- name: str = '',
- Write to Apache Avro file. - Parameters:
- file
- File path or writeable file-like object to which the data will be written. 
- compression{‘uncompressed’, ‘snappy’, ‘deflate’}
- Compression method. Defaults to “uncompressed”. 
- name
- Schema name. Defaults to empty string. 
 
 - Examples - >>> import pathlib >>> >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 4, 5], ... "bar": [6, 7, 8, 9, 10], ... "ham": ["a", "b", "c", "d", "e"], ... } ... ) >>> path: pathlib.Path = dirpath / "new_file.avro" >>> df.write_avro(path) 
 - write_csv(
- file: None = None,
- *,
- include_bom: bool = False,
- include_header: bool = True,
- separator: str = ',',
- line_terminator: str = '\n',
- quote_char: str = '"',
- batch_size: int = 1024,
- datetime_format: str | None = None,
- date_format: str | None = None,
- time_format: str | None = None,
- float_precision: int | None = None,
- null_value: str | None = None,
- quote_style: CsvQuoteStyle | None = None,
- write_csv(
- file: BytesIO | TextIOWrapper | str | Path,
- *,
- include_bom: bool = False,
- include_header: bool = True,
- separator: str = ',',
- line_terminator: str = '\n',
- quote_char: str = '"',
- batch_size: int = 1024,
- datetime_format: str | None = None,
- date_format: str | None = None,
- time_format: str | None = None,
- float_precision: int | None = None,
- null_value: str | None = None,
- quote_style: CsvQuoteStyle | None = None,
- Write to comma-separated values (CSV) file. - Parameters:
- file
- File path or writeable file-like object to which the result will be written. If set to - None (default), the output is returned as a string instead.
- include_bom
- Whether to include UTF-8 BOM in the CSV output. 
- include_header
- Whether to include header in the CSV output. 
- separator
- Separate CSV fields with this symbol. 
- line_terminator
- String used to end each row. 
- quote_char
- Byte to use as quoting character. 
- batch_size
- Number of rows that will be processed per thread. 
- datetime_format
- A format string, with the specifiers defined by the chrono Rust crate. If no format is specified, the default fractional-second precision is inferred from the maximum timeunit found in the frame’s Datetime cols (if any). 
- date_format
- A format string, with the specifiers defined by the chrono Rust crate. 
- time_format
- A format string, with the specifiers defined by the chrono Rust crate. 
- float_precision
- Number of decimal places to write, applied to both - Float32and- Float64datatypes.
- null_value
- A string representing null values (defaulting to the empty string). 
- quote_style{‘necessary’, ‘always’, ‘non_numeric’, ‘never’}
- Determines the quoting strategy used. - necessary (default): This puts quotes around fields only when necessary. They are necessary when fields contain a quote, separator or record terminator. Quotes are also necessary when writing an empty record (which is indistinguishable from a record with one empty field). This is the default. 
- always: This puts quotes around every field. Always. 
- never: This never puts quotes around fields, even if that results in invalid CSV data (e.g.: by not quoting strings containing the separator). 
- non_numeric: This puts quotes around all fields that are non-numeric. Namely, when writing a field that does not parse as a valid float or integer, then quotes will be used even if they aren’t strictly necessary. 
 
 
 - Examples - >>> import pathlib >>> >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 4, 5], ... "bar": [6, 7, 8, 9, 10], ... "ham": ["a", "b", "c", "d", "e"], ... } ... ) >>> path: pathlib.Path = dirpath / "new_file.csv" >>> df.write_csv(path, separator=",") 
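 - When - file is - None the CSV text is returned as a string, which also makes the effect of - quote_style easy to see; a small sketch (quoting of the header and numeric fields follows the rules listed above): - >>> df = pl.DataFrame({"x": ["a", "b"], "y": [1, 2]}) >>> df.write_csv() 'x,y\na,1\nb,2\n' >>> df.write_csv(quote_style="always") '"x","y"\n"a","1"\n"b","2"\n' 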
 - write_database(
- table_name: str,
- connection: str,
- *,
- if_exists: DbWriteMode = 'fail',
- engine: DbWriteEngine = 'sqlalchemy',
- Write a polars frame to a database. - Parameters:
- table_name
- Schema-qualified name of the table to create or append to in the target SQL database. If your table name contains special characters, it should be quoted. 
- connection
- Connection URI string, for example: - “postgresql://user:pass@server:port/database” 
- “sqlite:////path/to/database.db” 
 
- if_exists{‘append’, ‘replace’, ‘fail’}
- The insert mode: - ‘replace’ will create a new database table, overwriting an existing one. 
- ‘append’ will append to an existing table. 
- ‘fail’ will fail if table already exists. 
 
- engine{‘sqlalchemy’, ‘adbc’}
- Select the engine used for writing the data. 
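 - Examples - A minimal sketch; the table name and connection URI below are placeholders, and the selected engine (sqlalchemy by default) must be installed. - >>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": ["a", "b", "c"]}) >>> df.write_database( ... table_name="records", ... connection="sqlite:///analytics.db", ... if_exists="replace", ... ) 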
 
 
 - write_delta(
- target: str | Path | deltalake.DeltaTable,
- *,
- mode: Literal['error', 'append', 'overwrite', 'ignore'] = 'error',
- overwrite_schema: bool = False,
- storage_options: dict[str, str] | None = None,
- delta_write_options: dict[str, Any] | None = None,
- Write DataFrame as delta table. - Parameters:
- target
- URI of a table or a DeltaTable object. 
- mode{‘error’, ‘append’, ‘overwrite’, ‘ignore’}
- How to handle existing data. - If ‘error’, throw an error if the table already exists (default). 
- If ‘append’, will add new data. 
- If ‘overwrite’, will replace table with new data. 
- If ‘ignore’, will not write anything if table already exists. 
 
- overwrite_schema
- If True, allows updating the schema of the table. 
- storage_options
- Extra options for the storage backends supported by - deltalake. For cloud storages, this may include configurations for authentication etc.
- delta_write_options
- Additional keyword arguments used while writing a Delta Lake table. See the deltalake documentation for a list of supported write options. 
 
- Raises:
- TypeError
- If the DataFrame contains unsupported data types. 
- ArrowInvalidError
- If the DataFrame contains data types that could not be cast to their primitive type. 
 
 - Notes - The Polars data types - Null,- Categoricaland- Timeare not supported by the delta protocol specification and will raise a TypeError.- Some other data types are not supported but have an associated primitive type to which they can be cast. This affects the following data types: - Unsigned integers 
- Datetime types with millisecond or nanosecond precision, or with time zone information 
 
 - Polars columns are always nullable. To write data to a delta table with non-nullable columns, a custom pyarrow schema has to be passed to the - delta_write_options. See the last example below.- Examples - Write a dataframe to the local filesystem as a Delta Lake table. - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 4, 5], ... "bar": [6, 7, 8, 9, 10], ... "ham": ["a", "b", "c", "d", "e"], ... } ... ) >>> table_path = "/path/to/delta-table/" >>> df.write_delta(table_path) - Append data to an existing Delta Lake table on the local filesystem. Note that this will fail if the schema of the new data does not match the schema of the existing table. - >>> df.write_delta(table_path, mode="append") - Overwrite a Delta Lake table as a new version. If the schemas of the new and old data are the same, setting - overwrite_schemais not required.- >>> existing_table_path = "/path/to/delta-table/" >>> df.write_delta( ... existing_table_path, mode="overwrite", overwrite_schema=True ... ) - Write a dataframe as a Delta Lake table to a cloud object store like S3. - >>> table_path = "s3://bucket/prefix/to/delta-table/" >>> df.write_delta( ... table_path, ... storage_options={ ... "AWS_REGION": "THE_AWS_REGION", ... "AWS_ACCESS_KEY_ID": "THE_AWS_ACCESS_KEY_ID", ... "AWS_SECRET_ACCESS_KEY": "THE_AWS_SECRET_ACCESS_KEY", ... }, ... ) - Write DataFrame as a Delta Lake table with non-nullable columns. - >>> import pyarrow as pa >>> existing_table_path = "/path/to/delta-table/" >>> df.write_delta( ... existing_table_path, ... delta_write_options={ ... "schema": pa.schema([pa.field("foo", pa.int64(), nullable=False)]) ... }, ... ) 
 - write_excel(
- workbook: Workbook | BytesIO | Path | str | None = None,
- worksheet: str | None = None,
- *,
- position: tuple[int, int] | str = 'A1',
- table_style: str | dict[str, Any] | None = None,
- table_name: str | None = None,
- column_formats: ColumnFormatDict | None = None,
- dtype_formats: dict[OneOrMoreDataTypes, str] | None = None,
- conditional_formats: ConditionalFormatDict | None = None,
- header_format: dict[str, Any] | None = None,
- column_totals: ColumnTotalsDefinition | None = None,
- column_widths: ColumnWidthsDefinition | None = None,
- row_totals: RowTotalsDefinition | None = None,
- row_heights: dict[int | tuple[int, ...], int] | int | None = None,
- sparklines: dict[str, Sequence[str] | dict[str, Any]] | None = None,
- formulas: dict[str, str | dict[str, str]] | None = None,
- float_precision: int = 3,
- include_header: bool = True,
- autofilter: bool = True,
- autofit: bool = False,
- hidden_columns: Sequence[str] | SelectorType | None = None,
- hide_gridlines: bool = False,
- sheet_zoom: int | None = None,
- freeze_panes: str | tuple[int, int] | tuple[str, int, int] | tuple[int, int, int, int] | None = None,
- Write frame data to a table in an Excel workbook/worksheet. - Parameters:
- workbookWorkbook
- String name or path of the workbook to create, BytesIO object to write into, or an open - xlsxwriter.Workbook object that has not been closed. If None, writes to a - dataframe.xlsx workbook in the working directory.
- worksheetstr
- Name of target worksheet; if None, writes to “Sheet1” when creating a new workbook (note that writing to an existing workbook requires a valid existing -or new- worksheet name). 
- position{str, tuple}
- Table position in Excel notation (eg: “A1”), or a (row,col) integer tuple. 
- table_style{str, dict}
- A named Excel table style, such as “Table Style Medium 4”, or a dictionary of - {"key":value,} options containing one or more of the following keys: “style”, “first_column”, “last_column”, “banded_columns”, “banded_rows”.
- table_namestr
- Name of the output table object in the worksheet; can then be referred to in the sheet by formulae/charts, or by subsequent - xlsxwriter operations.
- column_formatsdict
- A - {colname(s):str,} or - {selector:str,} dictionary for applying an Excel format string to the given columns. Formats defined here (such as “dd/mm/yyyy”, “0.00%”, etc) will override any defined in - dtype_formats.
- dtype_formatsdict
- A - {dtype:str,} dictionary that sets the default Excel format for the given dtype. (This can be overridden on a per-column basis by the - column_formats param). It is also valid to use dtype groups such as - pl.FLOAT_DTYPES as the dtype/format key, to simplify setting uniform integer and float formats.
- conditional_formatsdict
- A dictionary of colname (or selector) keys to a format str, dict, or list that defines conditional formatting options for the specified columns. - If supplying a string typename, should be one of the valid - xlsxwritertypes such as “3_color_scale”, “data_bar”, etc.
- If supplying a dictionary you can make use of any/all - xlsxwritersupported options, including icon sets, formulae, etc.
- Supplying multiple columns as a tuple/key will apply a single format across all columns - this is effective in creating a heatmap, as the min/max values will be determined across the entire range, not per-column. 
- Finally, you can also supply a list made up from the above options in order to apply more than one conditional format to the same range. 
 
- header_formatdict
- A - {key:value,} dictionary of - xlsxwriter format options to apply to the table header row, such as - {"bold":True, "font_color":"#702963"}.
- column_totals{bool, list, dict}
- Add a column-total row to the exported table. - If True, all numeric columns will have an associated total using “sum”. 
- If passing a string, it must be one of the valid total function names and all numeric columns will have an associated total using that function. 
- If passing a list of colnames, only those given will have a total. 
- For more control, pass a - {colname:funcname,} dict. 
 - Valid total function names are “average”, “count_nums”, “count”, “max”, “min”, “std_dev”, “sum”, and “var”. 
- column_widths{dict, int}
- A - {colname:int,} or - {selector:int,} dict or a single integer that sets (or overrides if autofitting) table column widths, in integer pixel units. If given as an integer the same value is used for all table columns.
- row_totals{dict, bool}
- Add a row-total column to the right-hand side of the exported table. - If True, a column called “total” will be added at the end of the table that applies a “sum” function row-wise across all numeric columns. 
- If passing a list/sequence of column names, only the matching columns will participate in the sum. 
- Can also pass a - {colname:columns,} dictionary to create one or more total columns with distinct names, referencing different columns.
 
- row_heights{dict, int}
- An int or - {row_index:int,} dictionary that sets the height of the given rows (if providing a dictionary) or all rows (if providing an integer) that intersect with the table body (including any header and total row) in integer pixel units. Note that - row_index starts at zero and will be the header row (unless - include_header is False).
- sparklinesdict
- A - {colname:list,} or - {colname:dict,} dictionary defining one or more sparklines to be written into a new column in the table. - If passing a list of colnames (used as the source of the sparkline data) the default sparkline settings are used (eg: line chart with no markers). 
- For more control an - xlsxwriter-compliant options dict can be supplied, in which case three additional polars-specific keys are available: “columns”, “insert_before”, and “insert_after”. These allow you to define the source columns and position the sparkline(s) with respect to other table columns. If no position directive is given, sparklines are added to the end of the table (eg: to the far right) in the order they are given.
 
- formulasdict
- A - {colname:formula,} or - {colname:dict,} dictionary defining one or more formulas to be written into a new column in the table. Note that you are strongly advised to use structured references in your formulae wherever possible to make it simple to reference columns by name. - If providing a string formula (such as “=[@colx]*[@coly]”) the column will be added to the end of the table (eg: to the far right), after any default sparklines and before any row_totals. 
- For the most control supply an options dictionary with the following keys: “formula” (mandatory), one of “insert_before” or “insert_after”, and optionally “return_dtype”. The latter is used to appropriately format the output of the formula and allow it to participate in row/column totals. 
 
- float_precisionint
- Default number of decimals displayed for floating point columns (note that this is purely a formatting directive; the actual values are not rounded). 
- include_headerbool
- Indicate if the table should be created with a header row. 
- autofilterbool
- If the table has headers, provide autofilter capability. 
- autofitbool
- Calculate individual column widths from the data. 
- hidden_columnslist
- A list or selector representing table columns to hide in the worksheet. 
- hide_gridlinesbool
- Do not display any gridlines on the output worksheet. 
- sheet_zoomint
- Set the default zoom level of the output worksheet. 
- freeze_panesstr | (str, int, int) | (int, int) | (int, int, int, int)
- Freeze workbook panes. - If (row, col) is supplied, panes are split at the top-left corner of the specified cell, which are 0-indexed. Thus, to freeze only the top row, supply (1, 0). 
- Alternatively, cell notation can be used to supply the cell. For example, “A2” indicates the split occurs at the top-left of cell A2, which is the equivalent of (1, 0). 
- If (row, col, top_row, top_col) are supplied, the panes are split based on the - row and - col, and the scrolling region is initialized to begin at the - top_row and - top_col. Thus, to freeze only the top row and have the scrolling region begin at row 10, column D (5th col), supply (1, 0, 9, 4). Using cell notation for (row, col), supplying (“A2”, 9, 4) is equivalent.
 
 
 - Notes - A list of compatible - xlsxwriterformat property names can be found here: https://xlsxwriter.readthedocs.io/format.html#format-methods-and-format-properties
- Conditional formatting dictionaries should provide xlsxwriter-compatible definitions; polars will take care of how they are applied on the worksheet with respect to the relative sheet/column position. For supported options, see: https://xlsxwriter.readthedocs.io/working_with_conditional_formats.html 
- Similarly, sparkline option dictionaries should contain xlsxwriter-compatible key/values, as well as a mandatory polars “columns” key that defines the sparkline source data; these source columns should all be adjacent. Two other polars-specific keys are available to help define where the sparkline appears in the table: “insert_after”, and “insert_before”. The value associated with these keys should be the name of a column in the exported table. https://xlsxwriter.readthedocs.io/working_with_sparklines.html 
- Formula dictionaries must contain a key called “formula”, and then optional “insert_after”, “insert_before”, and/or “return_dtype” keys. These additional keys allow the column to be injected into the table at a specific location, and/or to define the return type of the formula (eg: “Int64”, “Float64”, etc). Formulas that refer to table columns should use Excel’s structured references syntax to ensure the formula is applied correctly and is table-relative. https://support.microsoft.com/en-us/office/using-structured-references-with-excel-tables-f5ed2452-2337-4f71-bed3-c8ae6d2b276e 
 - Examples - Instantiate a basic DataFrame: - >>> from random import uniform >>> from datetime import date >>> >>> df = pl.DataFrame( ... { ... "dtm": [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 3)], ... "num": [uniform(-500, 500), uniform(-500, 500), uniform(-500, 500)], ... "val": [10_000, 20_000, 30_000], ... } ... ) - Export to “dataframe.xlsx” (the default workbook name, if not specified) in the working directory, add column totals (“sum” by default) on all numeric columns, then autofit: - >>> df.write_excel(column_totals=True, autofit=True) - Write frame to a specific location on the sheet, set a named table style, apply US-style date formatting, increase default float precision, apply a non-default total function to a single column, autofit: - >>> df.write_excel( ... position="B4", ... table_style="Table Style Light 16", ... dtype_formats={pl.Date: "mm/dd/yyyy"}, ... column_totals={"num": "average"}, ... float_precision=6, ... autofit=True, ... ) - Write the same frame to a named worksheet twice, applying different styles and conditional formatting to each table, adding table titles using explicit xlsxwriter integration: - >>> from xlsxwriter import Workbook >>> with Workbook("multi_frame.xlsx") as wb: ... # basic/default conditional formatting ... df.write_excel( ... workbook=wb, ... worksheet="data", ... position=(3, 1), # specify position as (row,col) coordinates ... conditional_formats={"num": "3_color_scale", "val": "data_bar"}, ... table_style="Table Style Medium 4", ... ) ... ... # advanced conditional formatting, custom styles ... df.write_excel( ... workbook=wb, ... worksheet="data", ... position=(len(df) + 7, 1), ... table_style={ ... "style": "Table Style Light 4", ... "first_column": True, ... }, ... conditional_formats={ ... "num": { ... "type": "3_color_scale", ... "min_color": "#76933c", ... "mid_color": "#c4d79b", ... "max_color": "#ebf1de", ... }, ... "val": { ... "type": "data_bar", ... "data_bar_2010": True, ... "bar_color": "#9bbb59", ... "bar_negative_color_same": True, ... "bar_negative_border_color_same": True, ... }, ... }, ... column_formats={"num": "#,##0.000;[White]-#,##0.000"}, ... column_widths={"val": 125}, ... autofit=True, ... ) ... ... # add some table titles (with a custom format) ... ws = wb.get_worksheet_by_name("data") ... fmt_title = wb.add_format( ... { ... "font_color": "#4f6228", ... "font_size": 12, ... "italic": True, ... "bold": True, ... } ... ) ... ws.write(2, 1, "Basic/default conditional formatting", fmt_title) ... ws.write(len(df) + 6, 1, "Customised conditional formatting", fmt_title) ... - Export a table containing two different types of sparklines. Use default options for the “trend” sparkline and customised options (and positioning) for the “+/-” win_loss sparkline, with non-default integer dtype formatting, column totals, a subtle two-tone heatmap and hidden worksheet gridlines: - >>> df = pl.DataFrame( ... { ... "id": ["aaa", "bbb", "ccc", "ddd", "eee"], ... "q1": [100, 55, -20, 0, 35], ... "q2": [30, -10, 15, 60, 20], ... "q3": [-50, 0, 40, 80, 80], ... "q4": [75, 55, 25, -10, -55], ... } ... ) >>> df.write_excel( ... table_style="Table Style Light 2", ... # apply accounting format to all flavours of integer ... dtype_formats={pl.INTEGER_DTYPES: "#,##0_);(#,##0)"}, ... sparklines={ ... # default options; just provide source cols ... "trend": ["q1", "q2", "q3", "q4"], ... # customised sparkline type, with positioning directive ... "+/-": { ... "columns": ["q1", "q2", "q3", "q4"], ... "insert_after": "id", ... 
"type": "win_loss", ... }, ... }, ... conditional_formats={ ... # create a unified multi-column heatmap ... ("q1", "q2", "q3", "q4"): { ... "type": "2_color_scale", ... "min_color": "#95b3d7", ... "max_color": "#ffffff", ... }, ... }, ... column_totals=["q1", "q2", "q3", "q4"], ... row_totals=True, ... hide_gridlines=True, ... ) - Export a table containing an Excel formula-based column that calculates a standardised Z-score, showing use of structured references in conjunction with positioning directives, column totals, and custom formatting. - >>> df = pl.DataFrame( ... { ... "id": ["a123", "b345", "c567", "d789", "e101"], ... "points": [99, 45, 50, 85, 35], ... } ... ) >>> df.write_excel( ... table_style={ ... "style": "Table Style Medium 15", ... "first_column": True, ... }, ... column_formats={ ... "id": {"font": "Consolas"}, ... "points": {"align": "center"}, ... "z-score": {"align": "center"}, ... }, ... column_totals="average", ... formulas={ ... "z-score": { ... # use structured references to refer to the table columns and 'totals' row ... "formula": "=STANDARDIZE([@points], [[#Totals],[points]], STDEV([points]))", ... "insert_after": "points", ... "return_dtype": pl.Float64, ... } ... }, ... hide_gridlines=True, ... sheet_zoom=125, ... ) 
 - write_ipc(
- file: None,
- compression: IpcCompression = 'uncompressed',
- write_ipc(file: IOBase | str | Path, compression: IpcCompression = 'uncompressed') None
- Write to Arrow IPC binary stream or Feather file. - See “File or Random Access format” in https://arrow.apache.org/docs/python/ipc.html. - Parameters:
- file
- Path or writeable file-like object to which the IPC data will be written. If set to - None, the output is returned as a BytesIO object.
- compression{‘uncompressed’, ‘lz4’, ‘zstd’}
- Compression method. Defaults to “uncompressed”. 
 
- Examples - >>> import pathlib >>> >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 4, 5], ... "bar": [6, 7, 8, 9, 10], ... "ham": ["a", "b", "c", "d", "e"], ... } ... ) >>> dirpath = pathlib.Path(".") >>> path: pathlib.Path = dirpath / "new_file.arrow" >>> df.write_ipc(path)
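- When file is None the IPC data is returned in memory instead of written to disk; a brief round-trip sketch (assuming pl.read_ipc can consume the returned buffer):
>>> buf = df.write_ipc(None)
>>> buf.seek(0)
0
>>> pl.read_ipc(buf).shape
(5, 3)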
 - write_ipc_stream(
- file: None,
- compression: IpcCompression = 'uncompressed',
- write_ipc_stream(file: IOBase | str | Path, compression: IpcCompression = 'uncompressed') None
- Write to Arrow IPC record batch stream. - See “Streaming format” in https://arrow.apache.org/docs/python/ipc.html. - Parameters:
- file
- Path or writeable file-like object to which the IPC record batch data will be written. If set to - None, the output is returned as a BytesIO object.
- compression{‘uncompressed’, ‘lz4’, ‘zstd’}
- Compression method. Defaults to “uncompressed”. 
 
- Examples - >>> import pathlib >>> >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 4, 5], ... "bar": [6, 7, 8, 9, 10], ... "ham": ["a", "b", "c", "d", "e"], ... } ... ) >>> dirpath = pathlib.Path(".") >>> path: pathlib.Path = dirpath / "new_file.arrow" >>> df.write_ipc_stream(path)
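- As with write_ipc, passing None returns an in-memory buffer; a round-trip sketch, assuming pl.read_ipc_stream is available in your polars version:
>>> buf = df.write_ipc_stream(None)
>>> buf.seek(0)
0
>>> pl.read_ipc_stream(buf).shape
(5, 3)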
- write_json(file: None = None, *, pretty: bool = False, row_oriented: bool = False) str[source]
- write_json(file: IOBase | str | Path, *, pretty: bool = False, row_oriented: bool = False) None
- Serialize to JSON representation. - Parameters:
- file
- File path or writeable file-like object to which the result will be written. If set to - None(default), the output is returned as a string instead.
- pretty
- Pretty-print the serialized JSON output.
- row_oriented
- Write row-oriented JSON. This is slower, but it is the more common format.
 
- Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... } ... ) >>> df.write_json() '{"columns":[{"name":"foo","datatype":"Int64","bit_settings":"","values":[1,2,3]},{"name":"bar","datatype":"Int64","bit_settings":"","values":[6,7,8]}]}' >>> df.write_json(row_oriented=True) '[{"foo":1,"bar":6},{"foo":2,"bar":7},{"foo":3,"bar":8}]'
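- A possible round trip through the string output (assuming pl.read_json accepts a file-like source such as StringIO):
>>> from io import StringIO
>>> json_str = df.write_json()
>>> pl.read_json(StringIO(json_str)).shape
(3, 2)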
 - write_ndjson(file: None = None) str[source]
- write_ndjson(file: IOBase | str | Path) None
- Serialize to newline delimited JSON representation. - Parameters:
- file
- File path or writeable file-like object to which the result will be written. If set to - None(default), the output is returned as a string instead.
 
 - Examples - >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... } ... ) >>> df.write_ndjson() '{"foo":1,"bar":6}\n{"foo":2,"bar":7}\n{"foo":3,"bar":8}\n' 
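- Similarly, the newline-delimited output can be read back without touching disk (assuming pl.read_ndjson accepts a StringIO source):
>>> from io import StringIO
>>> ndjson_str = df.write_ndjson()
>>> pl.read_ndjson(StringIO(ndjson_str)).shape
(3, 2)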
 - write_parquet(
- file: str | Path | BytesIO,
- *,
- compression: ParquetCompression = 'zstd',
- compression_level: int | None = None,
- statistics: bool = False,
- row_group_size: int | None = None,
- data_page_size: int | None = None,
- use_pyarrow: bool = False,
- pyarrow_options: dict[str, Any] | None = None,
- Write to Apache Parquet file. - Parameters:
- file
- File path or writeable file-like object to which the result will be written. 
- compression{‘lz4’, ‘uncompressed’, ‘snappy’, ‘gzip’, ‘lzo’, ‘brotli’, ‘zstd’}
- Choose “zstd” for good compression performance. Choose “lz4” for fast compression/decompression. Choose “snappy” for better backwards-compatibility when dealing with older parquet readers.
- compression_level
- The level of compression to use. Higher compression means smaller files on disk. - “gzip” : min-level: 0, max-level: 10. 
- “brotli” : min-level: 0, max-level: 11. 
- “zstd” : min-level: 1, max-level: 22. 
 
- statistics
- Write statistics to the parquet headers. This requires extra compute. 
- row_group_size
- Size of the row groups in number of rows. Defaults to 512^2 rows. 
- data_page_size
- Size of the data page in bytes. Defaults to 1024^2 bytes. 
- use_pyarrow
- Use the C++ (pyarrow) parquet implementation instead of the Rust-native one. At the moment the C++ implementation supports more features.
- pyarrow_options
- Arguments passed to - pyarrow.parquet.write_table. - If you pass - partition_cols here, the dataset will be written using - pyarrow.parquet.write_to_dataset. The - partition_cols parameter causes the dataset to be written to a directory, similar to Spark’s partitioned datasets.
 
- Examples - >>> import pathlib >>> >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 4, 5], ... "bar": [6, 7, 8, 9, 10], ... "ham": ["a", "b", "c", "d", "e"], ... } ... ) >>> dirpath = pathlib.Path(".") >>> path: pathlib.Path = dirpath / "new_file.parquet" >>> df.write_parquet(path) - We can use pyarrow to write partitioned datasets by setting use_pyarrow=True and passing partition_cols via pyarrow_options. The following example will write the first row to a parquet file under .../watermark=1/ and the other rows under .../watermark=2/. - >>> df = pl.DataFrame({"a": [1, 2, 3], "watermark": [1, 2, 2]}) >>> path: pathlib.Path = dirpath / "partitioned_object" >>> df.write_parquet( ... path, ... use_pyarrow=True, ... pyarrow_options={"partition_cols": ["watermark"]}, ... )
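- A sketch combining several of the parameters above with an in-memory buffer (the particular compression level and row group size here are illustrative, not recommendations):
>>> import io
>>> buf = io.BytesIO()
>>> df.write_parquet(
...     buf,
...     compression="zstd",
...     compression_level=10,  # zstd levels range from 1 to 22
...     statistics=True,
...     row_group_size=100_000,
... )
>>> buf.seek(0)
0
>>> pl.read_parquet(buf).shape
(3, 2)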