DataFrame
This page gives an overview of all public DataFrame methods.
- class polars.DataFrame(
- data: FrameInitTypes | None = None,
- schema: SchemaDefinition | None = None,
- *,
- schema_overrides: SchemaDict | None = None,
- strict: bool = True,
- orient: Orientation | None = None,
- infer_schema_length: int | None = 100,
- nan_to_null: bool = False,
Two-dimensional data structure representing data as a table with rows and columns.
- Parameters:
- data : dict, Sequence, ndarray, Series, or pandas.DataFrame
Two-dimensional data in various forms; dict input must contain Sequences, Generators, or a range. Sequence may contain Series or other Sequences.
- schema : Sequence of str, (str,DataType) pairs, or a {str:DataType,} dict
The schema of the resulting DataFrame. The schema may be declared in several ways:
As a dict of {name:type} pairs; if type is None, it will be auto-inferred.
As a list of column names; in this case types are automatically inferred.
As a list of (name,type) pairs; this is equivalent to the dictionary form.
If you supply a list of column names that does not match the names in the underlying data, the names given here will overwrite them. The number of names given in the schema should match the underlying data dimensions.
If set to None (default), the schema is inferred from the data.
- schema_overrides : dict, default None
Support type specification or override of one or more columns; note that any dtypes inferred from the schema param will be overridden.
The number of entries in the schema should match the underlying data dimensions, unless a sequence of dictionaries is being passed, in which case a partial schema can be declared to prevent specific fields from being loaded.
- strict : bool, default True
Throw an error if any data value does not exactly match the given or inferred data type for that column. If set to False, values that do not match the data type are cast to that data type or, if casting is not possible, set to null instead.
- orient : {‘col’, ‘row’}, default None
Whether to interpret two-dimensional data as columns or as rows. If None, the orientation is inferred by matching the columns and data dimensions. If this does not yield conclusive results, column orientation is used.
- infer_schema_length : int or None
The maximum number of rows to scan for schema inference. If set to None, the full data may be scanned (this can be slow). This parameter only applies if the input data is a sequence or generator of rows; other input is read as-is.
- nan_to_null : bool, default False
If the data comes from one or more numpy arrays, optionally convert np.nan values in the input to null. This is a no-op for all other input data.
Notes
Polars explicitly does not support subclassing of its core data types. See the following GitHub issue for possible workarounds: pola-rs/polars#2846
Examples
Constructing a DataFrame from a dictionary:
>>> data = {"a": [1, 2], "b": [3, 4]}
>>> df = pl.DataFrame(data)
>>> df
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘
Notice that the dtypes are automatically inferred as polars Int64:
>>> df.dtypes
[Int64, Int64]
To specify a more detailed/specific frame schema you can supply the schema parameter with a dictionary of (name,dtype) pairs…
>>> data = {"col1": [0, 2], "col2": [3, 7]}
>>> df2 = pl.DataFrame(data, schema={"col1": pl.Float32, "col2": pl.Int64})
>>> df2
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f32  ┆ i64  │
╞══════╪══════╡
│ 0.0  ┆ 3    │
│ 2.0  ┆ 7    │
└──────┴──────┘
…a sequence of (name,dtype) pairs…
>>> data = {"col1": [1, 2], "col2": [3, 4]}
>>> df3 = pl.DataFrame(data, schema=[("col1", pl.Float32), ("col2", pl.Int64)])
>>> df3
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f32  ┆ i64  │
╞══════╪══════╡
│ 1.0  ┆ 3    │
│ 2.0  ┆ 4    │
└──────┴──────┘
…or a list of typed Series.
>>> data = [
...     pl.Series("col1", [1, 2], dtype=pl.Float32),
...     pl.Series("col2", [3, 4], dtype=pl.Int64),
... ]
>>> df4 = pl.DataFrame(data)
>>> df4
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f32  ┆ i64  │
╞══════╪══════╡
│ 1.0  ┆ 3    │
│ 2.0  ┆ 4    │
└──────┴──────┘
Constructing a DataFrame from a numpy ndarray, specifying column names:
>>> import numpy as np
>>> data = np.array([(1, 2), (3, 4)], dtype=np.int64)
>>> df5 = pl.DataFrame(data, schema=["a", "b"], orient="col")
>>> df5
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘
Constructing a DataFrame from a list of lists, row orientation specified:
>>> data = [[1, 2, 3], [4, 5, 6]]
>>> df6 = pl.DataFrame(data, schema=["a", "b", "c"], orient="row")
>>> df6
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
│ 4   ┆ 5   ┆ 6   │
└─────┴─────┴─────┘
Methods:
Approximate count of unique values.
Return the k smallest rows.
Cast DataFrame column(s) to the specified dtype(s).
Create an empty (n=0) or n-row null-filled (n>0) copy of the DataFrame.
Create a copy of this DataFrame.
Get an ordered mapping of column names to their data type.
Return pairwise Pearson product-moment correlation coefficients between columns.
Return the number of non-null elements for each column.
Summary statistics for a DataFrame.
Read a serialized DataFrame from a file.
Remove columns from the dataframe.
Drop a single column in-place and return the dropped column.
Drop all rows that contain one or more NaN values.
Drop all rows that contain null values.
Check whether the DataFrame is equal to another DataFrame.
Return an estimation of the total (heap) allocated size of the DataFrame.
Explode the dataframe to long format by exploding the given columns.
Extend the memory backed by this DataFrame with the values from other.
Fill floating point NaN values by an Expression evaluation.
Fill null values using the specified value or strategy.
Filter the rows in the DataFrame based on one or more predicate expressions.
Apply a horizontal reduction on a DataFrame.
Take every nth row in the DataFrame and return as a new DataFrame.
Get a single column by name.
Find the index of a column by name.
Get the DataFrame as a List of Series.
Return a dense preview of the DataFrame.
Start a group by operation.
Group based on a time value (or index value of type Int32, Int64).
Hash and combine the rows in this DataFrame.
Get the first n rows.
Return a new DataFrame grown horizontally by stacking multiple Series to it.
Insert a Series at a certain column index.
Interpolate intermediate values.
Get a mask of all duplicated rows in this DataFrame.
Returns True if the DataFrame contains no rows.
Get a mask of all unique rows in this DataFrame.
Return the DataFrame as a scalar, or return the element at the given row/column.
Returns an iterator over the columns of this DataFrame.
Returns an iterator over the DataFrame of rows of python-native values.
Returns a non-copying iterator of slices over the underlying DataFrame.
Join in SQL-like fashion.
Perform an asof join.
Perform a join based on one or multiple (in)equality predicates.
Start a lazy query from this point.
Get the first n rows.
Apply a custom/user-defined function (UDF) over the rows of the DataFrame.
Aggregate the columns of this DataFrame to their maximum value.
Get the maximum value horizontally across columns.
Aggregate the columns of this DataFrame to their mean value.
Take the mean of all values horizontally across columns.
Aggregate the columns of this DataFrame to their median value.
Unpivot a DataFrame from wide to long format.
Take two sorted DataFrames and merge them by the sorted key.
Aggregate the columns of this DataFrame to their minimum value.
Get the minimum value horizontally across columns.
Get number of chunks used by the ChunkedArrays of this DataFrame.
Return the number of unique rows, or the number of unique row-subsets.
Create a new DataFrame that shows the null counts per column.
Group by the given columns and return the groups as separate dataframes.
Offers a structured way to apply a sequence of user-defined functions (UDFs).
Create a spreadsheet-style pivot table as a DataFrame.
Aggregate the columns of this DataFrame to their product values.
Aggregate the columns of this DataFrame to their quantile value.
Rechunk the data in this DataFrame to a contiguous allocation.
Rename column names.
Replace a column at an index location.
Reverse the DataFrame.
Create rolling groups based on a temporal or integer column.
Get the values of a single row, either by index or by predicate.
Returns all data in the DataFrame as a list of rows of python-native values.
Returns all data as a dictionary of python-native values keyed by some column.
Sample from this DataFrame.
Select columns from this DataFrame.
Select columns from this DataFrame.
Serialize this DataFrame to a file or string in JSON format.
Indicate that one or multiple columns are sorted.
Shift values by the given number of indices.
Shrink DataFrame memory usage.
Get a slice of this DataFrame.
Sort the dataframe by the given columns.
Execute a SQL query against the DataFrame.
Aggregate the columns of this DataFrame to their standard deviation value.
Aggregate the columns of this DataFrame to their sum value.
Sum all values horizontally across columns.
Get the last n rows.
Collect the underlying arrow arrays in an Arrow Table.
Convert DataFrame to a dictionary mapping column name to values.
Convert every row to a dictionary of Python-native values.
Convert categorical variables into dummy/indicator variables.
Convert DataFrame to instantiable string representation.
Convert DataFrame to a Jax Array, or dict of Jax Arrays.
Convert this DataFrame to a NumPy ndarray.
Convert this DataFrame to a pandas DataFrame.
Select column as Series at index location.
Convert a DataFrame to a Series of type Struct.
Convert DataFrame to a PyTorch Tensor, Dataset, or dict of Tensors.
Return the k largest rows.
Transpose a DataFrame over the diagonal.
Drop duplicate rows from this dataframe.
Decompose struct columns into separate columns for each of their fields.
Unpivot a DataFrame from wide to long format.
Unstack a long table to a wide form without doing an aggregation.
Update the values in this DataFrame with the values in other.
Upsample a DataFrame at a regular frequency.
Aggregate the columns of this DataFrame to their variance value.
Grow this DataFrame vertically by stacking a DataFrame to it.
Add columns to this DataFrame.
Add columns to this DataFrame.
Add a column at index 0 that counts the rows.
Add a row index as the first column in the DataFrame.
Write to Apache Avro file.
Copy DataFrame in csv format to the system clipboard with write_csv.
Write to comma-separated values (CSV) file.
Write the data in a Polars DataFrame to a database.
Write DataFrame as delta table.
Write frame data to a table in an Excel workbook/worksheet.
Write to Arrow IPC binary stream or Feather file.
Write to Arrow IPC record batch stream.
Serialize to JSON representation.
Serialize to newline delimited JSON representation.
Write to Apache Parquet file.
Attributes:
Get or set column names.
Get the column data types.
Get flags that are set on the columns of this DataFrame.
Get the number of rows.
Create a plot namespace.
Get an ordered mapping of column names to their data type.
Get the shape of the DataFrame.
Create a Great Table for styling.
Get the number of columns.
- approx_n_unique() DataFrame [source]
Approximate count of unique values.
Deprecated since version 0.20.11: Use select(pl.all().approx_n_unique()) instead.
This is done using the HyperLogLog++ algorithm for cardinality estimation.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> df.approx_n_unique()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 4   ┆ 2   │
└─────┴─────┘
- bottom_k( ) DataFrame [source]
Return the k smallest rows.
Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort() after this function if you wish the output to be sorted.
- Parameters:
- k
Number of rows to return.
- by
Column(s) used to determine the bottom rows. Accepts expression input. Strings are parsed as column names.
- reverse
Consider the k largest elements of the by column(s) (instead of the k smallest). This can be specified per column by passing a sequence of booleans.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [2, 1, 1, 3, 2, 1],
...     }
... )
Get the rows which contain the 4 smallest values in column b.
>>> df.bottom_k(4, by="b")
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ b   ┆ 1   │
│ a   ┆ 1   │
│ c   ┆ 1   │
│ a   ┆ 2   │
└─────┴─────┘
Get the rows which contain the 4 smallest values when sorting on column a and b.
>>> df.bottom_k(4, by=["a", "b"])
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 1   │
│ a   ┆ 2   │
│ b   ┆ 1   │
│ b   ┆ 2   │
└─────┴─────┘
- cast(
- dtypes: Mapping[ColumnNameOrSelector | PolarsDataType, PolarsDataType | PythonDataType] | PolarsDataType,
- *,
- strict: bool = True,
Cast DataFrame column(s) to the specified dtype(s).
- Parameters:
- dtypes
Mapping of column names (or selector) to dtypes, or a single dtype to which all columns will be cast.
- strict
Raise if cast is invalid on rows after predicates are pushed down. If False, invalid casts will produce null values.
Examples
>>> from datetime import date
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": [date(2020, 1, 2), date(2021, 3, 4), date(2022, 5, 6)],
...     }
... )
Cast specific frame columns to the specified dtypes:
>>> df.cast({"foo": pl.Float32, "bar": pl.UInt8})
shape: (3, 3)
┌─────┬─────┬────────────┐
│ foo ┆ bar ┆ ham        │
│ --- ┆ --- ┆ ---        │
│ f32 ┆ u8  ┆ date       │
╞═════╪═════╪════════════╡
│ 1.0 ┆ 6   ┆ 2020-01-02 │
│ 2.0 ┆ 7   ┆ 2021-03-04 │
│ 3.0 ┆ 8   ┆ 2022-05-06 │
└─────┴─────┴────────────┘
Cast all frame columns matching one dtype (or dtype group) to another dtype:
>>> df.cast({pl.Date: pl.Datetime})
shape: (3, 3)
┌─────┬─────┬─────────────────────┐
│ foo ┆ bar ┆ ham                 │
│ --- ┆ --- ┆ ---                 │
│ i64 ┆ f64 ┆ datetime[μs]        │
╞═════╪═════╪═════════════════════╡
│ 1   ┆ 6.0 ┆ 2020-01-02 00:00:00 │
│ 2   ┆ 7.0 ┆ 2021-03-04 00:00:00 │
│ 3   ┆ 8.0 ┆ 2022-05-06 00:00:00 │
└─────┴─────┴─────────────────────┘
Use selectors to define the columns being cast:
>>> import polars.selectors as cs
>>> df.cast({cs.numeric(): pl.UInt32, cs.temporal(): pl.String})
shape: (3, 3)
┌─────┬─────┬────────────┐
│ foo ┆ bar ┆ ham        │
│ --- ┆ --- ┆ ---        │
│ u32 ┆ u32 ┆ str        │
╞═════╪═════╪════════════╡
│ 1   ┆ 6   ┆ 2020-01-02 │
│ 2   ┆ 7   ┆ 2021-03-04 │
│ 3   ┆ 8   ┆ 2022-05-06 │
└─────┴─────┴────────────┘
Cast all frame columns to the specified dtype:
>>> df.cast(pl.String).to_dict(as_series=False)
{'foo': ['1', '2', '3'],
 'bar': ['6.0', '7.0', '8.0'],
 'ham': ['2020-01-02', '2021-03-04', '2022-05-06']}
- clear(n: int = 0) DataFrame [source]
Create an empty (n=0) or n-row null-filled (n>0) copy of the DataFrame.
Returns an n-row null-filled DataFrame with an identical schema. n can be greater than the current number of rows in the DataFrame.
- Parameters:
- n
Number of (null-filled) rows to return in the cleared frame.
See also
clone
Cheap deepcopy/clone.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [None, 2, 3, 4],
...         "b": [0.5, None, 2.5, 13],
...         "c": [True, True, False, None],
...     }
... )
>>> df.clear()
shape: (0, 3)
┌─────┬─────┬──────┐
│ a   ┆ b   ┆ c    │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ f64 ┆ bool │
╞═════╪═════╪══════╡
└─────┴─────┴──────┘
>>> df.clear(n=2)
shape: (2, 3)
┌──────┬──────┬──────┐
│ a    ┆ b    ┆ c    │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ f64  ┆ bool │
╞══════╪══════╪══════╡
│ null ┆ null ┆ null │
│ null ┆ null ┆ null │
└──────┴──────┴──────┘
- clone() DataFrame [source]
Create a copy of this DataFrame.
This is a cheap operation that does not copy data.
See also
clear
Create an empty copy of the current DataFrame, with identical schema but no data.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [0.5, 4, 10, 13],
...         "c": [True, True, False, True],
...     }
... )
>>> df.clone()
shape: (4, 3)
┌─────┬──────┬───────┐
│ a   ┆ b    ┆ c     │
│ --- ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  │
│ 2   ┆ 4.0  ┆ true  │
│ 3   ┆ 10.0 ┆ false │
│ 4   ┆ 13.0 ┆ true  │
└─────┴──────┴───────┘
- collect_schema() Schema [source]
Get an ordered mapping of column names to their data type.
This is an alias for the schema property.
Notes
This method is included to facilitate writing code that is generic for both DataFrame and LazyFrame.
Examples
Determine the schema.
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.collect_schema()
Schema({'foo': Int64, 'bar': Float64, 'ham': String})
Access various properties of the schema using the Schema object.
>>> schema = df.collect_schema()
>>> schema["bar"]
Float64
>>> schema.names()
['foo', 'bar', 'ham']
>>> schema.dtypes()
[Int64, Float64, String]
>>> schema.len()
3
- property columns: list[str][source]
Get or set column names.
- Returns:
- list of str
A list containing the name of each column in order.
Examples
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.columns
['foo', 'bar', 'ham']
Set column names:
>>> df.columns = ["apple", "banana", "orange"]
>>> df
shape: (3, 3)
┌───────┬────────┬────────┐
│ apple ┆ banana ┆ orange │
│ ---   ┆ ---    ┆ ---    │
│ i64   ┆ i64    ┆ str    │
╞═══════╪════════╪════════╡
│ 1     ┆ 6      ┆ a      │
│ 2     ┆ 7      ┆ b      │
│ 3     ┆ 8      ┆ c      │
└───────┴────────┴────────┘
- corr(**kwargs: Any) DataFrame [source]
Return pairwise Pearson product-moment correlation coefficients between columns.
See numpy corrcoef for more information: https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html
- Parameters:
- **kwargs
Keyword arguments are passed to numpy corrcoef.
Notes
This functionality requires numpy to be installed.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [3, 2, 1], "ham": [7, 8, 9]})
>>> df.corr()
shape: (3, 3)
┌──────┬──────┬──────┐
│ foo  ┆ bar  ┆ ham  │
│ ---  ┆ ---  ┆ ---  │
│ f64  ┆ f64  ┆ f64  │
╞══════╪══════╪══════╡
│ 1.0  ┆ -1.0 ┆ 1.0  │
│ -1.0 ┆ 1.0  ┆ -1.0 │
│ 1.0  ┆ -1.0 ┆ 1.0  │
└──────┴──────┴──────┘
- count() DataFrame [source]
Return the number of non-null elements for each column.
Examples
>>> df = pl.DataFrame(
...     {"a": [1, 2, 3, 4], "b": [1, 2, 1, None], "c": [None, None, None, None]}
... )
>>> df.count()
shape: (1, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 │
╞═════╪═════╪═════╡
│ 4   ┆ 3   ┆ 0   │
└─────┴─────┴─────┘
- describe(
- percentiles: Sequence[float] | float | None = (0.25, 0.5, 0.75),
- *,
- interpolation: RollingInterpolationMethod = 'nearest',
Summary statistics for a DataFrame.
- Parameters:
- percentiles
One or more percentiles to include in the summary statistics. All values must be in the range [0, 1].
- interpolation : {‘nearest’, ‘higher’, ‘lower’, ‘midpoint’, ‘linear’}
Interpolation method used when calculating percentiles.
Warning
We do not guarantee the output of describe to be stable. It will show statistics that we deem informative, and may be updated in the future. Using describe programmatically (versus interactive exploration) is not recommended for this reason.
Notes
The median is included by default as the 50% percentile.
Examples
>>> from datetime import date, time
>>> df = pl.DataFrame(
...     {
...         "float": [1.0, 2.8, 3.0],
...         "int": [40, 50, None],
...         "bool": [True, False, True],
...         "str": ["zz", "xx", "yy"],
...         "date": [date(2020, 1, 1), date(2021, 7, 5), date(2022, 12, 31)],
...         "time": [time(10, 20, 30), time(14, 45, 50), time(23, 15, 10)],
...     }
... )
Show default frame statistics:
>>> df.describe()
shape: (9, 7)
┌────────────┬──────────┬──────────┬──────────┬──────┬─────────────────────┬──────────┐
│ statistic  ┆ float    ┆ int      ┆ bool     ┆ str  ┆ date                ┆ time     │
│ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---  ┆ ---                 ┆ ---      │
│ str        ┆ f64      ┆ f64      ┆ f64      ┆ str  ┆ str                 ┆ str      │
╞════════════╪══════════╪══════════╪══════════╪══════╪═════════════════════╪══════════╡
│ count      ┆ 3.0      ┆ 2.0      ┆ 3.0      ┆ 3    ┆ 3                   ┆ 3        │
│ null_count ┆ 0.0      ┆ 1.0      ┆ 0.0      ┆ 0    ┆ 0                   ┆ 0        │
│ mean       ┆ 2.266667 ┆ 45.0     ┆ 0.666667 ┆ null ┆ 2021-07-02 16:00:00 ┆ 16:07:10 │
│ std        ┆ 1.101514 ┆ 7.071068 ┆ null     ┆ null ┆ null                ┆ null     │
│ min        ┆ 1.0      ┆ 40.0     ┆ 0.0      ┆ xx   ┆ 2020-01-01          ┆ 10:20:30 │
│ 25%        ┆ 2.8      ┆ 40.0     ┆ null     ┆ null ┆ 2021-07-05          ┆ 14:45:50 │
│ 50%        ┆ 2.8      ┆ 50.0     ┆ null     ┆ null ┆ 2021-07-05          ┆ 14:45:50 │
│ 75%        ┆ 3.0      ┆ 50.0     ┆ null     ┆ null ┆ 2022-12-31          ┆ 23:15:10 │
│ max        ┆ 3.0      ┆ 50.0     ┆ 1.0      ┆ zz   ┆ 2022-12-31          ┆ 23:15:10 │
└────────────┴──────────┴──────────┴──────────┴──────┴─────────────────────┴──────────┘
Customize which percentiles are displayed, applying linear interpolation:
>>> with pl.Config(tbl_rows=12):
...     df.describe(
...         percentiles=[0.1, 0.3, 0.5, 0.7, 0.9],
...         interpolation="linear",
...     )
shape: (11, 7)
┌────────────┬──────────┬──────────┬──────────┬──────┬─────────────────────┬──────────┐
│ statistic  ┆ float    ┆ int      ┆ bool     ┆ str  ┆ date                ┆ time     │
│ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---  ┆ ---                 ┆ ---      │
│ str        ┆ f64      ┆ f64      ┆ f64      ┆ str  ┆ str                 ┆ str      │
╞════════════╪══════════╪══════════╪══════════╪══════╪═════════════════════╪══════════╡
│ count      ┆ 3.0      ┆ 2.0      ┆ 3.0      ┆ 3    ┆ 3                   ┆ 3        │
│ null_count ┆ 0.0      ┆ 1.0      ┆ 0.0      ┆ 0    ┆ 0                   ┆ 0        │
│ mean       ┆ 2.266667 ┆ 45.0     ┆ 0.666667 ┆ null ┆ 2021-07-02 16:00:00 ┆ 16:07:10 │
│ std        ┆ 1.101514 ┆ 7.071068 ┆ null     ┆ null ┆ null                ┆ null     │
│ min        ┆ 1.0      ┆ 40.0     ┆ 0.0      ┆ xx   ┆ 2020-01-01          ┆ 10:20:30 │
│ 10%        ┆ 1.36     ┆ 41.0     ┆ null     ┆ null ┆ 2020-04-20          ┆ 11:13:34 │
│ 30%        ┆ 2.08     ┆ 43.0     ┆ null     ┆ null ┆ 2020-11-26          ┆ 12:59:42 │
│ 50%        ┆ 2.8      ┆ 45.0     ┆ null     ┆ null ┆ 2021-07-05          ┆ 14:45:50 │
│ 70%        ┆ 2.88     ┆ 47.0     ┆ null     ┆ null ┆ 2022-02-07          ┆ 18:09:34 │
│ 90%        ┆ 2.96     ┆ 49.0     ┆ null     ┆ null ┆ 2022-09-13          ┆ 21:33:18 │
│ max        ┆ 3.0      ┆ 50.0     ┆ 1.0      ┆ zz   ┆ 2022-12-31          ┆ 23:15:10 │
└────────────┴──────────┴──────────┴──────────┴──────┴─────────────────────┴──────────┘
- classmethod deserialize(
- source: str | Path | IOBase,
- *,
- format: SerializationFormat = 'binary',
Read a serialized DataFrame from a file.
- Parameters:
- source
Path to a file or a file-like object (by file-like object, we refer to objects that have a read() method, such as a file handler (e.g. via builtin open function) or BytesIO).
- format
The format with which the DataFrame was serialized. Options:
"binary": Deserialize from binary format (bytes). This is the default.
"json": Deserialize from JSON format (string).
Notes
Serialization is not stable across Polars versions: a DataFrame serialized in one Polars version may not be deserializable in another Polars version.
Examples
>>> import io
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
>>> bytes = df.serialize()
>>> pl.DataFrame.deserialize(io.BytesIO(bytes))
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 4.0 │
│ 2   ┆ 5.0 │
│ 3   ┆ 6.0 │
└─────┴─────┘
- drop(
- *columns: ColumnNameOrSelector | Iterable[ColumnNameOrSelector],
- strict: bool = True,
Remove columns from the dataframe.
- Parameters:
- *columns
Names of the columns that should be removed from the dataframe. Accepts column selector input.
- strict
Validate that all column names exist in the current schema, and throw an exception if any do not.
Examples
Drop a single column by passing the name of that column.
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.drop("ham")
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 6.0 │
│ 2   ┆ 7.0 │
│ 3   ┆ 8.0 │
└─────┴─────┘
Drop multiple columns by passing a list of column names.
>>> df.drop(["bar", "ham"])
shape: (3, 1)
┌─────┐
│ foo │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘
Drop multiple columns by passing a selector.
>>> import polars.selectors as cs
>>> df.drop(cs.numeric())
shape: (3, 1)
┌─────┐
│ ham │
│ --- │
│ str │
╞═════╡
│ a   │
│ b   │
│ c   │
└─────┘
Use positional arguments to drop multiple columns.
>>> df.drop("foo", "ham")
shape: (3, 1)
┌─────┐
│ bar │
│ --- │
│ f64 │
╞═════╡
│ 6.0 │
│ 7.0 │
│ 8.0 │
└─────┘
- drop_in_place(name: str) Series [source]
Drop a single column in-place and return the dropped column.
- Parameters:
- name
Name of the column to drop.
- Returns:
- Series
The dropped column.
Examples
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.drop_in_place("ham")
shape: (3,)
Series: 'ham' [str]
[
    "a"
    "b"
    "c"
]
- drop_nans(
- subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None,
Drop all rows that contain one or more NaN values.
The original order of the remaining rows is preserved.
- Parameters:
- subset
Column name(s) for which NaN values are considered; if set to None (default), use all columns (note that only floating-point columns can contain NaNs).
Examples
>>> df = pl.DataFrame(
...     {
...         "foo": [-20.5, float("nan"), 80.0],
...         "bar": [float("nan"), 110.0, 25.5],
...         "ham": ["xxx", "yyy", None],
...     }
... )
The default behavior of this method is to drop rows where any single value in the row is NaN:
>>> df.drop_nans()
shape: (1, 3)
┌──────┬──────┬──────┐
│ foo  ┆ bar  ┆ ham  │
│ ---  ┆ ---  ┆ ---  │
│ f64  ┆ f64  ┆ str  │
╞══════╪══════╪══════╡
│ 80.0 ┆ 25.5 ┆ null │
└──────┴──────┴──────┘
This behaviour can be constrained to consider only a subset of columns, as defined by name, or with a selector. For example, dropping rows only if there is a NaN in the “bar” column:
>>> df.drop_nans(subset=["bar"])
shape: (2, 3)
┌──────┬───────┬──────┐
│ foo  ┆ bar   ┆ ham  │
│ ---  ┆ ---   ┆ ---  │
│ f64  ┆ f64   ┆ str  │
╞══════╪═══════╪══════╡
│ NaN  ┆ 110.0 ┆ yyy  │
│ 80.0 ┆ 25.5  ┆ null │
└──────┴───────┴──────┘
Dropping a row only if all values are NaN requires a different formulation:
>>> df = pl.DataFrame(
...     {
...         "a": [float("nan"), float("nan"), float("nan"), float("nan")],
...         "b": [10.0, 2.5, float("nan"), 5.25],
...         "c": [65.75, float("nan"), float("nan"), 10.5],
...     }
... )
>>> df.filter(~pl.all_horizontal(pl.all().is_nan()))
shape: (3, 3)
┌─────┬──────┬───────┐
│ a   ┆ b    ┆ c     │
│ --- ┆ ---  ┆ ---   │
│ f64 ┆ f64  ┆ f64   │
╞═════╪══════╪═══════╡
│ NaN ┆ 10.0 ┆ 65.75 │
│ NaN ┆ 2.5  ┆ NaN   │
│ NaN ┆ 5.25 ┆ 10.5  │
└─────┴──────┴───────┘
- drop_nulls(
- subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None,
Drop all rows that contain null values.
The original order of the remaining rows is preserved.
- Parameters:
- subset
Column name(s) for which null values are considered. If set to None (default), use all columns.
Examples
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, None, 8],
...         "ham": ["a", "b", None],
...     }
... )
The default behavior of this method is to drop rows where any single value of the row is null.
>>> df.drop_nulls()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘
This behaviour can be constrained to consider only a subset of columns, as defined by name or with a selector. For example, dropping rows if there is a null in any of the integer columns:
>>> import polars.selectors as cs
>>> df.drop_nulls(subset=cs.integer())
shape: (2, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ ham  │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ i64 ┆ str  │
╞═════╪═════╪══════╡
│ 1   ┆ 6   ┆ a    │
│ 3   ┆ 8   ┆ null │
└─────┴─────┴──────┘
Below are some additional examples that show how to drop null values based on other conditions.
>>> df = pl.DataFrame(
...     {
...         "a": [None, None, None, None],
...         "b": [1, 2, None, 1],
...         "c": [1, None, None, 1],
...     }
... )
>>> df
shape: (4, 3)
┌──────┬──────┬──────┐
│ a    ┆ b    ┆ c    │
│ ---  ┆ ---  ┆ ---  │
│ null ┆ i64  ┆ i64  │
╞══════╪══════╪══════╡
│ null ┆ 1    ┆ 1    │
│ null ┆ 2    ┆ null │
│ null ┆ null ┆ null │
│ null ┆ 1    ┆ 1    │
└──────┴──────┴──────┘
Drop a row only if all values are null:
>>> df.filter(~pl.all_horizontal(pl.all().is_null()))
shape: (3, 3)
┌──────┬─────┬──────┐
│ a    ┆ b   ┆ c    │
│ ---  ┆ --- ┆ ---  │
│ null ┆ i64 ┆ i64  │
╞══════╪═════╪══════╡
│ null ┆ 1   ┆ 1    │
│ null ┆ 2   ┆ null │
│ null ┆ 1   ┆ 1    │
└──────┴─────┴──────┘
Drop a column if all values are null:
>>> df[[s.name for s in df if not (s.null_count() == df.height)]]
shape: (4, 2)
┌──────┬──────┐
│ b    ┆ c    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 1    │
│ 2    ┆ null │
│ null ┆ null │
│ 1    ┆ 1    │
└──────┴──────┘
- property dtypes: list[DataType][source]
Get the column data types.
The data types can also be found in column headers when printing the DataFrame.
- Returns:
- list of DataType
A list containing the data type of each column in order.
Examples
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.dtypes
[Int64, Float64, String]
>>> df
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6.0 ┆ a   │
│ 2   ┆ 7.0 ┆ b   │
│ 3   ┆ 8.0 ┆ c   │
└─────┴─────┴─────┘
- equals(
- other: DataFrame,
- *,
- null_equal: bool = True,
Check whether the DataFrame is equal to another DataFrame.
- Parameters:
- other
DataFrame to compare with.
- null_equal
Consider null values as equal.
Examples
>>> df1 = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df2 = pl.DataFrame(
...     {
...         "foo": [3, 2, 1],
...         "bar": [8.0, 7.0, 6.0],
...         "ham": ["c", "b", "a"],
...     }
... )
>>> df1.equals(df1)
True
>>> df1.equals(df2)
False
- estimated_size(unit: SizeUnit = 'b') int | float [source]
Return an estimation of the total (heap) allocated size of the DataFrame.
Estimated size is given in the specified unit (bytes by default).
This estimation is the sum of the sizes of its buffers and validity bitmaps, including nested arrays. Multiple arrays may share buffers and bitmaps. Therefore, the size of 2 arrays is not the sum of the sizes computed from this function. In particular, the size of a StructArray is an upper bound.
When an array is sliced, its allocated size remains constant because the buffer is unchanged. However, this function will yield a smaller number. This is because this function returns the visible size of the buffer, not its total capacity.
FFI buffers are included in this estimation.
- Parameters:
- unit : {‘b’, ‘kb’, ‘mb’, ‘gb’, ‘tb’}
Scale the returned size to the given unit.
Examples
>>> df = pl.DataFrame(
...     {
...         "x": list(reversed(range(1_000_000))),
...         "y": [v / 1000 for v in range(1_000_000)],
...         "z": [str(v) for v in range(1_000_000)],
...     },
...     schema=[("x", pl.UInt32), ("y", pl.Float64), ("z", pl.String)],
... )
>>> df.estimated_size()
17888890
>>> df.estimated_size("mb")
17.0601749420166
- explode( ) DataFrame [source]
Explode the dataframe to long format by exploding the given columns.
- Parameters:
- columns
Column names, expressions, or a selector defining them. The underlying columns being exploded must be of the List or Array data type.
- *more_columns
Additional names of columns to explode, specified as positional arguments.
- Returns:
- DataFrame
Examples
>>> df = pl.DataFrame(
...     {
...         "letters": ["a", "a", "b", "c"],
...         "numbers": [[1], [2, 3], [4, 5], [6, 7, 8]],
...     }
... )
>>> df
shape: (4, 2)
┌─────────┬───────────┐
│ letters ┆ numbers   │
│ ---     ┆ ---       │
│ str     ┆ list[i64] │
╞═════════╪═══════════╡
│ a       ┆ [1]       │
│ a       ┆ [2, 3]    │
│ b       ┆ [4, 5]    │
│ c       ┆ [6, 7, 8] │
└─────────┴───────────┘
>>> df.explode("numbers")
shape: (8, 2)
┌─────────┬─────────┐
│ letters ┆ numbers │
│ ---     ┆ ---     │
│ str     ┆ i64     │
╞═════════╪═════════╡
│ a       ┆ 1       │
│ a       ┆ 2       │
│ a       ┆ 3       │
│ b       ┆ 4       │
│ b       ┆ 5       │
│ c       ┆ 6       │
│ c       ┆ 7       │
│ c       ┆ 8       │
└─────────┴─────────┘
- extend(
- other: DataFrame,
Extend the memory backed by this DataFrame with the values from other.
Unlike vstack, which adds the chunks from other to the chunks of this DataFrame, extend appends the data from other to the underlying memory locations and thus may cause a reallocation.
If this does not cause a reallocation, the resulting data structure will not have any extra chunks and thus will yield faster queries.
Prefer extend over vstack when you want to do a query after a single append. For instance, during online operations where you add n rows and rerun a query.
Prefer vstack over extend when you want to append many times before doing a query. For instance, when you read in multiple files and want to store them in a single DataFrame. In the latter case, finish the sequence of vstack operations with a rechunk.
- Parameters:
- other
DataFrame to vertically add.
Warning
This method modifies the dataframe in-place. The dataframe is returned for convenience only.
See also
Examples
>>> df1 = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> df2 = pl.DataFrame({"foo": [10, 20, 30], "bar": [40, 50, 60]}) >>> df1.extend(df2) shape: (6, 2) ┌─────┬─────┐ │ foo ┆ bar │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 4 │ │ 2 ┆ 5 │ │ 3 ┆ 6 │ │ 10 ┆ 40 │ │ 20 ┆ 50 │ │ 30 ┆ 60 │ └─────┴─────┘
- fill_nan(value: Expr | int | float | None) DataFrame [source]
Fill floating point NaN values by an Expression evaluation.
- Parameters:
- value
Value with which to replace NaN values.
- Returns:
- DataFrame
DataFrame with NaN values replaced by the given value.
Warning
Note that floating point NaNs (Not a Number) are not missing values. To replace missing values, use fill_null().
See also
Examples
>>> df = pl.DataFrame( ... { ... "a": [1.5, 2, float("nan"), 4], ... "b": [0.5, 4, float("nan"), 13], ... } ... ) >>> df.fill_nan(99) shape: (4, 2) ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════╪══════╡ │ 1.5 ┆ 0.5 │ │ 2.0 ┆ 4.0 │ │ 99.0 ┆ 99.0 │ │ 4.0 ┆ 13.0 │ └──────┴──────┘
- fill_null(
- value: Any | Expr | None = None,
- strategy: FillNullStrategy | None = None,
- limit: int | None = None,
- *,
- matches_supertype: bool = True,
Fill null values using the specified value or strategy.
- Parameters:
- value
Value used to fill null values.
- strategy{None, ‘forward’, ‘backward’, ‘min’, ‘max’, ‘mean’, ‘zero’, ‘one’}
Strategy used to fill null values.
- limit
Number of consecutive null values to fill when using the ‘forward’ or ‘backward’ strategy.
- matches_supertype
Fill all matching supertypes of the fill value.
- Returns:
- DataFrame
DataFrame with null values replaced by the given value or filling strategy.
See also
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, None, 4], ... "b": [0.5, 4, None, 13], ... } ... ) >>> df.fill_null(99) shape: (4, 2) ┌─────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪══════╡ │ 1 ┆ 0.5 │ │ 2 ┆ 4.0 │ │ 99 ┆ 99.0 │ │ 4 ┆ 13.0 │ └─────┴──────┘ >>> df.fill_null(strategy="forward") shape: (4, 2) ┌─────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪══════╡ │ 1 ┆ 0.5 │ │ 2 ┆ 4.0 │ │ 2 ┆ 4.0 │ │ 4 ┆ 13.0 │ └─────┴──────┘
>>> df.fill_null(strategy="max") shape: (4, 2) ┌─────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪══════╡ │ 1 ┆ 0.5 │ │ 2 ┆ 4.0 │ │ 4 ┆ 13.0 │ │ 4 ┆ 13.0 │ └─────┴──────┘
>>> df.fill_null(strategy="zero") shape: (4, 2) ┌─────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪══════╡ │ 1 ┆ 0.5 │ │ 2 ┆ 4.0 │ │ 0 ┆ 0.0 │ │ 4 ┆ 13.0 │ └─────┴──────┘
- filter(
- *predicates: IntoExprColumn | Iterable[IntoExprColumn] | bool | list[bool] | np.ndarray[Any, Any],
- **constraints: Any,
Filter the rows in the DataFrame based on one or more predicate expressions.
The original order of the remaining rows is preserved.
Rows where the filter does not evaluate to True are discarded, including nulls.
- Parameters:
- predicates
Expression(s) that evaluate to a boolean Series.
- constraints
Column filters; use name = value to filter columns by the supplied value. Each constraint will behave the same as pl.col(name).eq(value), and will be implicitly joined with the other filter conditions using &.
Notes
If you are transitioning from pandas and performing filter operations based on the comparison of two or more columns, please note that in Polars, any comparison involving null values will always result in null. As a result, these rows will be filtered out. Be sure to handle null values appropriately to avoid unintended filtering (see the examples below).
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, None, 4, None, 0], ... "bar": [6, 7, 8, None, None, 9, 0], ... "ham": ["a", "b", "c", None, "d", "e", "f"], ... } ... )
Filter on one condition:
>>> df.filter(pl.col("foo") > 1) shape: (3, 3) ┌─────┬──────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪══════╪═════╡ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ │ 4 ┆ null ┆ d │ └─────┴──────┴─────┘
Filter on multiple conditions, combined with and/or operators:
>>> df.filter((pl.col("foo") < 3) & (pl.col("ham") == "a")) shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ └─────┴─────┴─────┘
>>> df.filter((pl.col("foo") == 1) | (pl.col("ham") == "c")) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘
Provide multiple filters using *args syntax:
>>> df.filter(
...     pl.col("foo") <= 2,
...     ~pl.col("ham").is_in(["b", "c"]),
... )
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
│ 0   ┆ 0   ┆ f   │
└─────┴─────┴─────┘
Provide multiple filters using **kwargs syntax:
>>> df.filter(foo=2, ham="b")
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 2   ┆ 7   ┆ b   │
└─────┴─────┴─────┘
Filter by comparing two columns against each other
>>> df.filter(pl.col("foo") == pl.col("bar")) shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 0 ┆ 0 ┆ f │ └─────┴─────┴─────┘
>>> df.filter(pl.col("foo") != pl.col("bar")) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘
Notice how the rows with None values are filtered out. In order to keep the same behavior as pandas, use:
>>> df.filter(pl.col("foo").ne_missing(pl.col("bar")))
shape: (5, 3)
┌──────┬──────┬─────┐
│ foo  ┆ bar  ┆ ham │
│ ---  ┆ ---  ┆ --- │
│ i64  ┆ i64  ┆ str │
╞══════╪══════╪═════╡
│ 1    ┆ 6    ┆ a   │
│ 2    ┆ 7    ┆ b   │
│ 3    ┆ 8    ┆ c   │
│ 4    ┆ null ┆ d   │
│ null ┆ 9    ┆ e   │
└──────┴──────┴─────┘
- property flags: dict[str, dict[str, bool]][source]
Get flags that are set on the columns of this DataFrame.
- Returns:
- dict
Mapping from column names to column flags.
- fold(operation: Callable[[Series, Series], Series]) Series [source]
Apply a horizontal reduction on a DataFrame.
This can be used to effectively determine aggregations on a row level, and can be applied to any DataType that can be supercast (cast to a similar parent type).
Examples of the supercast rules when applying an arithmetic operation to two DataTypes:
Int8 + String = String
Float32 + Int64 = Float32
Float32 + Float64 = Float64
- Parameters:
- operation
Function that takes two Series and returns a Series.
Examples
A horizontal sum operation:
>>> df = pl.DataFrame( ... { ... "a": [2, 1, 3], ... "b": [1, 2, 3], ... "c": [1.0, 2.0, 3.0], ... } ... ) >>> df.fold(lambda s1, s2: s1 + s2) shape: (3,) Series: 'a' [f64] [ 4.0 5.0 9.0 ]
A horizontal minimum operation:
>>> df = pl.DataFrame({"a": [2, 1, 3], "b": [1, 2, 3], "c": [1.0, 2.0, 3.0]}) >>> df.fold(lambda s1, s2: s1.zip_with(s1 < s2, s2)) shape: (3,) Series: 'a' [f64] [ 1.0 1.0 3.0 ]
A horizontal string concatenation:
>>> df = pl.DataFrame( ... { ... "a": ["foo", "bar", None], ... "b": [1, 2, 3], ... "c": [1.0, 2.0, 3.0], ... } ... ) >>> df.fold(lambda s1, s2: s1 + s2) shape: (3,) Series: 'a' [str] [ "foo11.0" "bar22.0" null ]
A horizontal boolean or, similar to a row-wise .any():
>>> df = pl.DataFrame( ... { ... "a": [False, False, True], ... "b": [False, True, False], ... } ... ) >>> df.fold(lambda s1, s2: s1 | s2) shape: (3,) Series: 'a' [bool] [ false true true ]
- gather_every(n: int, offset: int = 0) DataFrame [source]
Take every nth row in the DataFrame and return as a new DataFrame.
- Parameters:
- n
Gather every n-th row.
- offset
Starting index.
Examples
>>> s = pl.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]}) >>> s.gather_every(2) shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 5 │ │ 3 ┆ 7 │ └─────┴─────┘
>>> s.gather_every(2, offset=1) shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 2 ┆ 6 │ │ 4 ┆ 8 │ └─────┴─────┘
- get_column(
- name: str,
- *,
- default: Any | NoDefault = <no_default>,
Get a single column by name.
- Parameters:
- name
String name of the column to retrieve.
- default
Value to return if the column does not exist; if not explicitly set and the column is not present a
ColumnNotFoundError
exception is raised.
- Returns:
- Series (or arbitrary default value, if specified).
See also
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> df.get_column("foo") shape: (3,) Series: 'foo' [i64] [ 1 2 3 ]
Missing column handling; can optionally provide an arbitrary default value to the method (otherwise a ColumnNotFoundError exception is raised).
>>> df.get_column("baz", default=pl.Series("baz", ["?", "?", "?"]))
shape: (3,)
Series: 'baz' [str]
[
    "?"
    "?"
    "?"
]
>>> res = df.get_column("baz", default=None)
>>> res is None
True
- get_column_index(name: str) int [source]
Find the index of a column by name.
- Parameters:
- name
Name of the column to find.
Examples
>>> df = pl.DataFrame(
...     {"foo": [1, 2, 3], "bar": [6, 7, 8], "ham": ["a", "b", "c"]}
... )
>>> df.get_column_index("ham")
2
>>> df.get_column_index("sandwich")
ColumnNotFoundError: sandwich
- get_columns() list[Series] [source]
Get the DataFrame as a List of Series.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> df.get_columns() [shape: (3,) Series: 'foo' [i64] [ 1 2 3 ], shape: (3,) Series: 'bar' [i64] [ 4 5 6 ]]
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [0.5, 4, 10, 13], ... "c": [True, True, False, True], ... } ... ) >>> df.get_columns() [shape: (4,) Series: 'a' [i64] [ 1 2 3 4 ], shape: (4,) Series: 'b' [f64] [ 0.5 4.0 10.0 13.0 ], shape: (4,) Series: 'c' [bool] [ true true false true ]]
- glimpse( ) str | None [source]
Return a dense preview of the DataFrame.
The formatting shows one line per column so that wide dataframes display cleanly. Each line shows the column name, the data type, and the first few values.
- Parameters:
- max_items_per_column
Maximum number of items to show per column.
- max_colname_length
Maximum length of the displayed column names; values that exceed this value are truncated with a trailing ellipsis.
- return_as_string
If True, return the preview as a string instead of printing to stdout.
Examples
>>> from datetime import date >>> df = pl.DataFrame( ... { ... "a": [1.0, 2.8, 3.0], ... "b": [4, 5, None], ... "c": [True, False, True], ... "d": [None, "b", "c"], ... "e": ["usd", "eur", None], ... "f": [date(2020, 1, 1), date(2021, 1, 2), date(2022, 1, 1)], ... } ... ) >>> df.glimpse() Rows: 3 Columns: 6 $ a <f64> 1.0, 2.8, 3.0 $ b <i64> 4, 5, None $ c <bool> True, False, True $ d <str> None, 'b', 'c' $ e <str> 'usd', 'eur', None $ f <date> 2020-01-01, 2021-01-02, 2022-01-01
- group_by(
- *by: IntoExpr | Iterable[IntoExpr],
- maintain_order: bool = False,
- **named_by: IntoExpr,
Start a group by operation.
- Parameters:
- *by
Column(s) to group by. Accepts expression input. Strings are parsed as column names.
- maintain_order
Ensure that the order of the groups is consistent with the input data. This is slower than a default group by. Setting this to True prevents the query from running on the streaming engine.
Note
Within each group, the order of rows is always preserved, regardless of this argument.
- **named_by
Additional columns to group by, specified as keyword arguments. The columns will be renamed to the keyword used.
- Returns:
- GroupBy
Object which can be used to perform aggregations.
Examples
Group by one column and call agg to compute the grouped sum of another column.
>>> df = pl.DataFrame(
...     {
...         "a": ["a", "b", "a", "b", "c"],
...         "b": [1, 2, 1, 3, 3],
...         "c": [5, 4, 3, 2, 1],
...     }
... )
>>> df.group_by("a").agg(pl.col("b").sum())
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 2   │
│ b   ┆ 5   │
│ c   ┆ 3   │
└─────┴─────┘
Set maintain_order=True to ensure the order of the groups is consistent with the input.
>>> df.group_by("a", maintain_order=True).agg(pl.col("c"))
shape: (3, 2)
┌─────┬───────────┐
│ a   ┆ c         │
│ --- ┆ ---       │
│ str ┆ list[i64] │
╞═════╪═══════════╡
│ a   ┆ [5, 3]    │
│ b   ┆ [4, 2]    │
│ c   ┆ [1]       │
└─────┴───────────┘
Group by multiple columns by passing a list of column names.
>>> df.group_by(["a", "b"]).agg(pl.max("c")) shape: (4, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 1 ┆ 5 │ │ b ┆ 2 ┆ 4 │ │ b ┆ 3 ┆ 2 │ │ c ┆ 3 ┆ 1 │ └─────┴─────┴─────┘
Or use positional arguments to group by multiple columns in the same way. Expressions are also accepted.
>>> df.group_by("a", pl.col("b") // 2).agg(pl.col("c").mean()) shape: (3, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ f64 │ ╞═════╪═════╪═════╡ │ a ┆ 0 ┆ 4.0 │ │ b ┆ 1 ┆ 3.0 │ │ c ┆ 1 ┆ 1.0 │ └─────┴─────┴─────┘
The GroupBy object returned by this method is iterable, returning the name and data of each group.
>>> for name, data in df.group_by("a"):
...     print(name)
...     print(data)
('a',)
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 1   ┆ 5   │
│ a   ┆ 1   ┆ 3   │
└─────┴─────┴─────┘
('b',)
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ b   ┆ 2   ┆ 4   │
│ b   ┆ 3   ┆ 2   │
└─────┴─────┴─────┘
('c',)
shape: (1, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ c   ┆ 3   ┆ 1   │
└─────┴─────┴─────┘
- group_by_dynamic(
- index_column: IntoExpr,
- *,
- every: str | timedelta,
- period: str | timedelta | None = None,
- offset: str | timedelta | None = None,
- include_boundaries: bool = False,
- closed: ClosedInterval = 'left',
- label: Label = 'left',
- group_by: IntoExpr | Iterable[IntoExpr] | None = None,
- start_by: StartBy = 'window',
Group based on a time value (or index value of type Int32, Int64).
Time windows are calculated and rows are assigned to windows. Unlike a normal group by, a row can be a member of multiple groups. By default, the windows look like:
[start, start + period)
[start + every, start + every + period)
[start + 2*every, start + 2*every + period)
…
where start is determined by start_by, offset, every, and the earliest datapoint. See the start_by argument description for details.
Warning
The index column must be sorted in ascending order. If group_by is passed, then the index column must be sorted in ascending order within each group.
- Parameters:
- index_column
Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if group_by is specified, then it must be sorted in ascending order within each group).
In case of a dynamic group by on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.
- every
Interval of the window.
- period
Length of the window; if None, it will equal ‘every’.
- offset
Offset of the window; does not take effect if start_by is ‘datapoint’. Defaults to zero.
- include_boundaries
Add the lower and upper bound of the window to the “_lower_boundary” and “_upper_boundary” columns. This will impact performance because it’s harder to parallelize.
- closed{‘left’, ‘right’, ‘both’, ‘none’}
Define which sides of the temporal interval are closed (inclusive).
- label{‘left’, ‘right’, ‘datapoint’}
Define which label to use for the window:
‘left’: lower boundary of the window
‘right’: upper boundary of the window
‘datapoint’: the first value of the index column in the given window. If you don’t need the label to be at one of the boundaries, choose this option for maximum performance
- group_by
Also group by this column/these columns
- start_by{‘window’, ‘datapoint’, ‘monday’, ‘tuesday’, ‘wednesday’, ‘thursday’, ‘friday’, ‘saturday’, ‘sunday’}
The strategy to determine the start of the first window by.
‘window’: Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.
‘datapoint’: Start from the first encountered data point.
a day of the week (only takes effect if every contains 'w'):
‘monday’: Start the window on the Monday before the first data point.
‘tuesday’: Start the window on the Tuesday before the first data point.
…
‘sunday’: Start the window on the Sunday before the first data point.
The resulting window is then shifted back until the earliest datapoint is in or in front of it.
- Returns:
- DynamicGroupBy
Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if group_by columns are passed, it will only be sorted within each group).
See also
Notes
If you’re coming from pandas, then
# polars
df.group_by_dynamic("ts", every="1d").agg(pl.col("value").sum())
is equivalent to
# pandas
df.set_index("ts").resample("D")["value"].sum().reset_index()
though note that, unlike pandas, polars doesn’t add extra rows for empty windows. If you need index_column to be evenly spaced, then please combine with DataFrame.upsample().
The every, period and offset arguments are created with the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds
By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.
In case of a group_by_dynamic on an integer column, the windows are defined by:
“1i” # length 1
“10i” # length 10
Examples
>>> from datetime import datetime >>> df = pl.DataFrame( ... { ... "time": pl.datetime_range( ... start=datetime(2021, 12, 16), ... end=datetime(2021, 12, 16, 3), ... interval="30m", ... eager=True, ... ), ... "n": range(7), ... } ... ) >>> df shape: (7, 2) ┌─────────────────────┬─────┐ │ time ┆ n │ │ --- ┆ --- │ │ datetime[μs] ┆ i64 │ ╞═════════════════════╪═════╡ │ 2021-12-16 00:00:00 ┆ 0 │ │ 2021-12-16 00:30:00 ┆ 1 │ │ 2021-12-16 01:00:00 ┆ 2 │ │ 2021-12-16 01:30:00 ┆ 3 │ │ 2021-12-16 02:00:00 ┆ 4 │ │ 2021-12-16 02:30:00 ┆ 5 │ │ 2021-12-16 03:00:00 ┆ 6 │ └─────────────────────┴─────┘
Group by windows of 1 hour.
>>> df.group_by_dynamic("time", every="1h", closed="right").agg(pl.col("n")) shape: (4, 2) ┌─────────────────────┬───────────┐ │ time ┆ n │ │ --- ┆ --- │ │ datetime[μs] ┆ list[i64] │ ╞═════════════════════╪═══════════╡ │ 2021-12-15 23:00:00 ┆ [0] │ │ 2021-12-16 00:00:00 ┆ [1, 2] │ │ 2021-12-16 01:00:00 ┆ [3, 4] │ │ 2021-12-16 02:00:00 ┆ [5, 6] │ └─────────────────────┴───────────┘
The window boundaries can also be added to the aggregation result
>>> df.group_by_dynamic( ... "time", every="1h", include_boundaries=True, closed="right" ... ).agg(pl.col("n").mean()) shape: (4, 4) ┌─────────────────────┬─────────────────────┬─────────────────────┬─────┐ │ _lower_boundary ┆ _upper_boundary ┆ time ┆ n │ │ --- ┆ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ f64 │ ╞═════════════════════╪═════════════════════╪═════════════════════╪═════╡ │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 0.0 │ │ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 1.5 │ │ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 3.5 │ │ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 5.5 │ └─────────────────────┴─────────────────────┴─────────────────────┴─────┘
When closed="left", the window excludes the right end of interval: [lower_bound, upper_bound)
>>> df.group_by_dynamic("time", every="1h", closed="left").agg(pl.col("n")) shape: (4, 2) ┌─────────────────────┬───────────┐ │ time ┆ n │ │ --- ┆ --- │ │ datetime[μs] ┆ list[i64] │ ╞═════════════════════╪═══════════╡ │ 2021-12-16 00:00:00 ┆ [0, 1] │ │ 2021-12-16 01:00:00 ┆ [2, 3] │ │ 2021-12-16 02:00:00 ┆ [4, 5] │ │ 2021-12-16 03:00:00 ┆ [6] │ └─────────────────────┴───────────┘
When closed="both" the time values at the window boundaries belong to 2 groups.
>>> df.group_by_dynamic("time", every="1h", closed="both").agg(pl.col("n")) shape: (4, 2) ┌─────────────────────┬───────────┐ │ time ┆ n │ │ --- ┆ --- │ │ datetime[μs] ┆ list[i64] │ ╞═════════════════════╪═══════════╡ │ 2021-12-16 00:00:00 ┆ [0, 1, 2] │ │ 2021-12-16 01:00:00 ┆ [2, 3, 4] │ │ 2021-12-16 02:00:00 ┆ [4, 5, 6] │ │ 2021-12-16 03:00:00 ┆ [6] │ └─────────────────────┴───────────┘
Dynamic group bys can also be combined with grouping on normal keys
>>> df = df.with_columns(groups=pl.Series(["a", "a", "a", "b", "b", "a", "a"])) >>> df shape: (7, 3) ┌─────────────────────┬─────┬────────┐ │ time ┆ n ┆ groups │ │ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ i64 ┆ str │ ╞═════════════════════╪═════╪════════╡ │ 2021-12-16 00:00:00 ┆ 0 ┆ a │ │ 2021-12-16 00:30:00 ┆ 1 ┆ a │ │ 2021-12-16 01:00:00 ┆ 2 ┆ a │ │ 2021-12-16 01:30:00 ┆ 3 ┆ b │ │ 2021-12-16 02:00:00 ┆ 4 ┆ b │ │ 2021-12-16 02:30:00 ┆ 5 ┆ a │ │ 2021-12-16 03:00:00 ┆ 6 ┆ a │ └─────────────────────┴─────┴────────┘ >>> df.group_by_dynamic( ... "time", ... every="1h", ... closed="both", ... group_by="groups", ... include_boundaries=True, ... ).agg(pl.col("n")) shape: (6, 5) ┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────────┐ │ groups ┆ _lower_boundary ┆ _upper_boundary ┆ time ┆ n │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ list[i64] │ ╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪═══════════╡ │ a ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ [0, 1, 2] │ │ a ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ [2] │ │ a ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ [5, 6] │ │ a ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ [6] │ │ b ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ [3, 4] │ │ b ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ [4] │ └────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────────┘
Dynamic group by on an index column
>>> df = pl.DataFrame( ... { ... "idx": pl.int_range(0, 6, eager=True), ... "A": ["A", "A", "B", "B", "B", "C"], ... } ... ) >>> ( ... df.group_by_dynamic( ... "idx", ... every="2i", ... period="3i", ... include_boundaries=True, ... closed="right", ... ).agg(pl.col("A").alias("A_agg_list")) ... ) shape: (4, 4) ┌─────────────────┬─────────────────┬─────┬─────────────────┐ │ _lower_boundary ┆ _upper_boundary ┆ idx ┆ A_agg_list │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 ┆ list[str] │ ╞═════════════════╪═════════════════╪═════╪═════════════════╡ │ -2 ┆ 1 ┆ -2 ┆ ["A", "A"] │ │ 0 ┆ 3 ┆ 0 ┆ ["A", "B", "B"] │ │ 2 ┆ 5 ┆ 2 ┆ ["B", "B", "C"] │ │ 4 ┆ 7 ┆ 4 ┆ ["C"] │ └─────────────────┴─────────────────┴─────┴─────────────────┘
- hash_rows( ) Series [source]
Hash and combine the rows in this DataFrame.
The hash value is of type UInt64.
- Parameters:
- seed
Random seed parameter. Defaults to 0.
- seed_1
Random seed parameter. Defaults to seed if not set.
- seed_2
Random seed parameter. Defaults to seed if not set.
- seed_3
Random seed parameter. Defaults to seed if not set.
Notes
This implementation of hash_rows does not guarantee stable results across different Polars versions. Its stability is only guaranteed within a single version.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, None, 3, 4], ... "ham": ["a", "b", None, "d"], ... } ... ) >>> df.hash_rows(seed=42) shape: (4,) Series: '' [u64] [ 10783150408545073287 1438741209321515184 10047419486152048166 2047317070637311557 ]
- head(n: int = 5) DataFrame [source]
Get the first n rows.
- Parameters:
- n
Number of rows to return. If a negative value is passed, return all rows except the last abs(n).
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 4, 5], ... "bar": [6, 7, 8, 9, 10], ... "ham": ["a", "b", "c", "d", "e"], ... } ... ) >>> df.head(3) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘
Pass a negative value to get all rows except the last abs(n).
>>> df.head(-3)
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
│ 2   ┆ 7   ┆ b   │
└─────┴─────┴─────┘
- property height: int[source]
Get the number of rows.
- Returns:
- int
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]})
>>> df.height
5
- hstack( ) DataFrame [source]
Return a new DataFrame grown horizontally by stacking multiple Series to it.
- Parameters:
- columns
Series to stack.
- in_place
Modify in place.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> x = pl.Series("apple", [10, 20, 30]) >>> df.hstack([x]) shape: (3, 4) ┌─────┬─────┬─────┬───────┐ │ foo ┆ bar ┆ ham ┆ apple │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str ┆ i64 │ ╞═════╪═════╪═════╪═══════╡ │ 1 ┆ 6 ┆ a ┆ 10 │ │ 2 ┆ 7 ┆ b ┆ 20 │ │ 3 ┆ 8 ┆ c ┆ 30 │ └─────┴─────┴─────┴───────┘
- insert_column(index: int, column: IntoExprColumn) DataFrame [source]
Insert a Series at a certain column index.
This operation is in place.
- Parameters:
- index
Index at which to insert the new column.
- column
Series
or expression to insert.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> s = pl.Series("baz", [97, 98, 99]) >>> df.insert_column(1, s) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ baz ┆ bar │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ 1 ┆ 97 ┆ 4 │ │ 2 ┆ 98 ┆ 5 │ │ 3 ┆ 99 ┆ 6 │ └─────┴─────┴─────┘
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [0.5, 4, 10, 13], ... "c": [True, True, False, True], ... } ... ) >>> s = pl.Series("d", [-2.5, 15, 20.5, 0]) >>> df.insert_column(3, s) shape: (4, 4) ┌─────┬──────┬───────┬──────┐ │ a ┆ b ┆ c ┆ d │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool ┆ f64 │ ╞═════╪══════╪═══════╪══════╡ │ 1 ┆ 0.5 ┆ true ┆ -2.5 │ │ 2 ┆ 4.0 ┆ true ┆ 15.0 │ │ 3 ┆ 10.0 ┆ false ┆ 20.5 │ │ 4 ┆ 13.0 ┆ true ┆ 0.0 │ └─────┴──────┴───────┴──────┘
- interpolate() DataFrame [source]
Interpolate intermediate values. The interpolation method is linear.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, None, 9, 10], ... "bar": [6, 7, 9, None], ... "baz": [1, None, None, 9], ... } ... ) >>> df.interpolate() shape: (4, 3) ┌──────┬──────┬──────────┐ │ foo ┆ bar ┆ baz │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ f64 │ ╞══════╪══════╪══════════╡ │ 1.0 ┆ 6.0 ┆ 1.0 │ │ 5.0 ┆ 7.0 ┆ 3.666667 │ │ 9.0 ┆ 9.0 ┆ 6.333333 │ │ 10.0 ┆ null ┆ 9.0 │ └──────┴──────┴──────────┘
- is_duplicated() Series [source]
Get a mask of all duplicated rows in this DataFrame.
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 1], ... "b": ["x", "y", "z", "x"], ... } ... ) >>> df.is_duplicated() shape: (4,) Series: '' [bool] [ true false false true ]
This mask can be used to show the duplicated rows like this:
>>> df.filter(df.is_duplicated()) shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ str │ ╞═════╪═════╡ │ 1 ┆ x │ │ 1 ┆ x │ └─────┴─────┘
- is_empty() bool [source]
Returns True if the DataFrame contains no rows.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
>>> df.is_empty()
False
>>> df.filter(pl.col("foo") > 99).is_empty()
True
- is_unique() Series [source]
Get a mask of all unique rows in this DataFrame.
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 1], ... "b": ["x", "y", "z", "x"], ... } ... ) >>> df.is_unique() shape: (4,) Series: '' [bool] [ false true true false ]
This mask can be used to show the unique rows like this:
>>> df.filter(df.is_unique()) shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ str │ ╞═════╪═════╡ │ 2 ┆ y │ │ 3 ┆ z │ └─────┴─────┘
- item(row: int | None = None, column: int | str | None = None) Any [source]
Return the DataFrame as a scalar, or return the element at the given row/column.
- Parameters:
- row
Optional row index.
- column
Optional column index or name.
See also
row
Get the values of a single row, either by index or by predicate.
Notes
If row/col not provided, this is equivalent to df[0,0], with a check that the shape is (1,1). With row/col, this is equivalent to df[row,col].
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> df.select((pl.col("a") * pl.col("b")).sum()).item()
32
>>> df.item(1, 1)
5
>>> df.item(2, "b")
6
- iter_columns() Iterator[Series] [source]
Returns an iterator over the columns of this DataFrame.
- Yields:
- Series
Notes
Consider whether you can use all() instead. If you can, it will be more efficient.
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 3, 5], ... "b": [2, 4, 6], ... } ... ) >>> [s.name for s in df.iter_columns()] ['a', 'b']
If you’re using this to modify a dataframe’s columns, e.g.
>>> # Do NOT do this >>> pl.DataFrame(column * 2 for column in df.iter_columns()) shape: (3, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 2 ┆ 4 │ │ 6 ┆ 8 │ │ 10 ┆ 12 │ └─────┴─────┘
then consider whether you can use all() instead:
>>> df.select(pl.all() * 2)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 4   │
│ 6   ┆ 8   │
│ 10  ┆ 12  │
└─────┴─────┘
- iter_rows( ) Iterator[tuple[Any, ...]] | Iterator[dict[str, Any]] [source]
Returns an iterator over the rows of the DataFrame, as Python-native values.
- Parameters:
- named
Return dictionaries instead of tuples. The dictionaries are a mapping of column name to row value. This is more expensive than returning a regular tuple, but allows for accessing values by column name.
- buffer_size
Determines the number of rows that are buffered internally while iterating over the data; you should only modify this in very specific cases where the default value is determined not to be a good fit to your access pattern, as the speedup from using the buffer is significant (~2-4x). Setting this value to zero disables row buffering (not recommended).
- Returns:
- iterator of tuples (default) or dictionaries (if named) of python row values
Warning
Row iteration is not optimal as the underlying data is stored in columnar form; where possible, prefer export via one of the dedicated export/output methods that deals with columnar data.
See also
rows
Materialises all frame data as a list of rows (potentially expensive).
rows_by_key
Materialises frame data as a key-indexed dictionary.
Notes
If you have ns-precision temporal values you should be aware that Python natively only supports up to μs-precision; ns-precision values will be truncated to microseconds on conversion to Python. If this matters to your use-case you should export to a different format (such as Arrow or NumPy).
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> [row[0] for row in df.iter_rows()]
[1, 3, 5]
>>> [row["b"] for row in df.iter_rows(named=True)]
[2, 4, 6]
- iter_slices(n_rows: int = 10000) Iterator[DataFrame] [source]
Returns a non-copying iterator of slices over the underlying DataFrame.
- Parameters:
- n_rows
Determines the number of rows contained in each DataFrame slice.
See also
iter_rows
Row iterator over frame data (does not materialise all rows).
partition_by
Split into multiple DataFrames, partitioned by groups.
Examples
>>> from datetime import date
>>> df = pl.DataFrame(
...     data={
...         "a": range(17_500),
...         "b": date(2023, 1, 1),
...         "c": "klmnoopqrstuvwxyz",
...     },
...     schema_overrides={"a": pl.Int32},
... )
>>> for idx, frame in enumerate(df.iter_slices()):
...     print(f"{type(frame).__name__}:[{idx}]:{len(frame)}")
DataFrame:[0]:10000
DataFrame:[1]:7500
Using iter_slices is an efficient way to chunk-iterate over DataFrames and any supported frame export/conversion types; for example, as RecordBatches:
>>> for frame in df.iter_slices(n_rows=15_000):
...     record_batch = frame.to_arrow().to_batches()[0]
...     print(f"{record_batch.schema}\n<< {len(record_batch)}")
a: int32
b: date32[day]
c: large_string
<< 15000
a: int32
b: date32[day]
c: large_string
<< 2500
- join(
- other: DataFrame,
- on: str | Expr | Sequence[str | Expr] | None = None,
- how: JoinStrategy = 'inner',
- *,
- left_on: str | Expr | Sequence[str | Expr] | None = None,
- right_on: str | Expr | Sequence[str | Expr] | None = None,
- suffix: str = '_right',
- validate: JoinValidation = 'm:m',
- join_nulls: bool = False,
- coalesce: bool | None = None,
- maintain_order: MaintainOrderJoin | None = None,
Join in SQL-like fashion.
- Parameters:
- other
DataFrame to join with.
- on
Name(s) of the join columns in both DataFrames.
- how{‘inner’, ‘left’, ‘right’, ‘full’, ‘semi’, ‘anti’, ‘cross’}
Join strategy.
- inner
Returns rows that have matching values in both tables
- left
Returns all rows from the left table, and the matched rows from the right table
- right
Returns all rows from the right table, and the matched rows from the left table
- full
Returns all rows when there is a match in either left or right table
- cross
Returns the Cartesian product of rows from both tables
- semi
Returns rows from the left table that have a match in the right table.
- anti
Returns rows from the left table that have no match in the right table.
- left_on
Name(s) of the left join column(s).
- right_on
Name(s) of the right join column(s).
- suffix
Suffix to append to columns with a duplicate name.
- validate: {‘m:m’, ‘m:1’, ‘1:m’, ‘1:1’}
Checks if join is of specified type.
- many_to_many
“m:m”: default, does not result in checks
- one_to_one
“1:1”: check if join keys are unique in both left and right datasets
- one_to_many
“1:m”: check if join keys are unique in left dataset
- many_to_one
“m:1”: check if join keys are unique in right dataset
Note
This is currently not supported by the streaming engine.
- join_nulls
Join on null values. By default null values will never produce matches.
- coalesce
Coalescing behavior (merging of join columns).
None: -> join specific.
True: -> Always coalesce join columns.
False: -> Never coalesce join columns.
Note
Joining on any expressions other than col will turn off coalescing.
- maintain_order{‘none’, ‘left’, ‘right’, ‘left_right’, ‘right_left’}
Which DataFrame row order to preserve, if any. Do not rely on any observed ordering without explicitly setting this parameter, as your code may break in a future release. Not specifying any ordering can improve performance. Supported for inner, left, right and full joins.
- none
No specific ordering is desired. The ordering might differ across Polars versions or even between different runs.
- left
Preserves the order of the left DataFrame.
- right
Preserves the order of the right DataFrame.
- left_right
First preserves the order of the left DataFrame, then the right.
- right_left
First preserves the order of the right DataFrame, then the left.
Notes
For joining on columns with categorical data, see polars.StringCache.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> other_df = pl.DataFrame( ... { ... "apple": ["x", "y", "z"], ... "ham": ["a", "b", "d"], ... } ... ) >>> df.join(other_df, on="ham") shape: (2, 4) ┌─────┬─────┬─────┬───────┐ │ foo ┆ bar ┆ ham ┆ apple │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str ┆ str │ ╞═════╪═════╪═════╪═══════╡ │ 1 ┆ 6.0 ┆ a ┆ x │ │ 2 ┆ 7.0 ┆ b ┆ y │ └─────┴─────┴─────┴───────┘
>>> df.join(other_df, on="ham", how="full") shape: (4, 5) ┌──────┬──────┬──────┬───────┬───────────┐ │ foo ┆ bar ┆ ham ┆ apple ┆ ham_right │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str ┆ str ┆ str │ ╞══════╪══════╪══════╪═══════╪═══════════╡ │ 1 ┆ 6.0 ┆ a ┆ x ┆ a │ │ 2 ┆ 7.0 ┆ b ┆ y ┆ b │ │ null ┆ null ┆ null ┆ z ┆ d │ │ 3 ┆ 8.0 ┆ c ┆ null ┆ null │ └──────┴──────┴──────┴───────┴───────────┘
>>> df.join(other_df, on="ham", how="left", coalesce=True) shape: (3, 4) ┌─────┬─────┬─────┬───────┐ │ foo ┆ bar ┆ ham ┆ apple │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str ┆ str │ ╞═════╪═════╪═════╪═══════╡ │ 1 ┆ 6.0 ┆ a ┆ x │ │ 2 ┆ 7.0 ┆ b ┆ y │ │ 3 ┆ 8.0 ┆ c ┆ null │ └─────┴─────┴─────┴───────┘
>>> df.join(other_df, on="ham", how="semi") shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6.0 ┆ a │ │ 2 ┆ 7.0 ┆ b │ └─────┴─────┴─────┘
>>> df.join(other_df, on="ham", how="anti") shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞═════╪═════╪═════╡ │ 3 ┆ 8.0 ┆ c │ └─────┴─────┴─────┘
- join_asof(
- other: DataFrame,
- *,
- left_on: str | None | Expr = None,
- right_on: str | None | Expr = None,
- on: str | None | Expr = None,
- by_left: str | Sequence[str] | None = None,
- by_right: str | Sequence[str] | None = None,
- by: str | Sequence[str] | None = None,
- strategy: AsofJoinStrategy = 'backward',
- suffix: str = '_right',
- tolerance: str | int | float | timedelta | None = None,
- allow_parallel: bool = True,
- force_parallel: bool = False,
- coalesce: bool = True,
Perform an asof join.
This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the asof_join key.
For each row in the left DataFrame:
A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.
A “nearest” search selects the last row in the right DataFrame whose value is nearest to the left’s key. String keys are not currently supported for a nearest search.
The default is “backward”.
- Parameters:
- other
DataFrame to join with.
- left_on
Join column of the left DataFrame.
- right_on
Join column of the right DataFrame.
- on
Join column of both DataFrames. If set, left_on and right_on should be None.
- by
Join on these columns before doing the asof join.
- by_left
Join on these columns of the left DataFrame before doing the asof join.
- by_right
Join on these columns of the right DataFrame before doing the asof join.
- strategy{‘backward’, ‘forward’, ‘nearest’}
Join strategy.
- suffix
Suffix to append to columns with a duplicate name.
- tolerance
Numeric tolerance. By setting this the join will only be done if the near keys are within this distance. If an asof join is done on columns of dtype “Date”, “Datetime”, “Duration” or “Time”, use either a datetime.timedelta object or the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds
By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.
- allow_parallel
Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.
- force_parallel
Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.
- coalesce
Coalescing behavior (merging of on / left_on / right_on columns):
True: -> Always coalesce join columns.
False: -> Never coalesce join columns.
Note that joining on any expressions other than col will turn off coalescing.
Examples
>>> from datetime import date >>> gdp = pl.DataFrame( ... { ... "date": pl.date_range( ... date(2016, 1, 1), ... date(2020, 1, 1), ... "1y", ... eager=True, ... ), ... "gdp": [4164, 4411, 4566, 4696, 4827], ... } ... ) >>> gdp shape: (5, 2) ┌────────────┬──────┐ │ date ┆ gdp │ │ --- ┆ --- │ │ date ┆ i64 │ ╞════════════╪══════╡ │ 2016-01-01 ┆ 4164 │ │ 2017-01-01 ┆ 4411 │ │ 2018-01-01 ┆ 4566 │ │ 2019-01-01 ┆ 4696 │ │ 2020-01-01 ┆ 4827 │ └────────────┴──────┘
>>> population = pl.DataFrame( ... { ... "date": [date(2016, 3, 1), date(2018, 8, 1), date(2019, 1, 1)], ... "population": [82.19, 82.66, 83.12], ... } ... ).sort("date") >>> population shape: (3, 2) ┌────────────┬────────────┐ │ date ┆ population │ │ --- ┆ --- │ │ date ┆ f64 │ ╞════════════╪════════════╡ │ 2016-03-01 ┆ 82.19 │ │ 2018-08-01 ┆ 82.66 │ │ 2019-01-01 ┆ 83.12 │ └────────────┴────────────┘
Note how the dates don’t quite match. If we join them using join_asof and strategy='backward', then each date from population which doesn’t have an exact match is matched with the closest earlier date from gdp:
>>> population.join_asof(gdp, on="date", strategy="backward")
shape: (3, 3)
┌────────────┬────────────┬──────┐
│ date       ┆ population ┆ gdp  │
│ ---        ┆ ---        ┆ ---  │
│ date       ┆ f64        ┆ i64  │
╞════════════╪════════════╪══════╡
│ 2016-03-01 ┆ 82.19      ┆ 4164 │
│ 2018-08-01 ┆ 82.66      ┆ 4566 │
│ 2019-01-01 ┆ 83.12      ┆ 4696 │
└────────────┴────────────┴──────┘
Note how:
date 2016-03-01 from population is matched with 2016-01-01 from gdp;
date 2018-08-01 from population is matched with 2018-01-01 from gdp.
You can verify this by passing coalesce=False:
>>> population.join_asof(gdp, on="date", strategy="backward", coalesce=False)
shape: (3, 4)
┌────────────┬────────────┬────────────┬──────┐
│ date       ┆ population ┆ date_right ┆ gdp  │
│ ---        ┆ ---        ┆ ---        ┆ ---  │
│ date       ┆ f64        ┆ date       ┆ i64  │
╞════════════╪════════════╪════════════╪══════╡
│ 2016-03-01 ┆ 82.19      ┆ 2016-01-01 ┆ 4164 │
│ 2018-08-01 ┆ 82.66      ┆ 2018-01-01 ┆ 4566 │
│ 2019-01-01 ┆ 83.12      ┆ 2019-01-01 ┆ 4696 │
└────────────┴────────────┴────────────┴──────┘
If we instead use strategy='forward', then each date from population which doesn’t have an exact match is matched with the closest later date from gdp:
>>> population.join_asof(gdp, on="date", strategy="forward")
shape: (3, 3)
┌────────────┬────────────┬──────┐
│ date       ┆ population ┆ gdp  │
│ ---        ┆ ---        ┆ ---  │
│ date       ┆ f64        ┆ i64  │
╞════════════╪════════════╪══════╡
│ 2016-03-01 ┆ 82.19      ┆ 4411 │
│ 2018-08-01 ┆ 82.66      ┆ 4696 │
│ 2019-01-01 ┆ 83.12      ┆ 4696 │
└────────────┴────────────┴──────┘
Note how:
date 2016-03-01 from population is matched with 2017-01-01 from gdp;
date 2018-08-01 from population is matched with 2019-01-01 from gdp.
Finally, strategy='nearest' gives us a mix of the two results above, as each date from population which doesn’t have an exact match is matched with the closest date from gdp, regardless of whether it’s earlier or later:
>>> population.join_asof(gdp, on="date", strategy="nearest")
shape: (3, 3)
┌────────────┬────────────┬──────┐
│ date       ┆ population ┆ gdp  │
│ ---        ┆ ---        ┆ ---  │
│ date       ┆ f64        ┆ i64  │
╞════════════╪════════════╪══════╡
│ 2016-03-01 ┆ 82.19      ┆ 4164 │
│ 2018-08-01 ┆ 82.66      ┆ 4696 │
│ 2019-01-01 ┆ 83.12      ┆ 4696 │
└────────────┴────────────┴──────┘
Note how:
date 2016-03-01 from population is matched with 2016-01-01 from gdp;
date 2018-08-01 from population is matched with 2019-01-01 from gdp.
The by argument allows joining on another column first, before the asof join. In this example we join by country first, then asof join by date, as above.
>>> gdp_dates = pl.date_range(  # fmt: skip
...     date(2016, 1, 1), date(2020, 1, 1), "1y", eager=True
... )
>>> gdp2 = pl.DataFrame(
...     {
...         "country": ["Germany"] * 5 + ["Netherlands"] * 5,
...         "date": pl.concat([gdp_dates, gdp_dates]),
...         "gdp": [4164, 4411, 4566, 4696, 4827, 784, 833, 914, 910, 909],
...     }
... ).sort("country", "date")
>>> gdp2
shape: (10, 3)
┌─────────────┬────────────┬──────┐
│ country     ┆ date       ┆ gdp  │
│ ---         ┆ ---        ┆ ---  │
│ str         ┆ date       ┆ i64  │
╞═════════════╪════════════╪══════╡
│ Germany     ┆ 2016-01-01 ┆ 4164 │
│ Germany     ┆ 2017-01-01 ┆ 4411 │
│ Germany     ┆ 2018-01-01 ┆ 4566 │
│ Germany     ┆ 2019-01-01 ┆ 4696 │
│ Germany     ┆ 2020-01-01 ┆ 4827 │
│ Netherlands ┆ 2016-01-01 ┆ 784  │
│ Netherlands ┆ 2017-01-01 ┆ 833  │
│ Netherlands ┆ 2018-01-01 ┆ 914  │
│ Netherlands ┆ 2019-01-01 ┆ 910  │
│ Netherlands ┆ 2020-01-01 ┆ 909  │
└─────────────┴────────────┴──────┘
>>> pop2 = pl.DataFrame(
...     {
...         "country": ["Germany"] * 3 + ["Netherlands"] * 3,
...         "date": [
...             date(2016, 3, 1),
...             date(2018, 8, 1),
...             date(2019, 1, 1),
...             date(2016, 3, 1),
...             date(2018, 8, 1),
...             date(2019, 1, 1),
...         ],
...         "population": [82.19, 82.66, 83.12, 17.11, 17.32, 17.40],
...     }
... ).sort("country", "date")
>>> pop2
shape: (6, 3)
┌─────────────┬────────────┬────────────┐
│ country     ┆ date       ┆ population │
│ ---         ┆ ---        ┆ ---        │
│ str         ┆ date       ┆ f64        │
╞═════════════╪════════════╪════════════╡
│ Germany     ┆ 2016-03-01 ┆ 82.19      │
│ Germany     ┆ 2018-08-01 ┆ 82.66      │
│ Germany     ┆ 2019-01-01 ┆ 83.12      │
│ Netherlands ┆ 2016-03-01 ┆ 17.11      │
│ Netherlands ┆ 2018-08-01 ┆ 17.32      │
│ Netherlands ┆ 2019-01-01 ┆ 17.4       │
└─────────────┴────────────┴────────────┘
>>> pop2.join_asof(gdp2, by="country", on="date", strategy="nearest")
shape: (6, 4)
┌─────────────┬────────────┬────────────┬──────┐
│ country     ┆ date       ┆ population ┆ gdp  │
│ ---         ┆ ---        ┆ ---        ┆ ---  │
│ str         ┆ date       ┆ f64        ┆ i64  │
╞═════════════╪════════════╪════════════╪══════╡
│ Germany     ┆ 2016-03-01 ┆ 82.19      ┆ 4164 │
│ Germany     ┆ 2018-08-01 ┆ 82.66      ┆ 4696 │
│ Germany     ┆ 2019-01-01 ┆ 83.12      ┆ 4696 │
│ Netherlands ┆ 2016-03-01 ┆ 17.11      ┆ 784  │
│ Netherlands ┆ 2018-08-01 ┆ 17.32      ┆ 910  │
│ Netherlands ┆ 2019-01-01 ┆ 17.4       ┆ 910  │
└─────────────┴────────────┴────────────┴──────┘
- join_where(
- other: DataFrame,
- *predicates: Expr | Iterable[Expr],
- suffix: str = '_right',
Perform a join based on one or multiple (in)equality predicates.
This performs an inner join, so only rows where all predicates are true are included in the result, and a row from either DataFrame may be included multiple times in the result.
Note
The row order of the input DataFrames is not preserved.
Warning
This functionality is experimental. It may be changed at any point without it being considered a breaking change.
- Parameters:
- other
DataFrame to join with.
- *predicates
(In)Equality condition to join the two tables on. When a column name occurs in both tables, the proper suffix must be applied in the predicate.
- suffix
Suffix to append to columns with a duplicate name.
Examples
>>> east = pl.DataFrame( ... { ... "id": [100, 101, 102], ... "dur": [120, 140, 160], ... "rev": [12, 14, 16], ... "cores": [2, 8, 4], ... } ... ) >>> west = pl.DataFrame( ... { ... "t_id": [404, 498, 676, 742], ... "time": [90, 130, 150, 170], ... "cost": [9, 13, 15, 16], ... "cores": [4, 2, 1, 4], ... } ... ) >>> east.join_where( ... west, ... pl.col("dur") < pl.col("time"), ... pl.col("rev") < pl.col("cost"), ... ) shape: (5, 8) ┌─────┬─────┬─────┬───────┬──────┬──────┬──────┬─────────────┐ │ id ┆ dur ┆ rev ┆ cores ┆ t_id ┆ time ┆ cost ┆ cores_right │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╪═══════╪══════╪══════╪══════╪═════════════╡ │ 100 ┆ 120 ┆ 12 ┆ 2 ┆ 498 ┆ 130 ┆ 13 ┆ 2 │ │ 100 ┆ 120 ┆ 12 ┆ 2 ┆ 676 ┆ 150 ┆ 15 ┆ 1 │ │ 100 ┆ 120 ┆ 12 ┆ 2 ┆ 742 ┆ 170 ┆ 16 ┆ 4 │ │ 101 ┆ 140 ┆ 14 ┆ 8 ┆ 676 ┆ 150 ┆ 15 ┆ 1 │ │ 101 ┆ 140 ┆ 14 ┆ 8 ┆ 742 ┆ 170 ┆ 16 ┆ 4 │ └─────┴─────┴─────┴───────┴──────┴──────┴──────┴─────────────┘
- lazy() LazyFrame [source]
Start a lazy query from this point. This returns a LazyFrame object.
Operations on a LazyFrame are not executed until this is triggered by calling one of:
.collect() (run on all data)
.explain() (print the query plan)
.show_graph() (show the query plan as graphviz graph)
.collect_schema() (return the final frame schema)
Lazy operations are recommended because they allow for query optimization and additional parallelism.
- Returns:
- LazyFrame
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [None, 2, 3, 4],
...         "b": [0.5, None, 2.5, 13],
...         "c": [True, True, False, None],
...     }
... )
>>> df.lazy()
<LazyFrame at ...>
- limit(n: int = 5) DataFrame [source]
Get the first n rows.
Alias for DataFrame.head().
- Parameters:
- n
Number of rows to return. If a negative value is passed, return all rows except the last abs(n).
Examples
Get the first 3 rows of a DataFrame.
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 4, 5], ... "bar": [6, 7, 8, 9, 10], ... "ham": ["a", "b", "c", "d", "e"], ... } ... ) >>> df.limit(3) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘
- map_rows(
- function: Callable[[tuple[Any, ...]], Any],
- return_dtype: PolarsDataType | None = None,
- *,
- inference_size: int = 256,
Apply a custom/user-defined function (UDF) over the rows of the DataFrame.
Warning
This method is much slower than the native expressions API. Only use it if you cannot implement your logic otherwise.
The UDF will receive each row as a tuple of values: udf(row).
Implementing logic using a Python function is almost always significantly slower and more memory intensive than implementing the same logic using the native expression API because:
The native expression engine runs in Rust; UDFs run in Python.
Use of Python UDFs forces the DataFrame to be materialized in memory.
Polars-native expressions can be parallelised (UDFs typically cannot).
Polars-native expressions can be logically optimised (UDFs cannot).
Wherever possible you should strongly prefer the native expression API to achieve the best performance.
- Parameters:
- function
Custom function or lambda.
- return_dtype
Output type of the operation. If none given, Polars tries to infer the type.
- inference_size
Only used in the case when the custom function returns rows. This uses the first n rows to determine the output schema.
Notes
The frame-level map_rows cannot track column names (as the UDF is a black-box that may arbitrarily drop, rearrange, transform, or add new columns); if you want to apply a UDF such that column names are preserved, you should use the expression-level map_elements syntax instead.
If your function is expensive and you don’t want it to be called more than once for a given input, consider applying an @lru_cache decorator to it. If your data is suitable you may achieve significant speedups.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [-1, 5, 8]})
Return a DataFrame by mapping each row to a tuple:
>>> df.map_rows(lambda t: (t[0] * 2, t[1] * 3)) shape: (3, 2) ┌──────────┬──────────┐ │ column_0 ┆ column_1 │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════════╪══════════╡ │ 2 ┆ -3 │ │ 4 ┆ 15 │ │ 6 ┆ 24 │ └──────────┴──────────┘
However, it is much better to implement this with a native expression:
>>> df.select( ... pl.col("foo") * 2, ... pl.col("bar") * 3, ... )
Return a DataFrame with a single column by mapping each row to a scalar:
>>> df.map_rows(lambda t: (t[0] * 2 + t[1])) shape: (3, 1) ┌─────┐ │ map │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ │ 9 │ │ 14 │ └─────┘
In this case it is better to use the following native expression:
>>> df.select(pl.col("foo") * 2 + pl.col("bar"))
- max() DataFrame [source]
Aggregate the columns of this DataFrame to their maximum value.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.max() shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘
- max_horizontal() Series [source]
Get the maximum value horizontally across columns.
- Returns:
- Series
A Series named "max".
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [4.0, 5.0, 6.0], ... } ... ) >>> df.max_horizontal() shape: (3,) Series: 'max' [f64] [ 4.0 5.0 6.0 ]
- mean() DataFrame [source]
Aggregate the columns of this DataFrame to their mean value.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... "spam": [True, False, None], ... } ... ) >>> df.mean() shape: (1, 4) ┌─────┬─────┬──────┬──────┐ │ foo ┆ bar ┆ ham ┆ spam │ │ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str ┆ f64 │ ╞═════╪═════╪══════╪══════╡ │ 2.0 ┆ 7.0 ┆ null ┆ 0.5 │ └─────┴─────┴──────┴──────┘
- mean_horizontal(*, ignore_nulls: bool = True) Series [source]
Take the mean of all values horizontally across columns.
- Parameters:
- ignore_nulls
Ignore null values (default). If set to False, any null value in the input will lead to a null output.
- Returns:
- Series
A Series named "mean".
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [4.0, 5.0, 6.0], ... } ... ) >>> df.mean_horizontal() shape: (3,) Series: 'mean' [f64] [ 2.5 3.5 4.5 ]
- median() DataFrame [source]
Aggregate the columns of this DataFrame to their median value.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.median() shape: (1, 3) ┌─────┬─────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str │ ╞═════╪═════╪══════╡ │ 2.0 ┆ 7.0 ┆ null │ └─────┴─────┴──────┘
- melt(
- id_vars: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None,
- value_vars: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None,
- variable_name: str | None = None,
- value_name: str | None = None,
Unpivot a DataFrame from wide to long format.
Optionally leaves identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars) while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis leaving just two non-identifier columns, ‘variable’ and ‘value’.
Deprecated since version 1.0.0: Please use unpivot() instead.
- Parameters:
- id_vars
Column(s) or selector(s) to use as identifier variables.
- value_vars
Column(s) or selector(s) to use as value variables; if value_vars is empty, all columns that are not in id_vars will be used.
- variable_name
Name to give to the variable column. Defaults to “variable”.
- value_name
Name to give to the value column. Defaults to “value”.
- merge_sorted(
- other: DataFrame,
- key: str,
Take two sorted DataFrames and merge them by the sorted key.
The output of this operation will also be sorted. It is the caller's responsibility to ensure that the frames are sorted by that key; otherwise the output will not make sense.
The schemas of both DataFrames must be equal.
- Parameters:
- other
Other DataFrame that must be merged
- key
Key that is sorted.
Examples
>>> df0 = pl.DataFrame( ... {"name": ["steve", "elise", "bob"], "age": [42, 44, 18]} ... ).sort("age") >>> df0 shape: (3, 2) ┌───────┬─────┐ │ name ┆ age │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═══════╪═════╡ │ bob ┆ 18 │ │ steve ┆ 42 │ │ elise ┆ 44 │ └───────┴─────┘ >>> df1 = pl.DataFrame( ... {"name": ["anna", "megan", "steve", "thomas"], "age": [21, 33, 42, 20]} ... ).sort("age") >>> df1 shape: (4, 2) ┌────────┬─────┐ │ name ┆ age │ │ --- ┆ --- │ │ str ┆ i64 │ ╞════════╪═════╡ │ thomas ┆ 20 │ │ anna ┆ 21 │ │ megan ┆ 33 │ │ steve ┆ 42 │ └────────┴─────┘ >>> df0.merge_sorted(df1, key="age") shape: (7, 2) ┌────────┬─────┐ │ name ┆ age │ │ --- ┆ --- │ │ str ┆ i64 │ ╞════════╪═════╡ │ bob ┆ 18 │ │ thomas ┆ 20 │ │ anna ┆ 21 │ │ megan ┆ 33 │ │ steve ┆ 42 │ │ steve ┆ 42 │ │ elise ┆ 44 │ └────────┴─────┘
- min() DataFrame [source]
Aggregate the columns of this DataFrame to their minimum value.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.min() shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ └─────┴─────┴─────┘
- min_horizontal() Series [source]
Get the minimum value horizontally across columns.
- Returns:
- Series
A Series named "min".
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [4.0, 5.0, 6.0], ... } ... ) >>> df.min_horizontal() shape: (3,) Series: 'min' [f64] [ 1.0 2.0 3.0 ]
- n_chunks(strategy: Literal['first', 'all'] = 'first') int | list[int] [source]
Get number of chunks used by the ChunkedArrays of this DataFrame.
- Parameters:
- strategy{‘first’, ‘all’}
Return the number of chunks of the ‘first’ column, or ‘all’ columns in this DataFrame.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [0.5, 4, 10, 13],
...         "c": [True, True, False, True],
...     }
... )
>>> df.n_chunks()
1
>>> df.n_chunks(strategy="all")
[1, 1, 1]
- n_unique(subset: str | Expr | Sequence[str | Expr] | None = None) int [source]
Return the number of unique rows, or the number of unique row-subsets.
- Parameters:
- subset
One or more columns/expressions that define what to count; omit to return the count of unique rows.
Notes
This method operates at the DataFrame level; to operate on subsets at the expression level you can make use of struct-packing instead, for example:
>>> expr_unique_subset = pl.struct("a", "b").n_unique()
If instead you want to count the number of unique values per-column, you can also use expression-level syntax to return a new frame containing that result:
>>> df = pl.DataFrame(
...     [[1, 2, 3], [1, 2, 4]], schema=["a", "b", "c"], orient="row"
... )
>>> df_nunique = df.select(pl.all().n_unique())
In aggregate context there is also an equivalent method for returning the number of unique values per group:
>>> df_agg_nunique = df.group_by("a").n_unique()
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 1, 2, 3, 4, 5],
...         "b": [0.5, 0.5, 1.0, 2.0, 3.0, 3.0],
...         "c": [True, True, True, False, True, True],
...     }
... )
>>> df.n_unique()
5
Simple columns subset.
>>> df.n_unique(subset=["b", "c"])
4
Expression subset.
>>> df.n_unique(
...     subset=[
...         (pl.col("a") // 2),
...         (pl.col("c") | (pl.col("b") >= 2)),
...     ],
... )
3
- null_count() DataFrame [source]
Create a new DataFrame that shows the null counts per column.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, None, 3], ... "bar": [6, 7, None], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.null_count() shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ u32 ┆ u32 ┆ u32 │ ╞═════╪═════╪═════╡ │ 1 ┆ 1 ┆ 0 │ └─────┴─────┴─────┘
- partition_by(
- by: ColumnNameOrSelector | Sequence[ColumnNameOrSelector],
- *more_by: ColumnNameOrSelector,
- maintain_order: bool = True,
- include_key: bool = True,
- as_dict: bool = False,
Group by the given columns and return the groups as separate dataframes.
- Parameters:
- by
Column name(s) or selector(s) to group by.
- *more_by
Additional names of columns to group by, specified as positional arguments.
- maintain_order
Ensure that the order of the groups is consistent with the input data. This is slower than a default partition by operation.
- include_key
Include the columns used to partition the DataFrame in the output.
- as_dict
Return a dictionary instead of a list. The dictionary keys are tuples of the distinct group values that identify each group.
Examples
Pass a single column name to partition by that column.
>>> df = pl.DataFrame( ... { ... "a": ["a", "b", "a", "b", "c"], ... "b": [1, 2, 1, 3, 3], ... "c": [5, 4, 3, 2, 1], ... } ... ) >>> df.partition_by("a") [shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 1 ┆ 5 │ │ a ┆ 1 ┆ 3 │ └─────┴─────┴─────┘, shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ b ┆ 2 ┆ 4 │ │ b ┆ 3 ┆ 2 │ └─────┴─────┴─────┘, shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ c ┆ 3 ┆ 1 │ └─────┴─────┴─────┘]
Partition by multiple columns by either passing a list of column names, or by specifying each column name as a positional argument.
>>> df.partition_by("a", "b") [shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 1 ┆ 5 │ │ a ┆ 1 ┆ 3 │ └─────┴─────┴─────┘, shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ b ┆ 2 ┆ 4 │ └─────┴─────┴─────┘, shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ b ┆ 3 ┆ 2 │ └─────┴─────┴─────┘, shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ c ┆ 3 ┆ 1 │ └─────┴─────┴─────┘]
Return the partitions as a dictionary by specifying as_dict=True.
>>> import polars.selectors as cs
>>> df.partition_by(cs.string(), as_dict=True)
{('a',): shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 1   ┆ 5   │
│ a   ┆ 1   ┆ 3   │
└─────┴─────┴─────┘, ('b',): shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ b   ┆ 2   ┆ 4   │
│ b   ┆ 3   ┆ 2   │
└─────┴─────┴─────┘, ('c',): shape: (1, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ c   ┆ 3   ┆ 1   │
└─────┴─────┴─────┘}
- pipe(
- function: Callable[Concatenate[DataFrame, P], T],
- *args: P.args,
- **kwargs: P.kwargs,
Offers a structured way to apply a sequence of user-defined functions (UDFs).
- Parameters:
- function
Callable; will receive the frame as the first parameter, followed by any given args/kwargs.
- *args
Arguments to pass to the UDF.
- **kwargs
Keyword arguments to pass to the UDF.
Notes
It is recommended to use LazyFrame when piping operations, in order to fully take advantage of query optimization and parallelization. See df.lazy().
Examples
>>> def cast_str_to_int(data, col_name): ... return data.with_columns(pl.col(col_name).cast(pl.Int64)) >>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": ["10", "20", "30", "40"]}) >>> df.pipe(cast_str_to_int, col_name="b") shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 10 │ │ 2 ┆ 20 │ │ 3 ┆ 30 │ │ 4 ┆ 40 │ └─────┴─────┘
>>> df = pl.DataFrame({"b": [1, 2], "a": [3, 4]}) >>> df shape: (2, 2) ┌─────┬─────┐ │ b ┆ a │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 3 │ │ 2 ┆ 4 │ └─────┴─────┘ >>> df.pipe(lambda tdf: tdf.select(sorted(tdf.columns))) shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 3 ┆ 1 │ │ 4 ┆ 2 │ └─────┴─────┘
- pivot(
- on: ColumnNameOrSelector | Sequence[ColumnNameOrSelector],
- *,
- index: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None,
- values: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None,
- aggregate_function: PivotAgg | Expr | None = None,
- maintain_order: bool = True,
- sort_columns: bool = False,
- separator: str = '_',
Create a spreadsheet-style pivot table as a DataFrame.
Only available in eager mode. See “Examples” section below for how to do a “lazy pivot” if you know the unique column values in advance.
- Parameters:
- on
The column(s) whose values will be used as the new columns of the output DataFrame.
- index
The column(s) that remain from the input to the output. The output DataFrame will have one row for each unique combination of the index’s values. If None, all remaining columns not specified in on and values will be used. At least one of index and values must be specified.
- values
The existing column(s) of values which will be moved under the new columns from index. If an aggregation is specified, these are the values on which the aggregation will be computed. If None, all remaining columns not specified in on and index will be used. At least one of index and values must be specified.
- aggregate_function
Choose from:
None: no aggregation takes place, will raise error if multiple values are in group.
A predefined aggregate function string, one of {‘min’, ‘max’, ‘first’, ‘last’, ‘sum’, ‘mean’, ‘median’, ‘len’}
An expression to do the aggregation.
- maintain_order
Sort the grouped keys so that the output order is predictable.
- sort_columns
Sort the transposed columns by name. Default is by order of discovery.
- separator
Used as separator/delimiter in generated column names in case of multiple values columns.
- Returns:
- DataFrame
Notes
In some other frameworks, you might know this operation as pivot_wider.
Examples
You can use pivot to reshape a dataframe from “long” to “wide” format.
For example, suppose we have a dataframe of test scores achieved by some students, where each row represents a distinct test.
>>> df = pl.DataFrame(
...     {
...         "name": ["Cady", "Cady", "Karen", "Karen"],
...         "subject": ["maths", "physics", "maths", "physics"],
...         "test_1": [98, 99, 61, 58],
...         "test_2": [100, 100, 60, 60],
...     }
... )
>>> df
shape: (4, 4)
┌───────┬─────────┬────────┬────────┐
│ name  ┆ subject ┆ test_1 ┆ test_2 │
│ ---   ┆ ---     ┆ ---    ┆ ---    │
│ str   ┆ str     ┆ i64    ┆ i64    │
╞═══════╪═════════╪════════╪════════╡
│ Cady  ┆ maths   ┆ 98     ┆ 100    │
│ Cady  ┆ physics ┆ 99     ┆ 100    │
│ Karen ┆ maths   ┆ 61     ┆ 60     │
│ Karen ┆ physics ┆ 58     ┆ 60     │
└───────┴─────────┴────────┴────────┘
Using pivot, we can reshape so we have one row per student, with different subjects as columns, and their test_1 scores as values:
>>> df.pivot("subject", index="name", values="test_1")
shape: (2, 3)
┌───────┬───────┬─────────┐
│ name  ┆ maths ┆ physics │
│ ---   ┆ ---   ┆ ---     │
│ str   ┆ i64   ┆ i64     │
╞═══════╪═══════╪═════════╡
│ Cady  ┆ 98    ┆ 99      │
│ Karen ┆ 61    ┆ 58      │
└───────┴───────┴─────────┘
You can use selectors too - here we include all test scores in the pivoted table:
>>> import polars.selectors as cs
>>> df.pivot("subject", values=cs.starts_with("test"))
shape: (2, 5)
┌───────┬──────────────┬────────────────┬──────────────┬────────────────┐
│ name  ┆ test_1_maths ┆ test_1_physics ┆ test_2_maths ┆ test_2_physics │
│ ---   ┆ ---          ┆ ---            ┆ ---          ┆ ---            │
│ str   ┆ i64          ┆ i64            ┆ i64          ┆ i64            │
╞═══════╪══════════════╪════════════════╪══════════════╪════════════════╡
│ Cady  ┆ 98           ┆ 99             ┆ 100          ┆ 100            │
│ Karen ┆ 61           ┆ 58             ┆ 60           ┆ 60             │
└───────┴──────────────┴────────────────┴──────────────┴────────────────┘
If you end up with multiple values per cell, you can specify how to aggregate them with aggregate_function:
>>> df = pl.DataFrame(
...     {
...         "ix": [1, 1, 2, 2, 1, 2],
...         "col": ["a", "a", "a", "a", "b", "b"],
...         "foo": [0, 1, 2, 2, 7, 1],
...         "bar": [0, 2, 0, 0, 9, 4],
...     }
... )
>>> df.pivot("col", index="ix", aggregate_function="sum")
shape: (2, 5)
┌─────┬───────┬───────┬───────┬───────┐
│ ix  ┆ foo_a ┆ foo_b ┆ bar_a ┆ bar_b │
│ --- ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64 ┆ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═════╪═══════╪═══════╪═══════╪═══════╡
│ 1   ┆ 1     ┆ 7     ┆ 2     ┆ 9     │
│ 2   ┆ 4     ┆ 1     ┆ 0     ┆ 4     │
└─────┴───────┴───────┴───────┴───────┘
You can also pass a custom aggregation function using polars.element():
>>> df = pl.DataFrame(
...     {
...         "col1": ["a", "a", "a", "b", "b", "b"],
...         "col2": ["x", "x", "x", "x", "y", "y"],
...         "col3": [6, 7, 3, 2, 5, 7],
...     }
... )
>>> df.pivot(
...     "col2",
...     index="col1",
...     values="col3",
...     aggregate_function=pl.element().tanh().mean(),
... )
shape: (2, 3)
┌──────┬──────────┬──────────┐
│ col1 ┆ x        ┆ y        │
│ ---  ┆ ---      ┆ ---      │
│ str  ┆ f64      ┆ f64      │
╞══════╪══════════╪══════════╡
│ a    ┆ 0.998347 ┆ null     │
│ b    ┆ 0.964028 ┆ 0.999954 │
└──────┴──────────┴──────────┘
Note that pivot is only available in eager mode. If you know the unique column values in advance, you can use polars.LazyFrame.group_by() to get the same result as above in lazy mode:
>>> index = pl.col("col1")
>>> on = pl.col("col2")
>>> values = pl.col("col3")
>>> unique_column_values = ["x", "y"]
>>> aggregate_function = lambda col: col.tanh().mean()
>>> df.lazy().group_by(index).agg(
...     aggregate_function(values.filter(on == value)).alias(value)
...     for value in unique_column_values
... ).collect()
shape: (2, 3)
┌──────┬──────────┬──────────┐
│ col1 ┆ x        ┆ y        │
│ ---  ┆ ---      ┆ ---      │
│ str  ┆ f64      ┆ f64      │
╞══════╪══════════╪══════════╡
│ a    ┆ 0.998347 ┆ null     │
│ b    ┆ 0.964028 ┆ 0.999954 │
└──────┴──────────┴──────────┘
- property plot: DataFramePlot[source]
Create a plot namespace.
Warning
This functionality is currently considered unstable. It may be changed at any point without it being considered a breaking change.
Changed in version 1.6.0: In prior versions of Polars, HvPlot was the plotting backend. If you would like to restore the previous plotting functionality, all you need to do is add import hvplot.polars at the top of your script and replace df.plot with df.hvplot.
Polars does not implement plotting logic itself, but instead defers to Altair:
df.plot.line(**kwargs) is shorthand for alt.Chart(df).mark_line(tooltip=True).encode(**kwargs).interactive()
df.plot.point(**kwargs) is shorthand for alt.Chart(df).mark_point(tooltip=True).encode(**kwargs).interactive() (and plot.scatter is provided as an alias)
df.plot.bar(**kwargs) is shorthand for alt.Chart(df).mark_bar(tooltip=True).encode(**kwargs).interactive()
for any other attribute attr, df.plot.attr(**kwargs) is shorthand for alt.Chart(df).mark_attr(tooltip=True).encode(**kwargs).interactive()
For configuration, we suggest reading Chart Configuration. For example, you can:
Change the width/height/title with .properties(width=500, height=350, title="My amazing plot").
Change the x-axis label rotation with .configure_axisX(labelAngle=30).
Change the opacity of the points in your scatter plot with .configure_point(opacity=.5).
Examples
Scatter plot:
>>> df = pl.DataFrame(
...     {
...         "length": [1, 4, 6],
...         "width": [4, 5, 6],
...         "species": ["setosa", "setosa", "versicolor"],
...     }
... )
>>> df.plot.point(x="length", y="width", color="species")
Set the x-axis title by using altair.X:
>>> import altair as alt
>>> df.plot.point(
...     x=alt.X("length", title="Length"), y="width", color="species"
... )
Line plot:
>>> from datetime import date
>>> df = pl.DataFrame(
...     {
...         "date": [date(2020, 1, 2), date(2020, 1, 3), date(2020, 1, 4)] * 2,
...         "price": [1, 4, 6, 1, 5, 2],
...         "stock": ["a", "a", "a", "b", "b", "b"],
...     }
... )
>>> df.plot.line(x="date", y="price", color="stock")
Bar plot:
>>> df = pl.DataFrame(
...     {
...         "day": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"] * 2,
...         "group": ["a"] * 7 + ["b"] * 7,
...         "value": [1, 3, 2, 4, 5, 6, 1, 1, 3, 2, 4, 5, 1, 2],
...     }
... )
>>> df.plot.bar(
...     x="day", y="value", color="day", column="group"
... )
Or, to make a stacked version of the plot above:
>>> df.plot.bar(x="day", y="value", color="group")
- product() DataFrame [source]
Aggregate the columns of this DataFrame to their product values.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3],
...         "b": [0.5, 4, 10],
...         "c": [True, True, False],
...     }
... )
>>> df.product()
shape: (1, 3)
┌─────┬──────┬─────┐
│ a   ┆ b    ┆ c   │
│ --- ┆ ---  ┆ --- │
│ i64 ┆ f64  ┆ i64 │
╞═════╪══════╪═════╡
│ 6   ┆ 20.0 ┆ 0   │
└─────┴──────┴─────┘
- quantile(
- quantile: float,
- interpolation: RollingInterpolationMethod = 'nearest',
Aggregate the columns of this DataFrame to their quantile value.
- Parameters:
- quantile
Quantile between 0.0 and 1.0.
- interpolation{‘nearest’, ‘higher’, ‘lower’, ‘midpoint’, ‘linear’}
Interpolation method.
Examples
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.quantile(0.5, "nearest")
shape: (1, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ ham  │
│ --- ┆ --- ┆ ---  │
│ f64 ┆ f64 ┆ str  │
╞═════╪═════╪══════╡
│ 2.0 ┆ 7.0 ┆ null │
└─────┴─────┴──────┘
- rechunk() DataFrame [source]
Rechunk the data in this DataFrame to a contiguous allocation.
This will make sure all subsequent operations have optimal and predictable performance.
- rename( ) DataFrame [source]
Rename column names.
- Parameters:
- mapping
Key value pairs that map from old name to new name, or a function that takes the old name as input and returns the new name.
- strict
Validate that all column names exist in the current schema, and throw an exception if any do not. (Note that this parameter is a no-op when passing a function to mapping).
Examples
>>> df = pl.DataFrame(
...     {"foo": [1, 2, 3], "bar": [6, 7, 8], "ham": ["a", "b", "c"]}
... )
>>> df.rename({"foo": "apple"})
shape: (3, 3)
┌───────┬─────┬─────┐
│ apple ┆ bar ┆ ham │
│ ---   ┆ --- ┆ --- │
│ i64   ┆ i64 ┆ str │
╞═══════╪═════╪═════╡
│ 1     ┆ 6   ┆ a   │
│ 2     ┆ 7   ┆ b   │
│ 3     ┆ 8   ┆ c   │
└───────┴─────┴─────┘
>>> df.rename(lambda column_name: "c" + column_name[1:])
shape: (3, 3)
┌─────┬─────┬─────┐
│ coo ┆ car ┆ cam │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
│ 2   ┆ 7   ┆ b   │
│ 3   ┆ 8   ┆ c   │
└─────┴─────┴─────┘
- replace_column(index: int, column: Series) DataFrame [source]
Replace a column at an index location.
This operation is in place.
- Parameters:
- index
Column index.
- column
Series that will replace the column.
Examples
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> s = pl.Series("apple", [10, 20, 30])
>>> df.replace_column(0, s)
shape: (3, 3)
┌───────┬─────┬─────┐
│ apple ┆ bar ┆ ham │
│ ---   ┆ --- ┆ --- │
│ i64   ┆ i64 ┆ str │
╞═══════╪═════╪═════╡
│ 10    ┆ 6   ┆ a   │
│ 20    ┆ 7   ┆ b   │
│ 30    ┆ 8   ┆ c   │
└───────┴─────┴─────┘
- reverse() DataFrame [source]
Reverse the DataFrame.
Examples
>>> df = pl.DataFrame(
...     {
...         "key": ["a", "b", "c"],
...         "val": [1, 2, 3],
...     }
... )
>>> df.reverse()
shape: (3, 2)
┌─────┬─────┐
│ key ┆ val │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ c   ┆ 3   │
│ b   ┆ 2   │
│ a   ┆ 1   │
└─────┴─────┘
- rolling(
- index_column: IntoExpr,
- *,
- period: str | timedelta,
- offset: str | timedelta | None = None,
- closed: ClosedInterval = 'right',
- group_by: IntoExpr | Iterable[IntoExpr] | None = None,
Create rolling groups based on a temporal or integer column.
Unlike group_by_dynamic, the windows are determined by the individual values and are not of constant intervals. For constant intervals use DataFrame.group_by_dynamic().
If you have a time series <t_0, t_1, ..., t_n>, then by default the windows created will be
(t_0 - period, t_0]
(t_1 - period, t_1]
…
(t_n - period, t_n]
whereas if you pass a non-default offset, then the windows will be
(t_0 + offset, t_0 + offset + period]
(t_1 + offset, t_1 + offset + period]
…
(t_n + offset, t_n + offset + period]
The period and offset arguments are created either from a timedelta, or by using the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds
By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.
- Parameters:
- index_column
Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if group_by is specified, then it must be sorted in ascending order within each group).
In case of a rolling operation on indices, dtype needs to be one of {UInt32, UInt64, Int32, Int64}. Note that the first three get temporarily cast to Int64, so if performance matters use an Int64 column.
- period
Length of the window - must be non-negative.
- offset
Offset of the window. Default is -period.
- closed{‘right’, ‘left’, ‘both’, ‘none’}
Define which sides of the temporal interval are closed (inclusive).
- group_by
Also group by this column/these columns
- Returns:
- RollingGroupBy
Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if group_by columns are passed, it will only be sorted within each group).
See also
Examples
>>> dates = [
...     "2020-01-01 13:45:48",
...     "2020-01-01 16:42:13",
...     "2020-01-01 16:45:09",
...     "2020-01-02 18:12:48",
...     "2020-01-03 19:45:32",
...     "2020-01-08 23:16:43",
... ]
>>> df = pl.DataFrame({"dt": dates, "a": [3, 7, 5, 9, 2, 1]}).with_columns(
...     pl.col("dt").str.strptime(pl.Datetime).set_sorted()
... )
>>> out = df.rolling(index_column="dt", period="2d").agg(
...     [
...         pl.sum("a").alias("sum_a"),
...         pl.min("a").alias("min_a"),
...         pl.max("a").alias("max_a"),
...     ]
... )
>>> assert out["sum_a"].to_list() == [3, 10, 15, 24, 11, 1]
>>> assert out["max_a"].to_list() == [3, 7, 7, 9, 9, 1]
>>> assert out["min_a"].to_list() == [3, 3, 3, 3, 2, 1]
>>> out
shape: (6, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ dt                  ┆ sum_a ┆ min_a ┆ max_a │
│ ---                 ┆ ---   ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ i64   ┆ i64   ┆ i64   │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 13:45:48 ┆ 3     ┆ 3     ┆ 3     │
│ 2020-01-01 16:42:13 ┆ 10    ┆ 3     ┆ 7     │
│ 2020-01-01 16:45:09 ┆ 15    ┆ 3     ┆ 7     │
│ 2020-01-02 18:12:48 ┆ 24    ┆ 3     ┆ 9     │
│ 2020-01-03 19:45:32 ┆ 11    ┆ 2     ┆ 9     │
│ 2020-01-08 23:16:43 ┆ 1     ┆ 1     ┆ 1     │
└─────────────────────┴───────┴───────┴───────┘
If you use an index count in period or offset, then it’s based on the values in index_column:
>>> df = pl.DataFrame({"int": [0, 4, 5, 6, 8], "value": [1, 4, 2, 4, 1]})
>>> df.rolling("int", period="3i").agg(pl.col("int").alias("aggregated"))
shape: (5, 2)
┌─────┬────────────┐
│ int ┆ aggregated │
│ --- ┆ ---        │
│ i64 ┆ list[i64]  │
╞═════╪════════════╡
│ 0   ┆ [0]        │
│ 4   ┆ [4]        │
│ 5   ┆ [4, 5]     │
│ 6   ┆ [4, 5, 6]  │
│ 8   ┆ [6, 8]     │
└─────┴────────────┘
If you want the index count to be based on row number, then you may want to combine rolling with with_row_index().
- row( ) tuple[Any, ...] | dict[str, Any] [source]
Get the values of a single row, either by index or by predicate.
- Parameters:
- index
Row index.
- by_predicate
Select the row according to a given expression/predicate.
- named
Return a dictionary instead of a tuple. The dictionary is a mapping of column name to row value. This is more expensive than returning a regular tuple, but allows for accessing values by column name.
- Returns:
- tuple (default) or dictionary of row values
Warning
You should NEVER use this method to iterate over a DataFrame; if you require row-iteration you should strongly prefer use of iter_rows() instead.
See also
Notes
The index and by_predicate params are mutually exclusive. Additionally, to ensure clarity, the by_predicate parameter must be supplied by keyword.
When using by_predicate it is an error condition if anything other than one row is returned; more than one row raises TooManyRowsReturnedError, and zero rows will raise NoRowsReturnedError (both inherit from RowsError).
Examples
Specify an index to return the row at the given index as a tuple.
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.row(2)
(3, 8, 'c')
Specify named=True to get a dictionary instead with a mapping of column names to row values.
>>> df.row(2, named=True)
{'foo': 3, 'bar': 8, 'ham': 'c'}
Use by_predicate to return the row that matches the given predicate.
>>> df.row(by_predicate=(pl.col("ham") == "b"))
(2, 7, 'b')
- rows(
- *,
- named: bool = False,
Returns all data in the DataFrame as a list of rows of python-native values.
By default, each row is returned as a tuple of values given in the same order as the frame columns. Setting named=True will return rows of dictionaries instead.
- Parameters:
- named
Return dictionaries instead of tuples. The dictionaries are a mapping of column name to row value. This is more expensive than returning a regular tuple, but allows for accessing values by column name.
- Returns:
- list of row value tuples (default), or list of dictionaries (if named=True).
Warning
Row-iteration is not optimal as the underlying data is stored in columnar form; where possible, prefer export via one of the dedicated export/output methods. You should also consider using iter_rows instead, to avoid materialising all the data at once; there is little performance difference between the two, but peak memory can be reduced if processing rows in batches.
See also
iter_rows
Row iterator over frame data (does not materialise all rows).
rows_by_key
Materialises frame data as a key-indexed dictionary.
Notes
If you have ns-precision temporal values you should be aware that Python natively only supports up to μs-precision; ns-precision values will be truncated to microseconds on conversion to Python. If this matters to your use-case you should export to a different format (such as Arrow or NumPy).
Examples
>>> df = pl.DataFrame(
...     {
...         "x": ["a", "b", "b", "a"],
...         "y": [1, 2, 3, 4],
...         "z": [0, 3, 6, 9],
...     }
... )
>>> df.rows()
[('a', 1, 0), ('b', 2, 3), ('b', 3, 6), ('a', 4, 9)]
>>> df.rows(named=True)
[{'x': 'a', 'y': 1, 'z': 0},
 {'x': 'b', 'y': 2, 'z': 3},
 {'x': 'b', 'y': 3, 'z': 6},
 {'x': 'a', 'y': 4, 'z': 9}]
- rows_by_key(
- key: ColumnNameOrSelector | Sequence[ColumnNameOrSelector],
- *,
- named: bool = False,
- include_key: bool = False,
- unique: bool = False,
Returns all data as a dictionary of python-native values keyed by some column.
This method is like rows, but instead of returning rows in a flat list, rows are grouped by the values in the key column(s) and returned as a dictionary.
Note that this method should not be used in place of native operations, due to the high cost of materializing all frame data out into a dictionary; it should be used only when you need to move the values out into a Python data structure or other object that cannot operate directly with Polars/Arrow.
- Parameters:
- key
The column(s) to use as the key for the returned dictionary. If multiple columns are specified, the key will be a tuple of those values, otherwise it will be the value from that single column.
- named
Return dictionary rows instead of tuples, mapping column name to row value.
- include_key
Include key values inline with the associated data (by default the key values are omitted as a memory/performance optimisation, as they can be reconstructed from the key).
- unique
Indicate that the key is unique; this will result in a 1:1 mapping from key to a single associated row. Note that if the key is not actually unique the last row with the given key will be returned.
See also
Notes
If you have ns-precision temporal values you should be aware that Python natively only supports up to μs-precision; ns-precision values will be truncated to microseconds on conversion to Python. If this matters to your use-case you should export to a different format (such as Arrow or NumPy).
Examples
>>> df = pl.DataFrame(
...     {
...         "w": ["a", "b", "b", "a"],
...         "x": ["q", "q", "q", "k"],
...         "y": [1.0, 2.5, 3.0, 4.5],
...         "z": [9, 8, 7, 6],
...     }
... )
Group rows by the given key column(s):
>>> df.rows_by_key(key=["w"])
defaultdict(<class 'list'>,
    {'a': [('q', 1.0, 9), ('k', 4.5, 6)],
     'b': [('q', 2.5, 8), ('q', 3.0, 7)]})
Return the same row groupings as dictionaries:
>>> df.rows_by_key(key=["w"], named=True)
defaultdict(<class 'list'>,
    {'a': [{'x': 'q', 'y': 1.0, 'z': 9},
           {'x': 'k', 'y': 4.5, 'z': 6}],
     'b': [{'x': 'q', 'y': 2.5, 'z': 8},
           {'x': 'q', 'y': 3.0, 'z': 7}]})
Return row groupings, assuming keys are unique:
>>> df.rows_by_key(key=["z"], unique=True)
{9: ('a', 'q', 1.0),
 8: ('b', 'q', 2.5),
 7: ('b', 'q', 3.0),
 6: ('a', 'k', 4.5)}
Return row groupings as dictionaries, assuming keys are unique:
>>> df.rows_by_key(key=["z"], named=True, unique=True)
{9: {'w': 'a', 'x': 'q', 'y': 1.0},
 8: {'w': 'b', 'x': 'q', 'y': 2.5},
 7: {'w': 'b', 'x': 'q', 'y': 3.0},
 6: {'w': 'a', 'x': 'k', 'y': 4.5}}
Return dictionary rows grouped by a compound key, including key values:
>>> df.rows_by_key(key=["w", "x"], named=True, include_key=True)
defaultdict(<class 'list'>,
    {('a', 'q'): [{'w': 'a', 'x': 'q', 'y': 1.0, 'z': 9}],
     ('b', 'q'): [{'w': 'b', 'x': 'q', 'y': 2.5, 'z': 8},
                  {'w': 'b', 'x': 'q', 'y': 3.0, 'z': 7}],
     ('a', 'k'): [{'w': 'a', 'x': 'k', 'y': 4.5, 'z': 6}]})
- sample(
- n: int | Series | None = None,
- *,
- fraction: float | Series | None = None,
- with_replacement: bool = False,
- shuffle: bool = False,
- seed: int | None = None,
Sample from this DataFrame.
- Parameters:
- n
Number of items to return. Cannot be used with fraction. Defaults to 1 if fraction is None.
- fraction
Fraction of items to return. Cannot be used with n.
- with_replacement
Allow values to be sampled more than once.
- shuffle
If set to True, the order of the sampled rows will be shuffled. If set to False (default), the order of the returned rows will be neither stable nor fully random.
- seed
Seed for the random number generator. If set to None (default), a random seed is generated for each sample operation.
Examples
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.sample(n=2, seed=0)
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8   ┆ c   │
│ 2   ┆ 7   ┆ b   │
└─────┴─────┴─────┘
- property schema: Schema[source]
Get an ordered mapping of column names to their data type.
Examples
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.schema
Schema({'foo': Int64, 'bar': Float64, 'ham': String})
- select(
- *exprs: IntoExpr | Iterable[IntoExpr],
- **named_exprs: IntoExpr,
Select columns from this DataFrame.
- Parameters:
- *exprs
Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- **named_exprs
Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.
Examples
Pass the name of a column to select that column.
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.select("foo")
shape: (3, 1)
┌─────┐
│ foo │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘
Multiple columns can be selected by passing a list of column names.
>>> df.select(["foo", "bar"])
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 6   │
│ 2   ┆ 7   │
│ 3   ┆ 8   │
└─────┴─────┘
Multiple columns can also be selected using positional arguments instead of a list. Expressions are also accepted.
>>> df.select(pl.col("foo"), pl.col("bar") + 1)
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
│ 3   ┆ 9   │
└─────┴─────┘
Use keyword arguments to easily name your expression inputs.
>>> df.select(threshold=pl.when(pl.col("foo") > 2).then(10).otherwise(0))
shape: (3, 1)
┌───────────┐
│ threshold │
│ ---       │
│ i32       │
╞═══════════╡
│ 0         │
│ 0         │
│ 10        │
└───────────┘
Expressions with multiple outputs can be automatically instantiated as Structs by enabling the setting Config.set_auto_structify(True):
>>> with pl.Config(auto_structify=True):
...     df.select(
...         is_odd=(pl.col(pl.Int64) % 2 == 1).name.suffix("_is_odd"),
...     )
shape: (3, 1)
┌──────────────┐
│ is_odd       │
│ ---          │
│ struct[2]    │
╞══════════════╡
│ {true,false} │
│ {false,true} │
│ {true,false} │
└──────────────┘
- select_seq(
- *exprs: IntoExpr | Iterable[IntoExpr],
- **named_exprs: IntoExpr,
Select columns from this DataFrame.
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.
- Parameters:
- *exprs
Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- **named_exprs
Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.
See also
- serialize( ) bytes | str | None [source]
Serialize this DataFrame to a file, or to bytes or a string, in binary or JSON format.
- Parameters:
- file
File path or writable file-like object to which the result will be written. If set to None (default), the output is returned directly instead (as bytes for the binary format, or as a string for JSON).
- format
The format in which to serialize. Options:
"binary": Serialize to binary format (bytes). This is the default.
"json": Serialize to JSON format (string).
Notes
Serialization is not stable across Polars versions: a DataFrame serialized in one Polars version may not be deserializable in another Polars version.
Examples
Serialize the DataFrame into a binary representation.
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...     }
... )
>>> bytes = df.serialize()
>>> bytes
b'\xa1gcolumns\x82\xa4dnamecfoohdatatypeeInt64lbit_settings\x00fvalues\x83...'
The bytes can later be deserialized back into a DataFrame.
>>> import io
>>> pl.DataFrame.deserialize(io.BytesIO(bytes))
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 6   │
│ 2   ┆ 7   │
│ 3   ┆ 8   │
└─────┴─────┘
- set_sorted( ) DataFrame [source]
Indicate that one or multiple columns are sorted.
This can speed up future operations.
- Parameters:
- column
Column(s) that are sorted.
- descending
Whether the columns are sorted in descending order.
Warning
This can lead to incorrect results if the data is NOT sorted!! Use with care!
- property shape: tuple[int, int][source]
Get the shape of the DataFrame.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]})
>>> df.shape
(5, 1)
- shift(n: int = 1, *, fill_value: IntoExpr | None = None) DataFrame [source]
Shift values by the given number of indices.
- Parameters:
- n
Number of indices to shift forward. If a negative value is passed, values are shifted in the opposite direction instead.
- fill_value
Fill the resulting null values with this value. Accepts expression input. Non-expression inputs are parsed as literals.
Notes
This method is similar to the LAG operation in SQL when the value for n is positive. With a negative value for n, it is similar to LEAD.
Examples
By default, values are shifted forward by one index.
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [5, 6, 7, 8],
...     }
... )
>>> df.shift()
shape: (4, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ null ┆ null │
│ 1    ┆ 5    │
│ 2    ┆ 6    │
│ 3    ┆ 7    │
└──────┴──────┘
Pass a negative value to shift in the opposite direction instead.
>>> df.shift(-2)
shape: (4, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 3    ┆ 7    │
│ 4    ┆ 8    │
│ null ┆ null │
│ null ┆ null │
└──────┴──────┘
Specify fill_value to fill the resulting null values.
>>> df.shift(-2, fill_value=100)
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 7   │
│ 4   ┆ 8   │
│ 100 ┆ 100 │
│ 100 ┆ 100 │
└─────┴─────┘
- shrink_to_fit(*, in_place: bool = False) DataFrame [source]
Shrink DataFrame memory usage.
Shrinks to fit the exact capacity needed to hold the data.
- slice( ) DataFrame [source]
Get a slice of this DataFrame.
- Parameters:
- offset
Start index. Negative indexing is supported.
- length
Length of the slice. If set to None, all rows starting at the offset will be selected.
Examples
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.slice(1, 2)
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 2   ┆ 7.0 ┆ b   │
│ 3   ┆ 8.0 ┆ c   │
└─────┴─────┴─────┘
- sort(
- by: IntoExpr | Iterable[IntoExpr],
- *more_by: IntoExpr,
- descending: bool | Sequence[bool] = False,
- nulls_last: bool | Sequence[bool] = False,
- multithreaded: bool = True,
- maintain_order: bool = False,
Sort the dataframe by the given columns.
- Parameters:
- by
Column(s) to sort by. Accepts expression input, including selectors. Strings are parsed as column names.
- *more_by
Additional columns to sort by, specified as positional arguments.
- descending
Sort in descending order. When sorting by multiple columns, can be specified per column by passing a sequence of booleans.
- nulls_last
Place null values last; can specify a single boolean applying to all columns or a sequence of booleans for per-column control.
- multithreaded
Sort using multiple threads.
- maintain_order
Whether the order should be maintained if elements are equal.
Examples
Pass a single column name to sort by that column.
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, None],
...         "b": [6.0, 5.0, 4.0],
...         "c": ["a", "c", "b"],
...     }
... )
>>> df.sort("a")
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ null ┆ 4.0 ┆ b   │
│ 1    ┆ 6.0 ┆ a   │
│ 2    ┆ 5.0 ┆ c   │
└──────┴─────┴─────┘
Sorting by expressions is also supported.
>>> df.sort(pl.col("a") + pl.col("b") * 2, nulls_last=True)
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ 2    ┆ 5.0 ┆ c   │
│ 1    ┆ 6.0 ┆ a   │
│ null ┆ 4.0 ┆ b   │
└──────┴─────┴─────┘
Sort by multiple columns by passing a list of columns.
>>> df.sort(["c", "a"], descending=True)
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ 2    ┆ 5.0 ┆ c   │
│ null ┆ 4.0 ┆ b   │
│ 1    ┆ 6.0 ┆ a   │
└──────┴─────┴─────┘
Or use positional arguments to sort by multiple columns in the same way.
>>> df.sort("c", "a", descending=[False, True])
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ 1    ┆ 6.0 ┆ a   │
│ null ┆ 4.0 ┆ b   │
│ 2    ┆ 5.0 ┆ c   │
└──────┴─────┴─────┘
- sql( ) DataFrame [source]
Execute a SQL query against the DataFrame.
Added in version 0.20.24.
Warning
This functionality is considered unstable, although it is close to being considered stable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- query
SQL query to execute.
- table_name
Optionally provide an explicit name for the table that represents the calling frame (defaults to “self”).
See also
Notes
The calling frame is automatically registered as a table in the SQL context under the name “self”. If you want access to the DataFrames and LazyFrames found in the current globals, use the top-level pl.sql.
More control over registration and execution behaviour is available by using the SQLContext object.
The SQL query executes in lazy mode before being collected and returned as a DataFrame.
Examples
>>> from datetime import date
>>> df1 = pl.DataFrame(
...     {
...         "a": [1, 2, 3],
...         "b": ["zz", "yy", "xx"],
...         "c": [date(1999, 12, 31), date(2010, 10, 10), date(2077, 8, 8)],
...     }
... )
Query the DataFrame using SQL:
>>> df1.sql("SELECT c, b FROM self WHERE a > 1")
shape: (2, 2)
┌────────────┬─────┐
│ c          ┆ b   │
│ ---        ┆ --- │
│ date       ┆ str │
╞════════════╪═════╡
│ 2010-10-10 ┆ yy  │
│ 2077-08-08 ┆ xx  │
└────────────┴─────┘
Apply transformations to a DataFrame using SQL, aliasing “self” to “frame”.
>>> df1.sql(
...     query='''
...         SELECT
...             a,
...             (a % 2 == 0) AS a_is_even,
...             CONCAT_WS(':', b, b) AS b_b,
...             EXTRACT(year FROM c) AS year,
...             0::float4 AS "zero",
...         FROM frame
...     ''',
...     table_name="frame",
... )
shape: (3, 5)
┌─────┬───────────┬───────┬──────┬──────┐
│ a   ┆ a_is_even ┆ b_b   ┆ year ┆ zero │
│ --- ┆ ---       ┆ ---   ┆ ---  ┆ ---  │
│ i64 ┆ bool      ┆ str   ┆ i32  ┆ f32  │
╞═════╪═══════════╪═══════╪══════╪══════╡
│ 1   ┆ false     ┆ zz:zz ┆ 1999 ┆ 0.0  │
│ 2   ┆ true      ┆ yy:yy ┆ 2010 ┆ 0.0  │
│ 3   ┆ false     ┆ xx:xx ┆ 2077 ┆ 0.0  │
└─────┴───────────┴───────┴──────┴──────┘
- std(ddof: int = 1) DataFrame [source]
Aggregate the columns of this DataFrame to their standard deviation value.
- Parameters:
- ddof
“Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.
Examples
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.std()
shape: (1, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ ham  │
│ --- ┆ --- ┆ ---  │
│ f64 ┆ f64 ┆ str  │
╞═════╪═════╪══════╡
│ 1.0 ┆ 1.0 ┆ null │
└─────┴─────┴──────┘
>>> df.std(ddof=0)
shape: (1, 3)
┌──────────┬──────────┬──────┐
│ foo      ┆ bar      ┆ ham  │
│ ---      ┆ ---      ┆ ---  │
│ f64      ┆ f64      ┆ str  │
╞══════════╪══════════╪══════╡
│ 0.816497 ┆ 0.816497 ┆ null │
└──────────┴──────────┴──────┘
- property style: GT[source]
Create a Great Table for styling.
Warning
This functionality is currently considered unstable. It may be changed at any point without it being considered a breaking change.
Polars does not implement styling logic itself, but instead defers to the Great Tables package. Please see the Great Tables reference for more information and documentation.
Examples
Import some styling helpers, and create example data:
>>> import polars.selectors as cs >>> from great_tables import loc, style >>> df = pl.DataFrame( ... { ... "site_id": [0, 1, 2], ... "measure_a": [5, 4, 6], ... "measure_b": [7, 3, 3], ... } ... )
Emphasize the site_id as row names:
>>> df.style.tab_stub(rowname_col="site_id")
Fill the background for the highest measure_a value row:
>>> df.style.tab_style( ... style.fill("yellow"), ... loc.body(rows=pl.col("measure_a") == pl.col("measure_a").max()), ... )
Put a spanner (high-level label) over measure columns:
>>> df.style.tab_spanner( ... "Measures", cs.starts_with("measure") ... )
Format measure_b values to two decimal places:
>>> df.style.fmt_number("measure_b", decimals=2)
- sum() DataFrame [source]
Aggregate the columns of this DataFrame to their sum value.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.sum() shape: (1, 3) ┌─────┬─────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪══════╡ │ 6 ┆ 21 ┆ null │ └─────┴─────┴──────┘
- sum_horizontal(*, ignore_nulls: bool = True) Series [source]
Sum all values horizontally across columns.
- Parameters:
- ignore_nulls
Ignore null values (default). If set to False, any null value in the input will lead to a null output.
- Returns:
- Series
A Series named "sum".
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [4.0, 5.0, 6.0], ... } ... ) >>> df.sum_horizontal() shape: (3,) Series: 'sum' [f64] [ 5.0 7.0 9.0 ]
- tail(n: int = 5) DataFrame [source]
Get the last n rows.
- Parameters:
- n
Number of rows to return. If a negative value is passed, return all rows except the first abs(n).
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 4, 5], ... "bar": [6, 7, 8, 9, 10], ... "ham": ["a", "b", "c", "d", "e"], ... } ... ) >>> df.tail(3) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 3 ┆ 8 ┆ c │ │ 4 ┆ 9 ┆ d │ │ 5 ┆ 10 ┆ e │ └─────┴─────┴─────┘
Pass a negative value to get all rows except the first abs(n).
>>> df.tail(-3) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 4 ┆ 9 ┆ d │ │ 5 ┆ 10 ┆ e │ └─────┴─────┴─────┘
- to_arrow(
- *,
- compat_level: CompatLevel | None = None,
Collect the underlying arrow arrays in an Arrow Table.
This operation is mostly zero copy.
- Data types that do copy:
CategoricalType
- Parameters:
- compat_level
Use a specific compatibility level when exporting Polars’ internal data structures.
Examples
>>> df = pl.DataFrame( ... {"foo": [1, 2, 3, 4, 5, 6], "bar": ["a", "b", "c", "d", "e", "f"]} ... ) >>> df.to_arrow() pyarrow.Table foo: int64 bar: large_string ---- foo: [[1,2,3,4,5,6]] bar: [["a","b","c","d","e","f"]]
- to_dict(*, as_series: bool = True) dict[str, Series] | dict[str, list[Any]] [source]
Convert DataFrame to a dictionary mapping column name to values.
- Parameters:
- as_series
True -> Values are Series
False -> Values are List[Any]
Examples
>>> df = pl.DataFrame( ... { ... "A": [1, 2, 3, 4, 5], ... "fruits": ["banana", "banana", "apple", "apple", "banana"], ... "B": [5, 4, 3, 2, 1], ... "cars": ["beetle", "audi", "beetle", "beetle", "beetle"], ... "optional": [28, 300, None, 2, -30], ... } ... ) >>> df shape: (5, 5) ┌─────┬────────┬─────┬────────┬──────────┐ │ A ┆ fruits ┆ B ┆ cars ┆ optional │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ i64 ┆ str ┆ i64 │ ╞═════╪════════╪═════╪════════╪══════════╡ │ 1 ┆ banana ┆ 5 ┆ beetle ┆ 28 │ │ 2 ┆ banana ┆ 4 ┆ audi ┆ 300 │ │ 3 ┆ apple ┆ 3 ┆ beetle ┆ null │ │ 4 ┆ apple ┆ 2 ┆ beetle ┆ 2 │ │ 5 ┆ banana ┆ 1 ┆ beetle ┆ -30 │ └─────┴────────┴─────┴────────┴──────────┘ >>> df.to_dict(as_series=False) {'A': [1, 2, 3, 4, 5], 'fruits': ['banana', 'banana', 'apple', 'apple', 'banana'], 'B': [5, 4, 3, 2, 1], 'cars': ['beetle', 'audi', 'beetle', 'beetle', 'beetle'], 'optional': [28, 300, None, 2, -30]} >>> df.to_dict(as_series=True) {'A': shape: (5,) Series: 'A' [i64] [ 1 2 3 4 5 ], 'fruits': shape: (5,) Series: 'fruits' [str] [ "banana" "banana" "apple" "apple" "banana" ], 'B': shape: (5,) Series: 'B' [i64] [ 5 4 3 2 1 ], 'cars': shape: (5,) Series: 'cars' [str] [ "beetle" "audi" "beetle" "beetle" "beetle" ], 'optional': shape: (5,) Series: 'optional' [i64] [ 28 300 null 2 -30 ]}
- to_dicts() list[dict[str, Any]] [source]
Convert every row to a dictionary of Python-native values.
Notes
If you have ns-precision temporal values you should be aware that Python natively only supports up to μs-precision; ns-precision values will be truncated to microseconds on conversion to Python. If this matters to your use-case you should export to a different format (such as Arrow or NumPy).
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> df.to_dicts() [{'foo': 1, 'bar': 4}, {'foo': 2, 'bar': 5}, {'foo': 3, 'bar': 6}]
- to_dummies(
- columns: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None,
- *,
- separator: str = '_',
- drop_first: bool = False,
Convert categorical variables into dummy/indicator variables.
- Parameters:
- columns
Column name(s) or selector(s) that should be converted to dummy variables. If set to None (default), convert all columns.
- separator
Separator/delimiter used when generating column names.
- drop_first
Remove the first category from the variables being encoded.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2], ... "bar": [3, 4], ... "ham": ["a", "b"], ... } ... ) >>> df.to_dummies() shape: (2, 6) ┌───────┬───────┬───────┬───────┬───────┬───────┐ │ foo_1 ┆ foo_2 ┆ bar_3 ┆ bar_4 ┆ ham_a ┆ ham_b │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ u8 ┆ u8 ┆ u8 ┆ u8 ┆ u8 ┆ u8 │ ╞═══════╪═══════╪═══════╪═══════╪═══════╪═══════╡ │ 1 ┆ 0 ┆ 1 ┆ 0 ┆ 1 ┆ 0 │ │ 0 ┆ 1 ┆ 0 ┆ 1 ┆ 0 ┆ 1 │ └───────┴───────┴───────┴───────┴───────┴───────┘
>>> df.to_dummies(drop_first=True) shape: (2, 3) ┌───────┬───────┬───────┐ │ foo_2 ┆ bar_4 ┆ ham_b │ │ --- ┆ --- ┆ --- │ │ u8 ┆ u8 ┆ u8 │ ╞═══════╪═══════╪═══════╡ │ 0 ┆ 0 ┆ 0 │ │ 1 ┆ 1 ┆ 1 │ └───────┴───────┴───────┘
>>> import polars.selectors as cs >>> df.to_dummies(cs.integer(), separator=":") shape: (2, 5) ┌───────┬───────┬───────┬───────┬─────┐ │ foo:1 ┆ foo:2 ┆ bar:3 ┆ bar:4 ┆ ham │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ u8 ┆ u8 ┆ u8 ┆ u8 ┆ str │ ╞═══════╪═══════╪═══════╪═══════╪═════╡ │ 1 ┆ 0 ┆ 1 ┆ 0 ┆ a │ │ 0 ┆ 1 ┆ 0 ┆ 1 ┆ b │ └───────┴───────┴───────┴───────┴─────┘
>>> df.to_dummies(cs.integer(), drop_first=True, separator=":") shape: (2, 3) ┌───────┬───────┬─────┐ │ foo:2 ┆ bar:4 ┆ ham │ │ --- ┆ --- ┆ --- │ │ u8 ┆ u8 ┆ str │ ╞═══════╪═══════╪═════╡ │ 0 ┆ 0 ┆ a │ │ 1 ┆ 1 ┆ b │ └───────┴───────┴─────┘
- to_init_repr(n: int = 1000) str [source]
Convert DataFrame to instantiable string representation.
- Parameters:
- n
Only use first n rows.
Examples
>>> df = pl.DataFrame( ... [ ... pl.Series("foo", [1, 2, 3], dtype=pl.UInt8), ... pl.Series("bar", [6.0, 7.0, 8.0], dtype=pl.Float32), ... pl.Series("ham", ["a", "b", "c"], dtype=pl.String), ... ] ... ) >>> print(df.to_init_repr()) pl.DataFrame( [ pl.Series('foo', [1, 2, 3], dtype=pl.UInt8), pl.Series('bar', [6.0, 7.0, 8.0], dtype=pl.Float32), pl.Series('ham', ['a', 'b', 'c'], dtype=pl.String), ] )
>>> df_from_str_repr = eval(df.to_init_repr()) >>> df_from_str_repr shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ u8 ┆ f32 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6.0 ┆ a │ │ 2 ┆ 7.0 ┆ b │ │ 3 ┆ 8.0 ┆ c │ └─────┴─────┴─────┘
- to_jax(
- return_type: JaxExportType = 'array',
- *,
- device: jax.Device | str | None = None,
- label: str | Expr | Sequence[str | Expr] | None = None,
- features: str | Expr | Sequence[str | Expr] | None = None,
- dtype: PolarsDataType | None = None,
- order: IndexOrder = 'fortran',
Convert DataFrame to a Jax Array, or dict of Jax Arrays.
Added in version 0.20.27.
Warning
This functionality is currently considered unstable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- return_type{“array”, “dict”}
Set return type; a Jax Array, or dict of Jax Arrays.
- device
Specify the jax Device on which the array will be created; can provide a string (such as “cpu”, “gpu”, or “tpu”) in which case the device is retrieved as jax.devices(string)[0]. For more specific control you can supply the instantiated Device directly. If None, arrays are created on the default device.
- label
One or more column names, expressions, or selectors that label the feature data; results in a {"label": ..., "features": ...} dict being returned when return_type is “dict” instead of a {"col": array, } dict.
- features
One or more column names, expressions, or selectors that contain the feature data; if omitted, all columns that are not designated as part of the label are used. Only applies when return_type is “dict”.
- dtype
Unify the dtype of all returned arrays; this casts any column that is not already of the required dtype before converting to Array. Note that export will be single-precision (32bit) unless the Jax config/environment directs otherwise (eg: “jax_enable_x64” was set True in the config object at startup, or “JAX_ENABLE_X64” is set to “1” in the environment).
- order{“c”, “fortran”}
The index order of the returned Jax array, either C-like (row-major) or Fortran-like (column-major).
Examples
>>> df = pl.DataFrame( ... { ... "lbl": [0, 1, 2, 3], ... "feat1": [1, 0, 0, 1], ... "feat2": [1.5, -0.5, 0.0, -2.25], ... } ... )
Standard return type (2D Array), on the standard device:
>>> df.to_jax() Array([[ 0. , 1. , 1.5 ], [ 1. , 0. , -0.5 ], [ 2. , 0. , 0. ], [ 3. , 1. , -2.25]], dtype=float32)
Create the Array on the default GPU device:
>>> a = df.to_jax(device="gpu") >>> a.device() GpuDevice(id=0, process_index=0)
Create the Array on a specific GPU device:
>>> gpu_device = jax.devices("gpu")[1] >>> a = df.to_jax(device=gpu_device) >>> a.device() GpuDevice(id=1, process_index=0)
As a dictionary of individual Arrays:
>>> df.to_jax("dict") {'lbl': Array([0, 1, 2, 3], dtype=int32), 'feat1': Array([1, 0, 0, 1], dtype=int32), 'feat2': Array([ 1.5 , -0.5 , 0. , -2.25], dtype=float32)}
As a “label” and “features” dictionary; note that as “features” is not declared, it defaults to all the columns that are not in “label”:
>>> df.to_jax("dict", label="lbl") {'label': Array([[0], [1], [2], [3]], dtype=int32), 'features': Array([[ 1. , 1.5 ], [ 0. , -0.5 ], [ 0. , 0. ], [ 1. , -2.25]], dtype=float32)}
As a “label” and “features” dictionary where each is designated using a col or selector expression (which can also be used to cast the data if the label and features are better-represented with different dtypes):
>>> import polars.selectors as cs >>> df.to_jax( ... return_type="dict", ... features=cs.float(), ... label=pl.col("lbl").cast(pl.UInt8), ... ) {'label': Array([[0], [1], [2], [3]], dtype=uint8), 'features': Array([[ 1.5 ], [-0.5 ], [ 0. ], [-2.25]], dtype=float32)}
- to_numpy(
- *,
- order: IndexOrder = 'fortran',
- writable: bool = False,
- allow_copy: bool = True,
- structured: bool = False,
- use_pyarrow: bool | None = None,
Convert this DataFrame to a NumPy ndarray.
This operation copies data only when necessary. The conversion is zero copy when all of the following hold:
The DataFrame is fully contiguous in memory, with all Series back-to-back and all Series consisting of a single chunk.
The data type is an integer or float.
The DataFrame contains no null values.
The order parameter is set to fortran (default).
The writable parameter is set to False (default).
- Parameters:
- order
The index order of the returned NumPy array, either C-like or Fortran-like. In general, using the Fortran-like index order is faster. However, the C-like order might be more appropriate to use for downstream applications to prevent cloning data, e.g. when reshaping into a one-dimensional array.
- writable
Ensure the resulting array is writable. This will force a copy of the data if the array was created without copy, as the underlying Arrow data is immutable.
- allow_copy
Allow memory to be copied to perform the conversion. If set to False, causes conversions that are not zero-copy to fail.
- structured
Return a structured array with a data type that corresponds to the DataFrame schema. If set to False (default), a 2D ndarray is returned instead.
- use_pyarrow
Use PyArrow for the conversion to NumPy if necessary.
Deprecated since version 0.20.28: Polars now uses its native engine by default for conversion to NumPy.
Examples
Numeric data without nulls can be converted without copying data in some cases. The resulting array will not be writable.
>>> df = pl.DataFrame({"a": [1, 2, 3]}) >>> arr = df.to_numpy() >>> arr array([[1], [2], [3]]) >>> arr.flags.writeable False
Set writable=True to force data copy to make the array writable.
>>> df.to_numpy(writable=True).flags.writeable True
If the DataFrame contains different numeric data types, the resulting data type will be the supertype. This requires data to be copied. Integer types with nulls are cast to a float type with nan representing a null value.
>>> df = pl.DataFrame({"a": [1, 2, None], "b": [4.0, 5.0, 6.0]}) >>> df.to_numpy() array([[ 1., 4.], [ 2., 5.], [nan, 6.]])
Set allow_copy=False to raise an error if data would be copied.
>>> df.to_numpy(allow_copy=False) Traceback (most recent call last): ... RuntimeError: copy not allowed: cannot convert to a NumPy array without copying data
Polars defaults to F-contiguous order. Use order="c" to force the resulting array to be C-contiguous.
>>> df.to_numpy(order="c").flags.c_contiguous True
DataFrames with mixed types will result in an array with an object dtype.
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.5, 7.0, 8.5], ... "ham": ["a", "b", "c"], ... }, ... schema_overrides={"foo": pl.UInt8, "bar": pl.Float32}, ... ) >>> df.to_numpy() array([[1, 6.5, 'a'], [2, 7.0, 'b'], [3, 8.5, 'c']], dtype=object)
Set structured=True to convert to a structured array, which can better preserve individual column data such as name and data type.
>>> df.to_numpy(structured=True) array([(1, 6.5, 'a'), (2, 7. , 'b'), (3, 8.5, 'c')], dtype=[('foo', 'u1'), ('bar', '<f4'), ('ham', '<U1')])
- to_pandas( ) DataFrame [source]
Convert this DataFrame to a pandas DataFrame.
This operation copies data if use_pyarrow_extension_array is not enabled.
- Parameters:
- use_pyarrow_extension_array
Use PyArrow-backed extension arrays instead of NumPy arrays for the columns of the pandas DataFrame. This allows zero copy operations and preservation of null values. Subsequent operations on the resulting pandas DataFrame may trigger conversion to NumPy if those operations are not supported by PyArrow compute functions.
- **kwargs
Additional keyword arguments to be passed to pyarrow.Table.to_pandas().
- Returns:
Notes
This operation requires that both pandas and pyarrow are installed.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.to_pandas() foo bar ham 0 1 6.0 a 1 2 7.0 b 2 3 8.0 c
Null values in numeric columns are converted to NaN.
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, None], ... "bar": [6.0, None, 8.0], ... "ham": [None, "b", "c"], ... } ... ) >>> df.to_pandas() foo bar ham 0 1.0 6.0 None 1 2.0 NaN b 2 NaN 8.0 c
Pass use_pyarrow_extension_array=True to get a pandas DataFrame with columns backed by PyArrow extension arrays. This will preserve null values.
>>> df.to_pandas(use_pyarrow_extension_array=True) foo bar ham 0 1 6.0 <NA> 1 2 <NA> b 2 <NA> 8.0 c >>> _.dtypes foo int64[pyarrow] bar double[pyarrow] ham large_string[pyarrow] dtype: object
- to_series(index: int = 0) Series [source]
Select column as Series at index location.
- Parameters:
- index
Location of selection.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.to_series(1) shape: (3,) Series: 'bar' [i64] [ 6 7 8 ]
- to_struct(name: str = '') Series [source]
Convert a DataFrame to a Series of type Struct.
- Parameters:
- name
Name for the struct Series
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4, 5], ... "b": ["one", "two", "three", "four", "five"], ... } ... ) >>> df.to_struct("nums") shape: (5,) Series: 'nums' [struct[2]] [ {1,"one"} {2,"two"} {3,"three"} {4,"four"} {5,"five"} ]
- to_torch(
- return_type: TorchExportType = 'tensor',
- *,
- label: str | Expr | Sequence[str | Expr] | None = None,
- features: str | Expr | Sequence[str | Expr] | None = None,
- dtype: PolarsDataType | None = None,
Convert DataFrame to a PyTorch Tensor, Dataset, or dict of Tensors.
Added in version 0.20.23.
Warning
This functionality is currently considered unstable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- return_type{“tensor”, “dataset”, “dict”}
Set return type; a PyTorch Tensor, PolarsDataset (a frame-specialized TensorDataset), or dict of Tensors.
- label
One or more column names, expressions, or selectors that label the feature data; when return_type is “dataset”, the PolarsDataset will return (features, label) tensor tuples for each row. Otherwise, it returns (features,) tensor tuples where the feature contains all the row data.
- features
One or more column names, expressions, or selectors that contain the feature data; if omitted, all columns that are not designated as part of the label are used.
- dtype
Unify the dtype of all returned tensors; this casts any column that is not of the required dtype before converting to Tensor. This includes the label column unless the label is an expression (such as pl.col("label_column").cast(pl.Int16)).
Examples
>>> df = pl.DataFrame( ... { ... "lbl": [0, 1, 2, 3], ... "feat1": [1, 0, 0, 1], ... "feat2": [1.5, -0.5, 0.0, -2.25], ... } ... )
Standard return type (Tensor), with f32 supertype:
>>> df.to_torch(dtype=pl.Float32) tensor([[ 0.0000, 1.0000, 1.5000], [ 1.0000, 0.0000, -0.5000], [ 2.0000, 0.0000, 0.0000], [ 3.0000, 1.0000, -2.2500]])
As a dictionary of individual Tensors:
>>> df.to_torch("dict") {'lbl': tensor([0, 1, 2, 3]), 'feat1': tensor([1, 0, 0, 1]), 'feat2': tensor([ 1.5000, -0.5000, 0.0000, -2.2500], dtype=torch.float64)}
As a “label” and “features” dictionary; note that as “features” is not declared, it defaults to all the columns that are not in “label”:
>>> df.to_torch("dict", label="lbl", dtype=pl.Float32) {'label': tensor([[0.], [1.], [2.], [3.]]), 'features': tensor([[ 1.0000, 1.5000], [ 0.0000, -0.5000], [ 0.0000, 0.0000], [ 1.0000, -2.2500]])}
As a PolarsDataset, with f64 supertype:
>>> ds = df.to_torch("dataset", dtype=pl.Float64) >>> ds[3] (tensor([ 3.0000, 1.0000, -2.2500], dtype=torch.float64),) >>> ds[:2] (tensor([[ 0.0000, 1.0000, 1.5000], [ 1.0000, 0.0000, -0.5000]], dtype=torch.float64),) >>> ds[[0, 3]] (tensor([[ 0.0000, 1.0000, 1.5000], [ 3.0000, 1.0000, -2.2500]], dtype=torch.float64),)
As a convenience the PolarsDataset can opt in to half-precision data for experimentation (usually this would be set on the model/pipeline):
>>> list(ds.half()) [(tensor([0.0000, 1.0000, 1.5000], dtype=torch.float16),), (tensor([ 1.0000, 0.0000, -0.5000], dtype=torch.float16),), (tensor([2., 0., 0.], dtype=torch.float16),), (tensor([ 3.0000, 1.0000, -2.2500], dtype=torch.float16),)]
Pass PolarsDataset to a DataLoader, designating the label:
>>> from torch.utils.data import DataLoader >>> ds = df.to_torch("dataset", label="lbl") >>> dl = DataLoader(ds, batch_size=2) >>> batches = list(dl) >>> batches[0] [tensor([[ 1.0000, 1.5000], [ 0.0000, -0.5000]], dtype=torch.float64), tensor([0, 1])]
Note that labels can be given as expressions, allowing them to have a dtype independent of the feature columns (multi-column labels are supported).
>>> ds = df.to_torch( ... return_type="dataset", ... dtype=pl.Float32, ... label=pl.col("lbl").cast(pl.Int16), ... ) >>> ds[:2] (tensor([[ 1.0000, 1.5000], [ 0.0000, -0.5000]]), tensor([0, 1], dtype=torch.int16))
Easily integrate with (for example) scikit-learn and other datasets:
>>> from sklearn.datasets import fetch_california_housing >>> housing = fetch_california_housing() >>> df = pl.DataFrame( ... data=housing.data, ... schema=housing.feature_names, ... ).with_columns( ... Target=housing.target, ... ) >>> train = df.to_torch("dataset", label="Target") >>> loader = DataLoader( ... train, ... shuffle=True, ... batch_size=64, ... )
- top_k( ) DataFrame [source]
Return the k largest rows.
Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort() after this function if you wish the output to be sorted.
- Parameters:
- k
Number of rows to return.
- by
Column(s) used to determine the top rows. Accepts expression input. Strings are parsed as column names.
- reverse
Consider the k smallest elements of the by column(s) (instead of the k largest). This can be specified per column by passing a sequence of booleans.
Examples
>>> df = pl.DataFrame( ... { ... "a": ["a", "b", "a", "b", "b", "c"], ... "b": [2, 1, 1, 3, 2, 1], ... } ... )
Get the rows which contain the 4 largest values in column b.
>>> df.top_k(4, by="b") shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ b ┆ 3 │ │ a ┆ 2 │ │ b ┆ 2 │ │ b ┆ 1 │ └─────┴─────┘
Get the rows which contain the 4 largest values when sorting on column b and a.
>>> df.top_k(4, by=["b", "a"]) shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ b ┆ 3 │ │ b ┆ 2 │ │ a ┆ 2 │ │ c ┆ 1 │ └─────┴─────┘
- transpose(
- *,
- include_header: bool = False,
- header_name: str = 'column',
- column_names: str | Iterable[str] | None = None,
Transpose a DataFrame over the diagonal.
- Parameters:
- include_header
If set, the column names will be added as the first column.
- header_name
If include_header is set, this determines the name of the column that will be inserted.
- column_names
Optional iterable yielding strings or a string naming an existing column. These will name the value (non-header) columns in the transposed data.
- Returns:
- DataFrame
Notes
This is a very expensive operation; consider whether the same result can be achieved without transposing.
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) >>> df.transpose(include_header=True) shape: (2, 4) ┌────────┬──────────┬──────────┬──────────┐ │ column ┆ column_0 ┆ column_1 ┆ column_2 │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ i64 │ ╞════════╪══════════╪══════════╪══════════╡ │ a ┆ 1 ┆ 2 ┆ 3 │ │ b ┆ 4 ┆ 5 ┆ 6 │ └────────┴──────────┴──────────┴──────────┘
Replace the auto-generated column names with a list
>>> df.transpose(include_header=False, column_names=["x", "y", "z"]) shape: (2, 3) ┌─────┬─────┬─────┐ │ x ┆ y ┆ z │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ 1 ┆ 2 ┆ 3 │ │ 4 ┆ 5 ┆ 6 │ └─────┴─────┴─────┘
Include the header as a separate column
>>> df.transpose( ... include_header=True, header_name="foo", column_names=["x", "y", "z"] ... ) shape: (2, 4) ┌─────┬─────┬─────┬─────┐ │ foo ┆ x ┆ y ┆ z │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╪═════╡ │ a ┆ 1 ┆ 2 ┆ 3 │ │ b ┆ 4 ┆ 5 ┆ 6 │ └─────┴─────┴─────┴─────┘
Replace the auto-generated column with column names from a generator function
>>> def name_generator(): ... base_name = "my_column_" ... count = 0 ... while True: ... yield f"{base_name}{count}" ... count += 1 >>> df.transpose(include_header=False, column_names=name_generator()) shape: (2, 3) ┌─────────────┬─────────────┬─────────────┐ │ my_column_0 ┆ my_column_1 ┆ my_column_2 │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════════════╪═════════════╪═════════════╡ │ 1 ┆ 2 ┆ 3 │ │ 4 ┆ 5 ┆ 6 │ └─────────────┴─────────────┴─────────────┘
Use an existing column as the new column names
>>> df = pl.DataFrame(dict(id=["i", "j", "k"], a=[1, 2, 3], b=[4, 5, 6])) >>> df.transpose(column_names="id") shape: (2, 3) ┌─────┬─────┬─────┐ │ i ┆ j ┆ k │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ 1 ┆ 2 ┆ 3 │ │ 4 ┆ 5 ┆ 6 │ └─────┴─────┴─────┘ >>> df.transpose(include_header=True, header_name="new_id", column_names="id") shape: (2, 4) ┌────────┬─────┬─────┬─────┐ │ new_id ┆ i ┆ j ┆ k │ │ --- ┆ --- ┆ --- ┆ --- │<