Version 1
Breaking changes
Properly apply strict parameter in Series constructor
The behavior of the Series constructor has been updated. Generally, it will be more strict, unless the user passes strict=False.
Strict construction is more efficient than non-strict construction, so make sure to pass values of the same data type to the constructor for the best performance.
Example
Before:
>>> s = pl.Series([1, 2, 3.5])
shape: (3,)
Series: '' [f64]
[
1.0
2.0
3.5
]
>>> s = pl.Series([1, 2, 3.5], strict=False)
shape: (3,)
Series: '' [i64]
[
1
2
null
]
>>> s = pl.Series([1, 2, 3.5], strict=False, dtype=pl.Int8)
shape: (3,)
Series: '' [i8]
[
1
2
null
]
After:
>>> s = pl.Series([1, 2, 3.5])
Traceback (most recent call last):
...
TypeError: unexpected value while building Series of type Int64; found value of type Float64: 3.5
Hint: Try setting `strict=False` to allow passing data with mixed types.
>>> s = pl.Series([1, 2, 3.5], strict=False)
shape: (3,)
Series: '' [f64]
[
1.0
2.0
3.5
]
>>> s = pl.Series([1, 2, 3.5], strict=False, dtype=pl.Int8)
shape: (3,)
Series: '' [i8]
[
1
2
3
]
Change data orientation inference logic for DataFrame construction
Polars no longer inspects data types to infer the orientation of the data passed to the DataFrame constructor. Data orientation is inferred based on the data and schema dimensions.
Additionally, a warning is raised whenever row orientation is inferred. Because of some confusing edge cases, users should pass orient="row" to make explicit that their input is row-based.
Example
Before:
>>> data = [[1, "a"], [2, "b"]]
>>> pl.DataFrame(data)
shape: (2, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ --- ┆ --- │
│ i64 ┆ str │
╞══════════╪══════════╡
│ 1 ┆ a │
│ 2 ┆ b │
└──────────┴──────────┘
After:
>>> pl.DataFrame(data)
Traceback (most recent call last):
...
TypeError: unexpected value while building Series of type Int64; found value of type String: "a"
Hint: Try setting `strict=False` to allow passing data with mixed types.
Use instead:
>>> pl.DataFrame(data, orient="row")
shape: (2, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ --- ┆ --- │
│ i64 ┆ str │
╞══════════╪══════════╡
│ 1 ┆ a │
│ 2 ┆ b │
└──────────┴──────────┘
Consistently convert to given time zone in Series constructor
Danger
This change may silently impact the results of your pipelines. If you work with time zones, please make sure to account for this change.
Handling of time zone information in the Series and DataFrame constructors was inconsistent. Row-wise construction would convert to the given time zone, while column-wise construction would replace the time zone. The inconsistency has been fixed by always converting to the time zone specified in the data type.
Example
Before:
>>> from datetime import datetime
>>> pl.Series([datetime(2020, 1, 1)], dtype=pl.Datetime('us', 'Europe/Amsterdam'))
shape: (1,)
Series: '' [datetime[μs, Europe/Amsterdam]]
[
2020-01-01 00:00:00 CET
]
After:
>>> from datetime import datetime
>>> pl.Series([datetime(2020, 1, 1)], dtype=pl.Datetime('us', 'Europe/Amsterdam'))
shape: (1,)
Series: '' [datetime[μs, Europe/Amsterdam]]
[
2020-01-01 01:00:00 CET
]
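If your pipeline relied on the previous 'replace' semantics, one possible workaround (a sketch, not an official recipe) is to construct a naive datetime Series and attach the time zone afterwards with dt.replace_time_zone:
>>> pl.Series([datetime(2020, 1, 1)]).dt.replace_time_zone("Europe/Amsterdam")
shape: (1,)
Series: '' [datetime[μs, Europe/Amsterdam]]
[
2020-01-01 00:00:00 CET
]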
Update some error types to more appropriate variants
We have updated many error types to more accurately represent the problem. Most commonly, ComputeError types were changed to InvalidOperationError or SchemaError.
Example
Before:
>>> s = pl.Series("a", [100, 200, 300])
>>> s.cast(pl.UInt8)
Traceback (most recent call last):
...
polars.exceptions.ComputeError: conversion from `i64` to `u8` failed in column 'a' for 1 out of 3 values: [300]
After:
>>> s.cast(pl.UInt8)
Traceback (most recent call last):
...
polars.exceptions.InvalidOperationError: conversion from `i64` to `u8` failed in column 'a' for 1 out of 3 values: [300]
Update read/scan_parquet to disable Hive partitioning by default for file inputs
Parquet reading functions now also support directory inputs. Hive partitioning is enabled by default for directories, but is now disabled by default for file inputs. File inputs include single files, globs, and lists of files. Explicitly pass hive_partitioning=True to restore previous behavior.
Example
Before:
>>> pl.read_parquet("dataset/a=1/foo.parquet")
shape: (2, 2)
┌─────┬─────┐
│ a ┆ x │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1 ┆ 1.0 │
│ 1 ┆ 2.0 │
└─────┴─────┘
After:
>>> pl.read_parquet("dataset/a=1/foo.parquet")
shape: (2, 1)
┌─────┐
│ x │
│ --- │
│ f64 │
╞═════╡
│ 1.0 │
│ 2.0 │
└─────┘
>>> pl.read_parquet("dataset/a=1/foo.parquet", hive_partitioning=True)
shape: (2, 2)
┌─────┬─────┐
│ a ┆ x │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1 ┆ 1.0 │
│ 1 ┆ 2.0 │
└─────┴─────┘
Update reshape to return Array types instead of List types
reshape now returns an Array type instead of a List type. Users can restore the old functionality by calling .arr.to_list() on the output. Note that this is not more expensive than creating a List type directly, because reshaping into an array is essentially free.
Example
Before:
>>> s = pl.Series([1, 2, 3, 4, 5, 6])
>>> s.reshape((2, 3))
shape: (2,)
Series: '' [list[i64]]
[
[1, 2, 3]
[4, 5, 6]
]
After:
>>> s.reshape((2, 3))
shape: (2,)
Series: '' [array[i64, 3]]
[
[1, 2, 3]
[4, 5, 6]
]
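As noted above, chaining .arr.to_list() restores the previous List output:
>>> s.reshape((2, 3)).arr.to_list()
shape: (2,)
Series: '' [list[i64]]
[
[1, 2, 3]
[4, 5, 6]
]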
Read 2D NumPy arrays as Array type instead of List
The Series constructor now parses 2D NumPy arrays as an Array type rather than a List type.
Example
Before:
>>> import numpy as np
>>> arr = np.array([[1, 2], [3, 4]])
>>> pl.Series(arr)
shape: (2,)
Series: '' [list[i64]]
[
[1, 2]
[3, 4]
]
After:
>>> import numpy as np
>>> arr = np.array([[1, 2], [3, 4]])
>>> pl.Series(arr)
shape: (2,)
Series: '' [array[i64, 2]]
[
[1, 2]
[3, 4]
]
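If downstream code expects the previous List type, the same .arr.to_list() conversion shown for reshape should apply here as well:
>>> pl.Series(arr).arr.to_list()
shape: (2,)
Series: '' [list[i64]]
[
[1, 2]
[3, 4]
]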
Split replace functionality into two separate methods
The API for replace has proven confusing to many users, particularly with regard to the default argument and the resulting data type.
It has been split into two methods: replace and replace_strict. replace now always keeps the existing data type (breaking, see the example below) and is meant for replacing some values in your existing column. Its parameters default and return_dtype have been deprecated.
The new method replace_strict is meant for creating a new column, mapping some or all values of the original column, and optionally specifying a default value. If no default is provided, it raises an error if any non-null values are not mapped.
Example
Before:
>>> s = pl.Series([1, 2, 3])
>>> s.replace(1, "a")
shape: (3,)
Series: '' [str]
[
"a"
"2"
"3"
]
After:
>>> s.replace(1, "a")
Traceback (most recent call last):
...
polars.exceptions.InvalidOperationError: conversion from `str` to `i64` failed in column 'literal' for 1 out of 1 values: ["a"]
>>> s.replace_strict(1, "a", default=s)
shape: (3,)
Series: '' [str]
[
"a"
"2"
"3"
]
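For the 'create a new column' use case, a sketch of replace_strict with a complete mapping; with no default, any unmapped non-null value raises instead:
>>> s.replace_strict({1: "a", 2: "b", 3: "c"})
shape: (3,)
Series: '' [str]
[
"a"
"b"
"c"
]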
Preserve nulls in ewm_mean, ewm_std, and ewm_var
Polars will no longer forward-fill null values in ewm methods. The user can call .forward_fill() on the output to achieve the same result.
Example
Before:
>>> s = pl.Series([1, 4, None, 3])
>>> s.ewm_mean(alpha=.9, ignore_nulls=False)
shape: (4,)
Series: '' [f64]
[
1.0
3.727273
3.727273
3.007913
]
After:
>>> s.ewm_mean(alpha=.9, ignore_nulls=False)
shape: (4,)
Series: '' [f64]
[
1.0
3.727273
null
3.007913
]
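As mentioned above, chaining forward_fill on the output reproduces the previous result:
>>> s.ewm_mean(alpha=.9, ignore_nulls=False).forward_fill()
shape: (4,)
Series: '' [f64]
[
1.0
3.727273
3.727273
3.007913
]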
Update clip to no longer propagate nulls in the given bounds
Null values in the bounds no longer set the value to null; instead, the original value is retained.
Before
>>> df = pl.DataFrame({"a": [0, 1, 2], "min": [1, None, 1]})
>>> df.select(pl.col("a").clip("min"))
shape: (3, 1)
┌──────┐
│ a │
│ --- │
│ i64 │
╞══════╡
│ 1 │
│ null │
│ 2 │
└──────┘
After
>>> df.select(pl.col("a").clip("min"))
shape: (3, 1)
┌──────┐
│ a │
│ --- │
│ i64 │
╞══════╡
│ 1 │
│ 1 │
│ 2 │
└──────┘
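If you need the previous null-propagating behavior, one possible workaround (a sketch using when/then/otherwise rather than a dedicated option) is to null out rows where the bound is null:
>>> df.select(
...     pl.when(pl.col("min").is_null())
...     .then(None)
...     .otherwise(pl.col("a").clip("min"))
...     .alias("a")
... )
shape: (3, 1)
┌──────┐
│ a │
│ --- │
│ i64 │
╞══════╡
│ 1 │
│ null │
│ 2 │
└──────┘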
Change str.to_datetime to default to microsecond precision for format specifiers "%f" and "%.f"
In .str.to_datetime, when the format contains %.f, the resulting datatype previously defaulted to nanosecond precision. This has been changed to microsecond precision.
Example
Before
>>> s = pl.Series(["2022-08-31 00:00:00.123456789"])
>>> s.str.to_datetime(format="%Y-%m-%d %H:%M:%S%.f")
shape: (1,)
Series: '' [datetime[ns]]
[
2022-08-31 00:00:00.123456789
]
After
>>> s.str.to_datetime(format="%Y-%m-%d %H:%M:%S%.f")
shape: (1,)
Series: '' [datetime[us]]
[
2022-08-31 00:00:00.123456
]
Update resulting column names in pivot when pivoting by multiple values
In DataFrame.pivot, when specifying multiple values columns, the resulting column names would redundantly include the name of the columns column. This has been addressed.
Example
Before:
>>> df = pl.DataFrame(
... {
... "name": ["Cady", "Cady", "Karen", "Karen"],
... "subject": ["maths", "physics", "maths", "physics"],
... "test_1": [98, 99, 61, 58],
... "test_2": [100, 100, 60, 60],
... }
... )
>>> df.pivot(index='name', columns='subject', values=['test_1', 'test_2'])
shape: (2, 5)
┌───────┬──────────────────────┬────────────────────────┬──────────────────────┬────────────────────────┐
│ name ┆ test_1_subject_maths ┆ test_1_subject_physics ┆ test_2_subject_maths ┆ test_2_subject_physics │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪══════════════════════╪════════════════════════╪══════════════════════╪════════════════════════╡
│ Cady ┆ 98 ┆ 99 ┆ 100 ┆ 100 │
│ Karen ┆ 61 ┆ 58 ┆ 60 ┆ 60 │
└───────┴──────────────────────┴────────────────────────┴──────────────────────┴────────────────────────┘
After:
>>> df = pl.DataFrame(
... {
... "name": ["Cady", "Cady", "Karen", "Karen"],
... "subject": ["maths", "physics", "maths", "physics"],
... "test_1": [98, 99, 61, 58],
... "test_2": [100, 100, 60, 60],
... }
... )
>>> df.pivot('subject', index='name')
shape: (2, 5)
┌───────┬──────────────┬────────────────┬──────────────┬────────────────┐
│ name ┆ test_1_maths ┆ test_1_physics ┆ test_2_maths ┆ test_2_physics │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪══════════════╪════════════════╪══════════════╪════════════════╡
│ Cady ┆ 98 ┆ 99 ┆ 100 ┆ 100 │
│ Karen ┆ 61 ┆ 58 ┆ 60 ┆ 60 │
└───────┴──────────────┴────────────────┴──────────────┴────────────────┘
Note that the function signature has also changed: columns has been renamed to on, and is now the first positional argument. index and values are both optional. If index is not specified, it will use all columns not specified in on and values. If values is not specified, it will use all columns not specified in on and index.
Support Decimal types by default when converting from Arrow
Conversion from Arrow now always converts Decimals into Polars Decimals, rather than casting to Float64. Config.activate_decimals has been removed.
Example
Before:
>>> from decimal import Decimal as D
>>> import pyarrow as pa
>>> arr = pa.array([D("1.01"), D("2.25")])
>>> pl.from_arrow(arr)
shape: (2,)
Series: '' [f64]
[
1.01
2.25
]
After:
>>> pl.from_arrow(arr)
shape: (2,)
Series: '' [decimal[3,2]]
[
1.01
2.25
]
Remove serde functionality from pl.read_json and DataFrame.write_json
pl.read_json no longer supports reading JSON files produced by DataFrame.serialize. Users should use pl.DataFrame.deserialize instead.
DataFrame.write_json now only writes row-oriented JSON. The parameters row_oriented and pretty have been removed. Users should use DataFrame.serialize to serialize a DataFrame.
Example - write_json
Before:
>>> df = pl.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
>>> df.write_json()
'{"columns":[{"name":"a","datatype":"Int64","bit_settings":"","values":[1,2]},{"name":"b","datatype":"Float64","bit_settings":"","values":[3.0,4.0]}]}'
After:
>>> df.write_json() # Same behavior as previously `df.write_json(row_oriented=True)`
'[{"a":1,"b":3.0},{"a":2,"b":4.0}]'
Example - read_json
Before:
>>> import io
>>> df_ser = '{"columns":[{"name":"a","datatype":"Int64","bit_settings":"","values":[1,2]},{"name":"b","datatype":"Float64","bit_settings":"","values":[3.0,4.0]}]}'
>>> pl.read_json(io.StringIO(df_ser))
shape: (2, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1 ┆ 3.0 │
│ 2 ┆ 4.0 │
└─────┴─────┘
After:
>>> pl.read_json(io.StringIO(df_ser)) # Format no longer supported: data is treated as a single row
shape: (1, 1)
┌─────────────────────────────────┐
│ columns │
│ --- │
│ list[struct[4]] │
╞═════════════════════════════════╡
│ [{"a","Int64","",[1.0, 2.0]}, … │
└─────────────────────────────────┘
Use instead:
>>> pl.DataFrame.deserialize(io.StringIO(df_ser))
shape: (2, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1 ┆ 3.0 │
│ 2 ┆ 4.0 │
└─────┴─────┘
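For round-tripping a DataFrame, serialize and deserialize are the supported pair; note that serialize now defaults to a binary format (see the serialization section below). A minimal sketch:
>>> df = pl.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
>>> serialized = df.serialize()  # binary by default
>>> pl.DataFrame.deserialize(io.BytesIO(serialized))
shape: (2, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1 ┆ 3.0 │
│ 2 ┆ 4.0 │
└─────┴─────┘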
Series.equals no longer checks names by default
Previously, Series.equals would return False if the Series names didn't match. The method no longer checks the names by default. The previous behavior can be retained by setting check_names=True.
Example
Before:
>>> s1 = pl.Series("foo", [1, 2, 3])
>>> s2 = pl.Series("bar", [1, 2, 3])
>>> s1.equals(s2)
False
After:
>>> s1.equals(s2)
True
>>> s1.equals(s2, check_names=True)
False
Remove columns parameter from nth expression function
The columns parameter was removed in favor of treating positional inputs as additional indices. Use Expr.get instead to get the same functionality.
Example
Before:
>>> df = pl.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
>>> df.select(pl.nth(1, "a"))
shape: (1, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 2 │
└─────┘
After:
>>> df.select(pl.nth(1, "a"))
Traceback (most recent call last):
...
TypeError: argument 'indices': 'str' object cannot be interpreted as an integer
Use instead:
>>> df.select(pl.col("a").get(1))
shape: (1, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 2 │
└─────┘
Rename struct fields of rle output
The struct fields of the rle method have been renamed from lengths/values to len/value. The data type of the len field has also been updated to match the index type (previously Int32, now UInt32).
Before
>>> s = pl.Series(["a", "a", "b", "c", "c", "c"])
>>> s.rle().struct.unnest()
shape: (3, 2)
┌─────────┬────────┐
│ lengths ┆ values │
│ --- ┆ --- │
│ i32 ┆ str │
╞═════════╪════════╡
│ 2 ┆ a │
│ 1 ┆ b │
│ 3 ┆ c │
└─────────┴────────┘
After
>>> s.rle().struct.unnest()
shape: (3, 2)
┌─────┬───────┐
│ len ┆ value │
│ --- ┆ --- │
│ u32 ┆ str │
╞═════╪═══════╡
│ 2 ┆ a │
│ 1 ┆ b │
│ 3 ┆ c │
└─────┴───────┘
Update set_sorted to only accept a single column
Calling set_sorted indicates that a column is sorted. Passing multiple columns indicates that each of those columns is sorted individually. However, many users assumed this meant that the columns were sorted as a group, which led to incorrect results.
To help users avoid this pitfall, we removed the possibility of specifying multiple columns in set_sorted. To set multiple columns as sorted, simply call set_sorted multiple times.
Example
Before:
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0], "c": [9, 7, 8]})
>>> df.set_sorted("a", "b")
After:
>>> df.set_sorted("a", "b")
Traceback (most recent call last):
...
TypeError: DataFrame.set_sorted() takes 2 positional arguments but 3 were given
Use instead:
>>> df.set_sorted("a").set_sorted("b")
Default to raising on out-of-bounds indices in all get/gather operations
The default behavior was inconsistent between get and gather operations in various places. Now all such operations will raise by default. Pass null_on_oob=True to restore previous behavior.
Example
Before:
>>> s = pl.Series([[0, 1, 2], [0]])
>>> s.list.get(1)
shape: (2,)
Series: '' [i64]
[
1
null
]
After:
>>> s.list.get(1)
Traceback (most recent call last):
...
polars.exceptions.ComputeError: get index is out of bounds
Use instead:
>>> s.list.get(1, null_on_oob=True)
shape: (2,)
Series: '' [i64]
[
1
null
]
Change default engine for read_excel to "calamine"
The calamine engine (available through the fastexcel package) was added to Polars relatively recently. It's much faster than the other engines, and was already the default for xlsb and xls files. It is now the default for all Excel files.
There may be subtle differences between this engine and the previous default (xlsx2csv). One clear difference is that the calamine engine does not support the engine_options parameter. If you cannot get your desired behavior with the calamine engine, specify engine="xlsx2csv" to restore the previous behavior.
Example
Before:
>>> pl.read_excel("data.xlsx", engine_options={"skip_empty_lines": True})
After:
>>> pl.read_excel("data.xlsx", engine_options={"skip_empty_lines": True})
Traceback (most recent call last):
...
TypeError: read_excel() got an unexpected keyword argument 'skip_empty_lines'
Instead, explicitly specify the xlsx2csv engine or omit the engine_options:
>>> pl.read_excel("data.xlsx", engine="xlsx2csv", engine_options={"skip_empty_lines": True})
Remove class variables from some DataTypes
Some DataType classes had class variables. The Datetime class, for example, had time_unit and time_zone as class variables. This was unintended: these should have been instance variables. This has now been corrected.
Example
Before:
>>> dtype = pl.Datetime
>>> dtype.time_unit is None
True
After:
>>> dtype.time_unit is None
Traceback (most recent call last):
...
AttributeError: type object 'Datetime' has no attribute 'time_unit'
Use instead:
>>> getattr(dtype, "time_unit", None) is None
True
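Alternatively, instantiate the data type; instances carry these attributes as intended:
>>> dtype = pl.Datetime("us")
>>> dtype.time_unit
'us'
>>> dtype.time_zone is None
True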
Change default offset in group_by_dynamic from 'negative every' to 'zero'
This affects the start of the first window in group_by_dynamic. The new behavior should align more closely with user expectations.
Example
Before:
>>> from datetime import date
>>> df = pl.DataFrame({
... "ts": [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3)],
... "value": [1, 2, 3],
... })
>>> df.group_by_dynamic("ts", every="1d", period="2d").agg("value")
shape: (4, 2)
┌────────────┬───────────┐
│ ts ┆ value │
│ --- ┆ --- │
│ date ┆ list[i64] │
╞════════════╪═══════════╡
│ 2019-12-31 ┆ [1] │
│ 2020-01-01 ┆ [1, 2] │
│ 2020-01-02 ┆ [2, 3] │
│ 2020-01-03 ┆ [3] │
└────────────┴───────────┘
After:
>>> df.group_by_dynamic("ts", every="1d", period="2d").agg("value")
shape: (3, 2)
┌────────────┬───────────┐
│ ts ┆ value │
│ --- ┆ --- │
│ date ┆ list[i64] │
╞════════════╪═══════════╡
│ 2020-01-01 ┆ [1, 2] │
│ 2020-01-02 ┆ [2, 3] │
│ 2020-01-03 ┆ [3] │
└────────────┴───────────┘
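If you relied on the previous window starts, a sketch of restoring them for this example by passing the offset explicitly as the negative of every:
>>> df.group_by_dynamic("ts", every="1d", period="2d", offset="-1d").agg("value")
shape: (4, 2)
┌────────────┬───────────┐
│ ts ┆ value │
│ --- ┆ --- │
│ date ┆ list[i64] │
╞════════════╪═══════════╡
│ 2019-12-31 ┆ [1] │
│ 2020-01-01 ┆ [1, 2] │
│ 2020-01-02 ┆ [2, 3] │
│ 2020-01-03 ┆ [3] │
└────────────┴───────────┘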
Change default serialization format of LazyFrame/DataFrame/Expr
The only serialization format available for the serialize/deserialize methods on Polars objects was JSON. We added a more optimized binary format and made this the default. JSON serialization is still available by passing format="json".
Example
Before:
>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum()
>>> serialized = lf.serialize()
>>> serialized
'{"MapFunction":{"input":{"DataFrameScan":{"df":{"columns":[{"name":...'
>>> from io import StringIO
>>> pl.LazyFrame.deserialize(StringIO(serialized)).collect()
shape: (1, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 6 │
└─────┘
After:
>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum()
>>> serialized = lf.serialize()
>>> serialized
b'\xa1kMapFunction\xa2einput\xa1mDataFrameScan\xa4bdf...'
>>> from io import BytesIO # Note: using BytesIO instead of StringIO
>>> pl.LazyFrame.deserialize(BytesIO(serialized)).collect()
shape: (1, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 6 │
└─────┘
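The JSON representation remains available by passing format="json":
>>> lf.serialize(format="json")
'{"MapFunction":{"input":{"DataFrameScan":{"df":{"columns":[{"name":...'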
Constrain access to globals from DataFrame.sql in favor of pl.sql
The sql methods on DataFrame and LazyFrame can no longer access global variables. These methods should be used for operating on the frame itself. For global access, there is now the top-level sql function.
Example
Before:
>>> df1 = pl.DataFrame({"id1": [1, 2]})
>>> df2 = pl.DataFrame({"id2": [3, 4]})
>>> df1.sql("SELECT * FROM df1 CROSS JOIN df2")
shape: (4, 2)
┌─────┬─────┐
│ id1 ┆ id2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 3 │
│ 1 ┆ 4 │
│ 2 ┆ 3 │
│ 2 ┆ 4 │
└─────┴─────┘
After:
>>> df1.sql("SELECT * FROM df1 CROSS JOIN df2")
Traceback (most recent call last):
...
polars.exceptions.SQLInterfaceError: relation 'df1' was not found
Use instead:
>>> pl.sql("SELECT * FROM df1 CROSS JOIN df2", eager=True)
shape: (4, 2)
┌─────┬─────┐
│ id1 ┆ id2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 3 │
│ 1 ┆ 4 │
│ 2 ┆ 3 │
│ 2 ┆ 4 │
└─────┴─────┘
Remove re-export of type aliases
We have many type aliases defined in the polars.type_aliases module. Some of these were re-exported at the top level and in the polars.datatypes module. These re-exports have been removed.
We plan to add a public polars.typing module in the future with a number of curated type aliases. Until then, please define your own type aliases, or import from our polars.type_aliases module. Note that the type_aliases module is not technically public, so use it at your own risk.
Example
Before:
def foo(dtype: pl.PolarsDataType) -> None: ...
After:
PolarsDataType = pl.DataType | type[pl.DataType]
def foo(dtype: PolarsDataType) -> None: ...
Streamline optional dependency definitions in pyproject.toml
We revisited the optional dependency definitions and made some minor changes. If you were using the extras fastexcel, gevent, matplotlib, or async, this is a breaking change. Please update your Polars installation to use the new extras.
Example
Before:
pip install 'polars[fastexcel,gevent,matplotlib]'
After:
pip install 'polars[calamine,async,graph]'
Deprecations
Issue PerformanceWarning when LazyFrame properties schema/dtypes/columns/width are used
Recent improvements to the correctness of schema resolution in the lazy engine have significantly increased the cost of resolving the schema. It is no longer 'free': in complex pipelines with lazy file reading, resolving the schema can be relatively expensive.
Because of this, the schema-related properties on LazyFrame were no longer good API design. Properties represent information that is already available and just needs to be retrieved; for the LazyFrame properties, however, accessing them may carry a significant performance cost.
To solve this, we added the LazyFrame.collect_schema method, which retrieves the schema and returns a Schema object. The properties raise a PerformanceWarning and tell the user to use collect_schema instead. We chose not to deprecate the properties for now, to facilitate writing code that is generic for both DataFrames and LazyFrames.
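A short sketch of the recommended pattern; the Schema object behaves like an ordered mapping of column names to data types (exact helper methods may vary, so only the mapping interface is shown here):
>>> lf = pl.LazyFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
>>> schema = lf.collect_schema()  # resolve the schema once, then reuse it
>>> schema["a"]
Int64
>>> list(schema)
['a', 'b']
>>> len(schema)  # replaces LazyFrame.width
2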