polars.scan_pyarrow_dataset

polars.scan_pyarrow_dataset(source: pa.dataset.Dataset, *, allow_pyarrow_filter: bool = True, batch_size: int | None = None) → LazyFrame

Scan a pyarrow dataset.

This can be useful to connect to cloud or partitioned datasets.

Parameters:

source: Pyarrow dataset to scan.
allow_pyarrow_filter: Allow predicates to be pushed down to pyarrow. This can lead to different results when comparisons involve null values, because pyarrow handles them differently from Polars.
batch_size: The maximum row count for scanned pyarrow record batches.

Warning

This API is experimental and may change without it being considered a breaking change.

Notes

When using partitioning, set the appropriate partitioning option on pyarrow.dataset.dataset before passing the dataset to Polars; otherwise the partitioned-on column(s) may not be passed to Polars.

Examples

>>> import polars as pl
>>> import pyarrow.dataset as ds
>>> dset = ds.dataset("s3://my-partitioned-folder/", format="ipc")
>>> (
...     pl.scan_pyarrow_dataset(dset)
...     .filter("bools")
...     .select(["bools", "floats", "date"])
...     .collect()
... )  
shape: (1, 3)
┌───────┬────────┬────────────┐
│ bools ┆ floats ┆ date       │
│ ---   ┆ ---    ┆ ---        │
│ bool  ┆ f64    ┆ date       │
╞═══════╪════════╪════════════╡
│ true  ┆ 2.0    ┆ 1970-05-04 │
└───────┴────────┴────────────┘