polars.scan_pyarrow_dataset
- polars.scan_pyarrow_dataset(source, *, allow_pyarrow_filter=True, batch_size=None) → LazyFrame
Scan a pyarrow dataset.
This can be useful to connect to cloud or partitioned datasets.
- Parameters:
- source
Pyarrow dataset to scan.
- allow_pyarrow_filter
Allow predicates to be pushed down to pyarrow. This can lead to different results if comparisons involve null values, because pyarrow handles these differently than Polars does.
- batch_size
The maximum row count for scanned pyarrow record batches.
Warning
This API is experimental and may change without it being considered a breaking change.
Notes
When using partitioning, the appropriate partitioning option must be set on pyarrow.dataset.dataset before passing the dataset to Polars, or the partitioned-on column(s) may not get passed to Polars.
Examples
>>> import polars as pl
>>> import pyarrow.dataset as ds
>>> dset = ds.dataset("s3://my-partitioned-folder/", format="ipc")
>>> (
...     pl.scan_pyarrow_dataset(dset)
...     .filter("bools")
...     .select(["bools", "floats", "date"])
...     .collect()
... )
shape: (1, 3)
┌───────┬────────┬────────────┐
│ bools ┆ floats ┆ date       │
│ ---   ┆ ---    ┆ ---        │
│ bool  ┆ f64    ┆ date       │
╞═══════╪════════╪════════════╡
│ true  ┆ 2.0    ┆ 1970-05-04 │
└───────┴────────┴────────────┘