polars.scan_iceberg
- polars.scan_iceberg(
- source: str | Table,
- *,
- snapshot_id: int | None = None,
- storage_options: dict[str, Any] | None = None,
- reader_override: Literal['native', 'pyiceberg'] | None = None,
- use_metadata_statistics: bool = True,
- ) -> LazyFrame
Lazily read from an Apache Iceberg table.
- Parameters:
- source
A PyIceberg table, or a direct path to the metadata.
Note: For the local filesystem, absolute and relative paths are supported, but for the supported object storages (GCS, Azure, and S3) a full URI must be provided.
- snapshot_id
The snapshot ID to scan from.
- storage_options
Extra options for the storage backends supported by pyiceberg. For cloud storages, this may include configurations for authentication etc. More information is available in the PyIceberg documentation.
- reader_override
Overrides the reader used to read the data.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Note that this parameter should not be necessary outside of testing, as Polars will by default automatically select the best reader.
Available options:
native: Uses Polars' native reader, which allows for more optimizations to improve performance.
pyiceberg: Uses PyIceberg, which may support more features.
- use_metadata_statistics
Load and use min/max statistics from Iceberg metadata files when a filter is present. This allows the reader to potentially skip loading metadata from the underlying data files.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- Returns:
- LazyFrame
Examples
Creates a scan for an Iceberg table from the local filesystem or an object store.
>>> table_path = "file:/path/to/iceberg-table/metadata.json"
>>> pl.scan_iceberg(table_path).collect()
Creates a scan for an Iceberg table from S3. The supported storage options for S3 are listed in the PyIceberg documentation.
>>> table_path = "s3://bucket/path/to/iceberg-table/metadata.json"
>>> storage_options = {
...     "s3.region": "eu-central-1",
...     "s3.access-key-id": "THE_AWS_ACCESS_KEY_ID",
...     "s3.secret-access-key": "THE_AWS_SECRET_ACCESS_KEY",
... }
>>> pl.scan_iceberg(
...     table_path, storage_options=storage_options
... ).collect()
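Rather than hard-coding credentials as above, the same PyIceberg-style option keys can be assembled from standard AWS environment variables. This is a minimal sketch; the helper function is hypothetical and not part of Polars, but the option keys match those shown in the example:

```python
import os

def s3_storage_options(region: str) -> dict[str, str]:
    # Hypothetical helper: builds the PyIceberg-style S3 option keys
    # from the standard AWS credential environment variables.
    return {
        "s3.region": region,
        "s3.access-key-id": os.environ.get("AWS_ACCESS_KEY_ID", ""),
        "s3.secret-access-key": os.environ.get("AWS_SECRET_ACCESS_KEY", ""),
    }

opts = s3_storage_options("eu-central-1")
```

The resulting dict can be passed directly as storage_options to pl.scan_iceberg.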
Creates a scan for an Iceberg table from Azure. The supported options for Azure are listed in the PyIceberg documentation.
The following types of table paths are supported:
az://<container>/<path>/metadata.json
adl://<container>/<path>/metadata.json
abfs[s]://<container>/<path>/metadata.json
>>> table_path = "az://container/path/to/iceberg-table/metadata.json"
>>> storage_options = {
...     "adlfs.account-name": "AZURE_STORAGE_ACCOUNT_NAME",
...     "adlfs.account-key": "AZURE_STORAGE_ACCOUNT_KEY",
... }
>>> pl.scan_iceberg(
...     table_path, storage_options=storage_options
... ).collect()
Creates a scan for an Iceberg table from Google Cloud Storage. The supported options for GCS are listed in the PyIceberg documentation.
>>> table_path = "gs://bucket/path/to/iceberg-table/metadata.json"
>>> storage_options = {
...     "gcs.project-id": "my-gcp-project",
...     "gcs.oauth.token": "ya29.dr.AfM...",
... }
>>> pl.scan_iceberg(
...     table_path, storage_options=storage_options
... ).collect()
Creates a scan for an Iceberg table with additional options. In the example below, the py-io-impl option instructs PyIceberg to use its fsspec-based file IO implementation.
>>> table_path = "/path/to/iceberg-table/metadata.json"
>>> storage_options = {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}
>>> pl.scan_iceberg(
...     table_path, storage_options=storage_options
... ).collect()
Creates a scan for an Iceberg table using a specific snapshot ID.
>>> table_path = "/path/to/iceberg-table/metadata.json"
>>> snapshot_id = 7051579356916758811
>>> pl.scan_iceberg(table_path, snapshot_id=snapshot_id).collect()
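The min/max pruning that use_metadata_statistics enables can be illustrated with a small standalone sketch. The file list and statistics below are invented for illustration; Polars performs this kind of pruning internally using the statistics stored in Iceberg metadata files:

```python
# Illustrative sketch of min/max file pruning (not Polars internals).
# Given per-file column statistics, files whose value range cannot
# satisfy a filter such as `col > 100` can be skipped entirely.

files = [
    {"path": "data-1.parquet", "min": 0, "max": 50},
    {"path": "data-2.parquet", "min": 90, "max": 150},
    {"path": "data-3.parquet", "min": 200, "max": 300},
]

def may_contain_gt(stats: dict, threshold: int) -> bool:
    # A file can hold rows matching `col > threshold` only if its
    # recorded maximum exceeds the threshold.
    return stats["max"] > threshold

kept = [f["path"] for f in files if may_contain_gt(f, 100)]
# kept == ["data-2.parquet", "data-3.parquet"]
```

Here data-1.parquet is never read, because its maximum value (50) rules out any match for the filter.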