Checkpointing

When running long queries, you can enable checkpointing so that intermediate state is persisted during execution. If a worker fails partway through, the query can resume from the last checkpoint instead of starting over.

Checkpointing requires configuration on both the scheduler and the workers: the scheduler decides when checkpoints are created, while the workers store the checkpoint data.

Note

Checkpointing is a no-op when the query's shuffles already write to a shared storage system. In that case the intermediate state is persisted there as part of normal execution, so there is nothing extra for checkpointing to store.

Scheduler configuration

Enable checkpointing on the scheduler by adding the [scheduler.checkpoint] section. The period controls how often checkpoints are created: a checkpoint is created once the period has elapsed after a stage completes.

[scheduler]
enabled = true
# ...

[scheduler.checkpoint]
enabled = true
period = "10 mins"

Checkpointing must be turned on explicitly by setting enabled = true, just like the other components. The period accepts either jiff's "friendly" duration format (see the jiff documentation), e.g. 10 mins, or an ISO 8601 duration, e.g. PT10M for 10 minutes.
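The same ten-minute period can also be written in ISO 8601 form. A minimal sketch, reusing only the keys from the scheduler example above:

```toml
[scheduler.checkpoint]
enabled = true
# ISO 8601 duration: "PT10M" means 10 minutes, equivalent to "10 mins".
period = "PT10M"
```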

Worker configuration

Each worker needs a location to store its checkpoint data, configured through worker.checkpoint_location. This can be a shared filesystem or an object store on S3, Google Cloud Storage, or Azure Blob Storage.

Shared filesystem

If your infrastructure provides a shared file system, such as NFS or CephFS, you can use it here. Every worker must be able to access the location at the same path.

[worker]
enabled = true
checkpoint_location.shared_filesystem.path = "/mnt/storage/polars/checkpoints"
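Because checkpointing needs both halves configured, it can help to see them side by side. The following is only a sketch: it assumes the scheduler and worker sections live in one configuration file, which may not match your deployment. If they are configured separately, place each section in the corresponding file.

```toml
# Scheduler side: decides when checkpoints are created.
[scheduler]
enabled = true
# ... (other scheduler settings elided)

[scheduler.checkpoint]
enabled = true
period = "10 mins"

# Worker side: stores the checkpoint data.
[worker]
enabled = true
# Must resolve to the same shared location on every worker.
checkpoint_location.shared_filesystem.path = "/mnt/storage/polars/checkpoints"
```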

S3-compatible storage

[worker]
enabled = true
checkpoint_location.s3.url = "s3://bucket/path/to/key"
checkpoint_location.s3.aws_access_key_id = "YOURACCESSKEY"
checkpoint_location.s3.aws_secret_access_key = "YOURSECRETKEY"

If you self-host an S3-compatible storage solution, set the aws_endpoint_url configuration option to point at your own endpoint.

[worker]
checkpoint_location.s3.aws_endpoint_url = "http://your-s3-compatible-storage-host:8080"
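For a self-hosted store, the endpoint override is normally combined with the url and credential keys shown earlier. A sketch, assuming all of these keys can be set together in a single [worker] section; the bucket, host, and credentials are placeholders:

```toml
[worker]
enabled = true
# Placeholder bucket and prefix for the checkpoint data.
checkpoint_location.s3.url = "s3://bucket/path/to/key"
checkpoint_location.s3.aws_access_key_id = "YOURACCESSKEY"
checkpoint_location.s3.aws_secret_access_key = "YOURSECRETKEY"
# Points the S3 client at the self-hosted endpoint instead of AWS.
checkpoint_location.s3.aws_endpoint_url = "http://your-s3-compatible-storage-host:8080"
```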

Google Cloud Storage

[worker]
enabled = true
checkpoint_location.gcs.url = "gs://bucket/path/to/key"
checkpoint_location.gcs.google_service_account_path = "/etc/polars/gcs-service-account.json"

Azure Blob Storage

[worker]
enabled = true
checkpoint_location.abs.url = "az://container/path/to/key"
checkpoint_location.abs.azure_storage_account_name = "YOURACCOUNT"
checkpoint_location.abs.azure_storage_account_key = "YOURKEY"

Object store options

For the object store options (s3, gcs, and abs), the allowed keys are the same as in scan_parquet() (e.g. aws_access_key_id, google_service_account_path, azure_storage_account_name). Through the s3 options you can also use any other service that supports the S3 API, such as MinIO or DigitalOcean Spaces.