Public datasets
Public datasets
Start experimenting with Polars Cloud immediately using our curated public datasets. These datasets span different scale factors, letting you test performance across various data sizes—from small exploratory queries to large-scale processing workloads.
Available datasets
PDSH - derived from TPC-H benchmark Standard analytical queries for testing joins, aggregations, and filtering operations. Queries available in the Polars benchmark repository.
PDSDS - derived from TPC-DS benchmark Decision support dataset designed for complex analytical workloads.
NYC Taxi - source: NYC.gov Real-world transportation data with temporal patterns and geospatial dimensions.
Usage
Access any dataset directly from your Polars code and execute in Polars Cloud:
data = pl.scan_parquet(
"s3://polars-cloud-samples-us-east-2-prd/{dataset}/{scale_factor/year}/",
storage_options={"request_payer": "true"}
)
query = data.select().remote(ctx).execute()
Note: These buckets use AWS Requester Pays, meaning you pay only for pays the cost of the request and the data download from the bucket. The storage costs are covered.
Dataset URLs
All datasets are hosted in AWS region us-east-2
and use Requester Pays buckets.
PDSH (TPC-H derived)
Scale Factor | Size | URL Pattern | Format |
---|---|---|---|
SF10 | ~10GB | s3://polars-cloud-samples-us-east-2-prd/pdsh/sf10/{filename}.parquet |
Single files |
SF100 | ~100GB | s3://polars-cloud-samples-us-east-2-prd/pdsh/sf100/{table}/_.parquet |
Partitioned |
SF1000 | ~1TB | s3://polars-cloud-samples-us-east-2-prd/pdsh/sf1000/{table}/_.parquet |
Partitioned |
Example
data = pl.scan_parquet(
"s3://polars-cloud-samples-us-east-2-prd/pdsh/sf10/lineitem.parquet",
storage_options={"request_payer": "true"}
)
partitioned_data = pl.scan_parquet(
"s3://polars-cloud-samples-us-east-2-prd/pdsh/sf100/lineitem/*.parquet",
storage_options={"request_payer": "true"}
)
PDSDS (TPC-DS derived)
Scale Factor | Size | URL Pattern |
---|---|---|
SF1 | ~1GB | s3://polars-cloud-samples-us-east-2-prd/pdsds/sf1/{filename}.parquet |
SF10 | ~10GB | s3://polars-cloud-samples-us-east-2-prd/pdsds/sf10/{filename}.parquet |
SF100 | ~100GB | s3://polars-cloud-samples-us-east-2-prd/pdsds/sf100/{filename}.parquet |
SF300 | ~300GB | s3://polars-cloud-samples-us-east-2-prd/pdsds/sf300/{filename}.parquet |
Example
data = pl.scan_parquet(
"s3://polars-cloud-samples-us-east-2-prd/pdsh/sf10/store_sales.parquet",
storage_options={"request_payer": "true"}
)
NYC Taxi
Year | URL Pattern |
---|---|
2023 | s3://polars-cloud-samples-us-east-2-prd/taxi/2023/{filename}.parquet |
2024 | s3://polars-cloud-samples-us-east-2-prd/taxi/2024/{filename}.parquet |
Example
data = pl.scan_parquet(
"s3://polars-cloud-samples-us-east-2-prd/taxi/2024/yellow_tripdata_2024-01.parquet",
storage_options={"request_payer": "true"}
)