Execute remote query

Polars Cloud enables you to execute existing Polars queries on cloud infrastructure with minimal code changes. This lets you process datasets that exceed your local machine's resources, or draw on additional compute for faster execution.

Polars Cloud is set up and connected

This page assumes that you have created an organization and connected a workspace to your cloud environment. If you haven't yet, follow the steps on the Connect cloud environment page.
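If your environment is connected but your local session is not yet authenticated, you can log in from Python before running any remote query. A minimal sketch; pc.authenticate() opens a browser window to complete the login flow:

import polars_cloud as pc

# Authenticate your local session with Polars Cloud (opens a browser window)
pc.authenticate()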

Define your query locally

The following example uses a query from the PDS-H benchmark suite, a derived version of the popular TPC-H benchmark. Data generation tools and additional queries are available in the Polars benchmark repository.

import polars as pl

customer = pl.scan_parquet("data/customer.parquet")
lineitem = pl.scan_parquet("data/lineitem.parquet")
orders = pl.scan_parquet("data/orders.parquet")

def pdsh_q3(customer, lineitem, orders):
    return (
        customer.filter(pl.col("c_mktsegment") == "BUILDING")
        .join(orders, left_on="c_custkey", right_on="o_custkey")
        .join(lineitem, left_on="o_orderkey", right_on="l_orderkey")
        .filter(pl.col("o_orderdate") < pl.date(1995, 3, 15))
        .filter(pl.col("l_shipdate") > pl.date(1995, 3, 15))
        .with_columns(
            (pl.col("l_extendedprice") * (1 - pl.col("l_discount"))).alias("revenue")
        )
        .group_by("o_orderkey", "o_orderdate", "o_shippriority")
        .agg(pl.sum("revenue"))
        .select(
            pl.col("o_orderkey").alias("l_orderkey"),
            "revenue",
            "o_orderdate",
            "o_shippriority",
        )
        .sort(by=["revenue", "o_orderdate"], descending=[True, False])
    )


pdsh_q3(customer, lineitem, orders).collect()

Scale to the cloud

To execute your query in the cloud, you need to define a compute context. The compute context specifies the hardware used to execute the query: the workspace to run in and the compute resources to allocate. More elaborate options can be found on the Compute context introduction page.

import polars_cloud as pc

ctx = pc.ComputeContext(
    # make sure to enter your own workspace name
    workspace="your-workspace",
    memory=16,
    cpus=12,
)
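If you would rather pin an exact machine type than describe it in CPUs and memory, ComputeContext also accepts an instance type. A minimal sketch; the instance type string is an example, and availability depends on your cloud setup:

# Alternative sizing: request a specific instance type instead of cpus/memory
ctx = pc.ComputeContext(
    workspace="your-workspace",
    instance_type="m6i.2xlarge",  # example type; pick one available in your region
)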

# Use a larger dataset available on S3
lineitem_sf10 = pl.scan_parquet(
    "s3://polars-cloud-samples-us-east-2-prd/pdsh/sf10/lineitem.parquet",
    storage_options={"request_payer": "true"},
)
customer_sf10 = pl.scan_parquet(
    "s3://polars-cloud-samples-us-east-2-prd/pdsh/sf10/customer.parquet",
    storage_options={"request_payer": "true"},
)
orders_sf10 = pl.scan_parquet(
    "s3://polars-cloud-samples-us-east-2-prd/pdsh/sf10/orders.parquet",
    storage_options={"request_payer": "true"},
)

# Your query remains the same
pdsh_q3(customer_sf10, lineitem_sf10, orders_sf10).remote(context=ctx).show()

Run the examples yourself

All examples on this page can be executed against the sample datasets hosted on our S3 bucket. By including the storage_options parameter in your queries, you'll only incur S3 data transfer costs; no additional storage fees apply.

S3 bucket region

The example datasets are hosted in the us-east-2 S3 region. Query performance may be affected if you're running operations from a distant geographic location due to network latency.

Working with remote query results

Once you've called .remote(context=ctx) on your query, you have several options for how to handle the results, each suited to different use cases and workflows.

Write to storage

The most straightforward approach for batch processing is to write results directly to cloud storage using .sink_parquet(). This method is ideal when you want to store processed data for later use or as part of a data pipeline:

# Replace the S3 URL with your own to run the query successfully
pdsh_q3(customer_sf10, lineitem_sf10, orders_sf10).remote(context=ctx).sink_parquet(
    "s3://your-bucket/processed-data/"
)

Running .sink_parquet() writes the results to the specified bucket on S3. The query runs in your cloud environment, so both the data and the results stay within your own infrastructure. This approach is well suited to ETL workflows, scheduled jobs, or any time you need to persist large datasets without transferring them to your local machine.
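For example, a scheduled job might write each run to its own dated prefix so downstream consumers always pick up the latest batch. A minimal sketch; the bucket name and path layout are hypothetical:

from datetime import date

# Hypothetical bucket: each run lands under its own dated prefix
output_uri = f"s3://your-bucket/processed-data/{date.today().isoformat()}/"

pdsh_q3(customer_sf10, lineitem_sf10, orders_sf10).remote(context=ctx).sink_parquet(output_uri)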

Inspect results

Calling .show() displays the first 10 rows of the result in your console or notebook, so you can inspect its structure without transferring the whole dataset.

pdsh_q3(customer_sf10, lineitem_sf10, orders_sf10).remote(context=ctx).show()
shape: (10, 4)
┌────────────┬─────────────┬─────────────┬────────────────┐
│ l_orderkey ┆ revenue     ┆ o_orderdate ┆ o_shippriority │
│ ---        ┆ ---         ┆ ---         ┆ ---            │
│ i64        ┆ f64         ┆ date        ┆ i64            │
╞════════════╪═════════════╪═════════════╪════════════════╡
│ 4791171    ┆ 440715.2185 ┆ 1995-02-23  ┆ 0              │
│ 46678469   ┆ 439855.325  ┆ 1995-01-27  ┆ 0              │
│ 23906758   ┆ 432728.5737 ┆ 1995-03-14  ┆ 0              │
│ 23861382   ┆ 428739.1368 ┆ 1995-03-09  ┆ 0              │
│ 59393639   ┆ 426036.0662 ┆ 1995-02-12  ┆ 0              │
│ 3355202    ┆ 425100.6657 ┆ 1995-03-04  ┆ 0              │
│ 9806272    ┆ 425088.0568 ┆ 1995-03-13  ┆ 0              │
│ 22810436   ┆ 423231.969  ┆ 1995-01-02  ┆ 0              │
│ 16384100   ┆ 421478.7294 ┆ 1995-03-02  ┆ 0              │
│ 52974151   ┆ 415367.1195 ┆ 1995-02-05  ┆ 0              │
└────────────┴─────────────┴─────────────┴────────────────┘

Continue analysis on intermediate results

The .await_and_scan() method returns a LazyFrame pointing to intermediate results stored temporarily in your S3 environment. These intermediate files are automatically deleted after several hours; for persistent storage, use .sink_parquet() instead. Because the output is a LazyFrame, you can continue chaining operations for further analysis, as sketched after the output below.

result = pdsh_q3(customer_sf10, lineitem_sf10, orders_sf10).remote(context=ctx).await_and_scan()

print(result.collect())
shape: (114_003, 4)
┌────────────┬─────────────┬─────────────┬────────────────┐
│ l_orderkey ┆ revenue     ┆ o_orderdate ┆ o_shippriority │
│ ---        ┆ ---         ┆ ---         ┆ ---            │
│ i64        ┆ f64         ┆ date        ┆ i64            │
╞════════════╪═════════════╪═════════════╪════════════════╡
│ 4791171    ┆ 440715.2185 ┆ 1995-02-23  ┆ 0              │
│ 46678469   ┆ 439855.325  ┆ 1995-01-27  ┆ 0              │
│ 23906758   ┆ 432728.5737 ┆ 1995-03-14  ┆ 0              │
│ 23861382   ┆ 428739.1368 ┆ 1995-03-09  ┆ 0              │
│ 59393639   ┆ 426036.0662 ┆ 1995-02-12  ┆ 0              │
│ …          ┆ …           ┆ …           ┆ …              │
│ 44149381   ┆ 904.3968    ┆ 1995-01-16  ┆ 0              │
│ 34297697   ┆ 897.8464    ┆ 1995-03-06  ┆ 0              │
│ 25478115   ┆ 887.2318    ┆ 1994-11-28  ┆ 0              │
│ 52204674   ┆ 860.25      ┆ 1994-12-18  ┆ 0              │
│ 47255457   ┆ 838.9381    ┆ 1994-11-18  ┆ 0              │
└────────────┴─────────────┴─────────────┴────────────────┘
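
Because result is an ordinary LazyFrame, you can keep refining it locally before materializing. A minimal sketch that narrows the intermediate results to orders from March 1995 onwards and takes the top five by revenue (the result is already sorted by revenue, descending):

# Continue chaining on the scanned intermediate results
march_top = (
    result.filter(pl.col("o_orderdate") >= pl.date(1995, 3, 1))
    .head(5)
    .collect()
)
print(march_top)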