Run your first query
Polars Cloud makes it simple to run your query at scale. You can take any existing Polars query and run it on powerful cloud infrastructure with just a few additional lines of code. This allows you to process datasets that are too large for your local machine or leverage more compute power when you need it.
Polars Cloud is set up and connected
This page assumes that you have created an organization and connected a workspace to your cloud environment. If you haven't yet, you can find more information on the Connect cloud environment page.
Define your query locally
If you're already familiar with Polars, you will immediately be productive with Polars Cloud. The same lazy evaluation and API you know work exactly the same way, with the addition of defining a ComputeContext and calling .remote(context=ctx) to execute your query in your cloud environment instead of locally.
Below we define a query that you would typically write on your local machine.
import polars as pl

query = (
    pl.scan_parquet("s3://my-bucket/data/*.parquet")
    .filter(pl.col("status") == "active")
    .group_by("category")
    .agg(pl.col("amount").sum().alias("total_amount"))
    .sort("total_amount")
)

query.collect()
Scale to the cloud
To execute your query in the cloud, you must define a compute context. The compute context defines the hardware to use when executing the query in the cloud. It allows you to define the workspace in which to execute your query, set compute resources, and specify whether you want to execute in job mode. More elaborate options can be found on the Compute context introduction page.
import polars_cloud as pc
ctx = pc.ComputeContext(
    workspace="your-workspace",
    memory=48,
    cpus=24,
)
# Your query remains the same
query.remote(context=ctx).collect()
Job mode
Job mode is used for systematic data processing pipelines that are written for scheduled execution. These jobs typically process large volumes of data at scheduled intervals (e.g. hourly or daily). A key characteristic is that such jobs typically run without a human in the loop.
Read more about the differences on the compute context page.
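As a rough sketch of what this looks like in practice, a job-mode pipeline is typically a self-contained script, built from the same calls shown on this page, that writes its output straight to cloud storage instead of returning it to a notebook. The query and output path below are purely illustrative; for larger volumes you would add a cluster_size and .distributed() as shown in the next section.

import polars as pl
import polars_cloud as pc

# Hypothetical scheduled job: aggregate active records and persist the result
ctx = pc.ComputeContext(workspace="your-workspace", cpus=24, memory=48)

pipeline = (
    pl.scan_parquet("s3://my-bucket/data/*.parquet")
    .filter(pl.col("status") == "active")
    .group_by("category")
    .agg(pl.col("amount").sum().alias("total_amount"))
)

# No human in the loop: the job writes directly to S3 and exits
pipeline.remote(context=ctx).sink_parquet("s3://my-bucket/output/")

A scheduler such as cron or an orchestrator would then run this script at the desired interval.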
Distributed execution
To run your queries over multiple nodes, you must define the cluster_size in the ComputeContext and call the .distributed() method.
ctx = pc.ComputeContext(cpus=10, memory=10, cluster_size=10)
query.remote(context=ctx).distributed().sink_parquet("s3://my-bucket/output/")
In this example, query execution is distributed across 10 machines, providing a total of 100 CPUs and 100 GB of RAM for processing. Find more information about executing your queries on multiple nodes on the distributed queries page.
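As a point of reference (the numbers below are illustrative, not a recommendation), total capacity scales as cluster_size times the per-node cpus and memory, so a larger cluster only requires a different context:

# Illustrative: 20 nodes x 16 CPUs x 64 GB = 320 CPUs and 1,280 GB of RAM in total
big_ctx = pc.ComputeContext(
    workspace="your-workspace",
    cpus=16,
    memory=64,
    cluster_size=20,
)
query.remote(context=big_ctx).distributed().sink_parquet("s3://my-bucket/output/")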
Working with remote query results
Once you've called .remote(context=ctx) on your query, you have several options for how to handle the results, each suited to different use cases and workflows.
Direct or intermediate storage of results
The most straightforward approach for batch processing is to write results directly to cloud storage using .sink_parquet(). This method is ideal when you want to store processed data for later use or as part of a data pipeline:
ctx = pc.ComputeContext(cpus=10, memory=10, cluster_size=10)
query.remote(context=ctx).distributed().sink_parquet("s3://my-bucket/output/")
Running .sink_parquet() will write the results to the defined bucket on S3. The query you execute in job mode runs in your cloud environment, and the data and results remain secure in your own infrastructure. This approach is perfect for ETL workflows, scheduled jobs, or any time you need to persist large datasets without transferring them to your local machine.
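Because the output is plain Parquet in your own bucket, any downstream step can pick it up again with a regular scan. A minimal sketch, assuming the sink above produced one or more Parquet files under that prefix (the filter threshold is only illustrative):

# Read the persisted results back, locally or in a later pipeline step
totals = pl.scan_parquet("s3://my-bucket/output/*.parquet")
print(totals.filter(pl.col("total_amount") > 1_000).collect())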
Inspect results in your local environment
For exploratory analysis and interactive workflows, use .collect() or .show() to return query results:
# Quick preview of results
query.remote(context=ctx).show()

# Full results for further analysis
result = query.remote(context=ctx).collect()
print(result.collect())
The .show() method is perfect for data exploration where you want to understand the structure and content of your results without the overhead of transferring large datasets. It displays a sample of rows in your console or notebook.
When calling .collect() on your remote query execution, the intermediate results are written to a temporary location on S3. These intermediate result files are automatically deleted after several hours. The output of the remote query is a LazyFrame, which means you can continue chaining operations on the returned result for further analysis. Find a more elaborate example on the Example workflow page.
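As a short illustration of this chaining, building on the query defined earlier (the threshold below is only illustrative):

result = query.remote(context=ctx).collect()

# `result` is a LazyFrame over the intermediate files on S3; nothing is
# transferred until it is collected locally
top_categories = (
    result.filter(pl.col("total_amount") > 10_000)
    .sort("total_amount", descending=True)
    .collect()
)
print(top_categories)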