Getting started
Polars Cloud is a managed compute platform for your Polars queries. It allows you to effortlessly run your local queries in your cloud environment, both interactively and for ETL or batch jobs. Because it follows a 'Bring your own Cloud' model, your data never leaves your environment.
Installation
Install the Polars Cloud Python library in your environment:
$ pip install polars polars-cloud
Create an account and log in by running the command below.
$ pc login
Connect your cloud
Polars Cloud currently supports AWS as its only cloud provider.
Polars Cloud needs permission to manage hardware in your environment. You grant this by deploying our CloudFormation template. See our infrastructure section for more details.
To connect your cloud, run:
$ pc setup workspace -n <YOUR_WORKSPACE_NAME>
This redirects you to the browser, where you can connect Polars Cloud to your AWS environment. Alternatively, you can skip the command and create the workspace directly in the browser.
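If you belong to multiple workspaces, you can target a specific one when you later create a compute context. A minimal sketch, assuming ComputeContext accepts a workspace keyword (the example in the next section omits it and uses the default):

import polars_cloud as pc

# Assumption: ComputeContext takes a `workspace` keyword to select
# which workspace the cluster is started in.
ctx = pc.ComputeContext(workspace="<YOUR_WORKSPACE_NAME>", cpus=2, memory=8)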
Run your queries
Now that the setup is done, we can start running queries. The general principle is to write Polars as you always have and call .remote() on your LazyFrame. The following example shows how to create a compute cluster and run a simple Polars query.
import polars_cloud as pc
import polars as pl

ctx = pc.ComputeContext(memory=8, cpus=2, cluster_size=1)

lf = pl.LazyFrame(
    {
        "a": [1, 2, 3],
        "b": [4, 4, 5],
    }
).with_columns(
    pl.col("a").max().over("b").alias("c"),
)

(
    lf.remote(context=ctx)
    .sink_parquet(uri="s3://my-bucket/result.parquet")
)
Let us go through the code line by line. First we define the hardware the cluster will run on. This can be specified in terms of CPU and memory, or by giving the exact AWS instance type.
ctx = pc.ComputeContext(memory=8, cpus=2, cluster_size=1)
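The instance-type alternative looks roughly like this; a sketch assuming ComputeContext accepts an instance_type keyword, with an illustrative instance name:

# Assumption: `instance_type` selects the exact AWS instance type
# instead of specifying CPU and memory; "t3.large" is illustrative.
ctx = pc.ComputeContext(instance_type="t3.large", cluster_size=1)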
Then we write a regular lazy Polars query. In this simple example we compute the maximum of column a over column b.
lf = pl.LazyFrame(
    {
        "a": [1, 2, 3],
        "b": [4, 4, 5],
    }
).with_columns(
    c=pl.col("a").max().over("b"),
)
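To inspect what the window function computes, you can run the same query locally with .collect(); this is plain Polars and does not touch the cluster:

print(lf.collect())
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 4   ┆ 2   │
# │ 2   ┆ 4   ┆ 2   │
# │ 3   ┆ 5   ┆ 3   │
# └─────┴─────┴─────┘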
Finally we run our query on the compute cluster. We use .remote() to signify that we want to run the query remotely. This gives back a special version of the LazyFrame with extension methods. Up until this point nothing has executed yet; calling .sink_parquet() sends the query to the compute cluster, which executes it and writes the result to your S3 bucket.
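Once the job finishes, the result is an ordinary Parquet file in your bucket. You can read it back with plain Polars, assuming your local environment has credentials for the bucket:

result = pl.scan_parquet("s3://my-bucket/result.parquet").collect()
print(result)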