Getting started
This section walks you through the steps required to deploy a cluster on your own
Kubernetes infrastructure. We assume that this infrastructure is already provisioned and that the
tools needed to interact with it (kubectl and Helm) are installed.
Start by creating an account if you do not already have one, and follow the steps to create a Kubernetes Workspace.
Create a Polars service account
Each cluster, regardless of its deployment type, needs to authenticate and register with our control plane. This is done via a service account: a credential pair (client ID and client secret) that you create and store on your end. We do not store it on our side; if it is lost, a new pair must be generated. Service accounts are managed (created and revoked) in the settings page of the workspace, or can be created with the CLI:
pc service-account create --workspace-name my-workspace --name my-sa
Note this service account is distinct from any Kubernetes service account object; it is an identity managed solely within Polars.
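Since the credential pair is stored only on your side, one option is to keep it out of scripts and shell history by reading it from the environment. The snippet below is a minimal sketch of that idea; the variable names POLARS_CLIENT_ID and POLARS_CLIENT_SECRET are hypothetical choices made for this example and are not defined by Polars.

```python
import os
from typing import Mapping, NamedTuple


class ServiceAccount(NamedTuple):
    client_id: str
    client_secret: str


# Hypothetical variable names, chosen for this sketch only.
REQUIRED_VARS = ("POLARS_CLIENT_ID", "POLARS_CLIENT_SECRET")


def load_service_account(env: Mapping[str, str] = os.environ) -> ServiceAccount:
    """Read the credential pair from the environment, failing loudly if absent."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing service account variables: {', '.join(missing)}")
    return ServiceAccount(env["POLARS_CLIENT_ID"], env["POLARS_CLIENT_SECRET"])
```

The returned values can then be passed to the Helm command as clientId and clientSecret.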
Deploy your cluster with Helm
Now that the admin is done, the most exciting part: the actual deployment. We distribute a Helm chart via this repo, and you can use the following commands to register the repo and install the chart:
Workspace ID
The Workspace ID can be found in the workspace settings page or with pc workspace list.
helm repo add polars-inc https://polars-inc.github.io/helm-charts
helm repo update
helm upgrade --install polars polars-inc/polars \
  --set clusterId="My First Cluster" \
  --set workspaceId=<WORKSPACE ID> \
  --set clientId=<SERVICE ACCOUNT ID> \
  --set clientSecret=<SERVICE ACCOUNT SECRET> \
  --set scheduler.deployment.runtimeContainer.resources.requests.memory=1Gi \
  --set worker.deployment.replicaCount=2 \
  --set worker.deployment.runtimeContainer.resources.requests.memory=4Gi \
  --set worker.deployment.runtimeContainer.resources.limits.memory=4Gi \
  --set anonymousResults.temporaryStorage.enabled=true
Not for production use
The cluster configuration defined above is for a quickstart only and should not be used in a production environment! See the Production configuration section below.
Key parameters explained:
- The ID of the workspace chosen earlier (workspaceId) needs to be provided to uniquely identify the cluster.
- The Polars On-Prem service account credentials generated earlier (clientId and clientSecret) also need to be provided for authentication.
- The Helm command deploys a cluster with 2 workers; feel free to increase that number to fit your needs. Note that we also set memory requests and limits for the deployed resources: a 1 GiB request for the scheduler, and a 4 GiB request and limit for each worker.
- For remote Polars queries without a specific output sink, Polars On-Prem automatically adds a persistent sink. We call this sink the "anonymous results" sink. Infrastructure-wise, this sink is backed by S3-compatible storage, which should be accessible from all worker nodes and the client. For a lightweight quickstart we opted for SeaweedFS, backed by an emptyDir (all temporary results will be lost upon restart of that service).
Helm will install the cluster in the default Kubernetes namespace. During setup, the
workspaceId, clientId and clientSecret values are all prefilled in the frontend.
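The dotted paths used with --set in the command above correspond to a nested values structure. As an illustration only, the following sketch flattens such a structure back into --set arguments; the helper names flatten and helm_set_args are invented for this example, and the credentials are deliberately left out.

```python
from typing import Any, Iterator, List, Mapping, Tuple


def flatten(values: Mapping[str, Any], prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted.path, leaf_value) pairs from a nested mapping."""
    for key, value in values.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, Mapping):
            yield from flatten(value, path)
        else:
            yield path, value


def helm_set_args(values: Mapping[str, Any]) -> List[str]:
    """Build a list of '--set path=value' arguments for a helm command."""
    args: List[str] = []
    for path, value in flatten(values):
        if isinstance(value, bool):
            value = "true" if value else "false"  # lowercase booleans, as in the command above
        args += ["--set", f"{path}={value}"]
    return args


# Same settings as the quickstart command, minus workspaceId and credentials.
values = {
    "clusterId": "My First Cluster",
    "scheduler": {"deployment": {"runtimeContainer": {"resources": {"requests": {"memory": "1Gi"}}}}},
    "worker": {
        "deployment": {
            "replicaCount": 2,
            "runtimeContainer": {
                "resources": {"requests": {"memory": "4Gi"}, "limits": {"memory": "4Gi"}}
            },
        }
    },
    "anonymousResults": {"temporaryStorage": {"enabled": True}},
}

for arg in helm_set_args(values):
    print(arg)
```

Passing the resulting list to a process launcher as separate arguments (rather than through a shell string) avoids quoting issues with values that contain spaces, such as the cluster name.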
Once the installation completes, verify that all pods are running:
kubectl get pods
You should see output similar to:
NAME READY STATUS RESTARTS AGE
polars-scheduler-xxxxxxxxx-xxxxx 1/1 Running 0 1m
polars-worker-xxxxxxxxx-xxxxx 1/1 Running 0 1m
polars-worker-xxxxxxxxx-xxxxx 1/1 Running 0 1m
polars-temporary-storage-xxxxxxxxx-xxxxx 1/1 Running 0 1m
Once all pods show a Running status, the cluster is registered with our control plane and ready to
accept queries.
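The check described above can be automated. The sketch below parses the plain-text table printed by kubectl get pods and lists the pods whose STATUS is not Running; the function name pods_not_running is hypothetical, and the sample text simply mirrors the output shown earlier.

```python
from typing import List


def pods_not_running(kubectl_output: str) -> List[str]:
    """Return the names of pods whose STATUS column is not 'Running'."""
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no pods listed")
    header = lines[0].split()
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    bad = []
    for line in lines[1:]:
        cols = line.split()
        if cols[status_idx] != "Running":
            bad.append(cols[name_idx])
    return bad


SAMPLE = """\
NAME READY STATUS RESTARTS AGE
polars-scheduler-xxxxxxxxx-xxxxx 1/1 Running 0 1m
polars-worker-xxxxxxxxx-xxxxx 1/1 Running 0 1m
polars-worker-xxxxxxxxx-xxxxx 1/1 Running 0 1m
polars-temporary-storage-xxxxxxxxx-xxxxx 1/1 Running 0 1m
"""

print(pods_not_running(SAMPLE))  # prints [] when every pod is Running
```

In practice the text could come from subprocess.run(["kubectl", "get", "pods"], capture_output=True, text=True, check=True).stdout.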
Run your first query
Below you can find a very minimal Polars query to test if your deployment was successful. For a quickstart, port-forward the required endpoints:
kubectl port-forward svc/polars-scheduler 5051:5051
kubectl port-forward svc/polars-observatory 3001:3001
kubectl port-forward svc/polars-temporary-storage 8333:8333
For a more detailed explanation of the requirements behind these port-forwarding commands, see the next section. Finally, submit your query from a script or notebook cell:
import polars as pl
import polars_cloud as pc
ctx = pc.ClusterContext(compute_address="localhost")
result = (
    pl.LazyFrame()
    .with_columns(a=pl.arange(0, 100000000).sum())
    .remote(ctx)
    .execute()
)
print(result.head())
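To sanity-check the result, the value in column a can be compared with the closed-form sum of the integers from 0 to n - 1, assuming pl.arange(0, n) excludes its upper bound. A small standalone sketch (the function name expected_sum is ours):

```python
def expected_sum(n: int) -> int:
    """Sum of the integers 0, 1, ..., n - 1, computed as n * (n - 1) / 2."""
    return n * (n - 1) // 2


print(expected_sum(100_000_000))  # 4999999950000000
```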
The cluster is now ready to execute your own Polars queries. The following sections give more details about the setup, and what knobs to turn to make your deployment production-ready.
Deployed Kubernetes resources
Each self-hosted deployment creates a few objects, as detailed on the following diagram:

At the moment, no Kubernetes ingress object is provided by our Helm chart. Access to the components listed above therefore requires port-forwarding:
- The scheduler is available on port 5051, and port-forwarding is required to submit queries to the cluster. This is the central service of the cluster, keeping track of the query queue and of query execution.
- The dashboard (aka "observatory") is exposed on port 3001. Port-forwarding gives you access to the locally exposed cluster dashboard, whose lifecycle is tied to the scheduler. The dashboard includes information about the cluster itself, the status of submitted queries, and query profiling.
- The temporary storage is available on port 8333 (default port for SeaweedFS). Port-forwarding gives the Python client access to anonymous results, that is, to the location where the result data is stored.
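Before submitting a query, it can help to confirm that the three forwarded ports listed above accept connections locally. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and a successful connection only shows that the local forward is listening, not that the service behind it is healthy.

```python
import socket
from typing import Dict, List

# Ports taken from the list above.
FORWARDED_PORTS = {"scheduler": 5051, "observatory": 3001, "temporary storage": 8333}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_services(host: str = "localhost", ports: Dict[str, int] = FORWARDED_PORTS) -> List[str]:
    """Return the names of the services whose port does not accept connections."""
    return [name for name, port in ports.items() if not port_open(host, port)]
```

Calling unreachable_services() with the defaults returns an empty list when all three port-forwards are active.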
Our Helm chart can easily be wrapped in a parent chart providing the missing Kubernetes objects to avoid these explicit port-forwards; we chose not to include them for now as users will be bound by the implementation of the Kubernetes distribution of their choice.
To tune the configuration of each service, refer to the Helm chart documentation. A few important configuration options are discussed in the Production configuration section below.
Communication with our control plane
Deployed clusters sync with our control plane to verify licensing and power your dashboard experience. This connection streams query plans, profiling data, and cluster metadata, giving you full visibility into historical usage and query execution patterns so you can optimize and troubleshoot queries.
The data your queries process stays entirely within your environment, and is never shared with us.
We also offer custom solutions for running Polars On-Prem in air-gapped environments in which registration with our servers is not required, and no data is shared with us (see On-Prem Enterprise section).
Production configuration
The complete list of configurable options is provided in the documentation of the Helm chart. The three topics listed below are the main configuration sections to tweak to ready your cluster for production use.

Shown in blue: the optional PVC and S3-compatible storage.
Anonymous results data
For remote queries without a specific output sink, Polars automatically adds a persistent sink. We call this sink the "anonymous results" sink. Infrastructure-wise, this sink is backed by S3-compatible storage, which must be accessible from all worker nodes and the client.
For a lightweight quickstart we opted for SeaweedFS, backed by an emptyDir. In a production environment, any S3-compatible technology can be used (e.g., MinIO, DigitalOcean Spaces). Support for Azure Blob Storage (ABS) and Google Cloud Storage (GCS) is currently being tested (released as beta).
Anonymous results configuration is under the anonymousResults section.
Shuffle data
During query execution, data is spread amongst worker nodes. For certain operations, a worker needs data held by other nodes before it can proceed; in these situations the data is shuffled between worker nodes, according to the bookkeeping done by the scheduler.
By default, emptyDir volumes are used on each worker node. You can however decide to use ephemeral
volumes instead for more configuration flexibility; as an alternative, your own S3-compatible
storage can be used.
Using S3-compatible storage might improve fault tolerance, since intermediate results are stored independently of the worker pods themselves. The performance trade-off depends on the latency and throughput characteristics of your storage backend relative to local volumes. As an example, on AWS, EBS offers lower latency than S3 but lower throughput. This makes EBS a better fit for workloads that produce many small shuffle files, while S3 will outperform it when shuffle files are large.
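The latency versus throughput trade-off can be made concrete with a deliberately simple model: total time is the number of files times the per-file latency, plus the total bytes divided by the throughput. The figures below are illustrative assumptions, not measurements of EBS or S3, and the model ignores parallelism and caching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    latency_s: float       # assumed per-file request latency, in seconds
    throughput_bps: float  # assumed sustained throughput, in bytes per second


def transfer_time(backend: Backend, n_files: int, file_bytes: int) -> float:
    """Sequential model: one latency hit per file, plus time to move all bytes."""
    return n_files * backend.latency_s + n_files * file_bytes / backend.throughput_bps


# Hypothetical figures, chosen only to illustrate the two regimes.
low_latency = Backend("low-latency volume", latency_s=0.001, throughput_bps=250e6)
high_throughput = Backend("object storage", latency_s=0.030, throughput_bps=1e9)

for label, n_files, size in [("many small files", 10_000, 100_000), ("few large files", 10, 1_000_000_000)]:
    a = transfer_time(low_latency, n_files, size)
    b = transfer_time(high_throughput, n_files, size)
    print(f"{label}: {low_latency.name} {a:.2f}s, {high_throughput.name} {b:.2f}s")
```

Under these assumed numbers, the low-latency backend is faster for many small files and the high-throughput backend is faster for a few large ones, which matches the reasoning above; real results depend on your storage backend.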
Shuffle configuration is under the shuffleData section.
Resource allocation
Polars On-Prem is best experienced on dedicated Kubernetes nodes, with only one worker pod per node. However, if other workloads run on the cluster (or if multiple instances of the Polars On-Prem chart are deployed), resource requests and limits should be set on the worker pods via the Kubernetes API. In the same fashion, node selectors, taints, and tolerations should be used to optimize the topology of the cluster.
Resource allocation and cluster topology configuration is under the worker.deployment section.
On-Prem Enterprise
If you are interested in deploying one or several clusters without any resource limitations or data sharing, on bare-metal machines or in a Kubernetes setup, including in air-gapped environments, please sign up here to apply.