
# Config file reference

This page describes the different configuration options for Polars on-premises. The config file is a standard TOML file with different sections. Any of the configuration options can be overridden using environment variables of the form `PC_CUBLET__section_name__key`; for example, following this format, `PC_CUBLET__scheduler__n_workers=4` would override the `n_workers` key in the `[scheduler]` section.

## Top-level configuration

The polars-on-premises binary requires a license, whose path is provided as a configuration option (listed below). The license itself has the following shape:

{ "params": { "expiry": "2026-01-31T23:59:59Z", "name": "Company" }, "signature": "..." }
| Key | Type | Description |
|-----|------|-------------|
| `cluster_id` | string | Logical ID for the cluster; workers and scheduler that share this ID will form a single cluster. e.g. `prod-eu-1`; must be unique among all clusters. |
| `cublet_id` | string | Unique ID for this node (aka "cublet") within the cluster, used for addressing and leader selection. e.g. `scheduler`, `worker_0`; must be unique per cluster. |
| `license` | path | Absolute path to the Polars on-premises license file required to start the process. e.g. `/etc/polars/license.json`. |
| `memory_limit` | integer | Hard memory budget for all components in this cublet; enforced via cgroups when delegated. e.g. `1073741824` (1 GiB), `10737418240` (10 GiB). |

Example:

```toml
cluster_id = "polars-cluster-dev"
cublet_id = "scheduler"
license = "/etc/polars/license.json"
memory_limit = 1073741824 # 1 GiB
```

## `[scheduler]` section

For remote Polars queries without a specific output sink, Polars on-premises can automatically add a persistent sink. We call these sinks "anonymous results" sinks. Infrastructure-wise, these sinks are backed by S3-compatible storage accessible from all worker nodes and the Python client. The data written to this location is not automatically deleted, so you need to configure a retention policy for this data yourself.

You may configure the credentials using the options listed below; the key names correspond to the `storage_options` parameter from the `scan_parquet()` method (e.g. `aws_access_key_id`, `aws_secret_access_key`, `aws_session_token`, `aws_region`). We currently only support the AWS keys of the `storage_options` dictionary, but note that you can use any other cloud provider that supports the S3 API, such as MinIO or DigitalOcean Spaces.

| Key | Type | Description |
|-----|------|-------------|
| `enabled` | boolean | Whether the scheduler component runs in this process. `true` for the leader node, `false` on pure workers. |
| `allow_shared_disk` | boolean | Whether workers are allowed to write to a shared/local disk visible to the scheduler. `false` for fully remote/storage-only setups, `true` if you have a shared filesystem. |
| `n_workers` | integer | Expected number of workers in this cluster; the scheduler waits for all of them to be online before running queries. e.g. `4`. |
| `anonymous_result_dst` | string | Destination for results of queries that do not have an explicit sink. Currently supported are locally mounted paths (which must be reachable on the exact same path on every node and require `allow_shared_disk` to be enabled) and S3-based destinations. Either option must be network-reachable by the scheduler, the workers, and the client. e.g. `s3://bucket/path/to/key`, `file:///mnt/storage/polars/results`. |
| `anonymous_result_dst.s3` | object | Table of options used when anonymous results are backed by S3. |
| `anonymous_result_dst.s3.url` | string | S3 bucket URL. e.g. `s3://bucket/path/to/key`. |
| `anonymous_result_dst.s3.aws_endpoint_url` | string | Storage option configuration, see `scan_parquet()`. |
| `anonymous_result_dst.s3.aws_region` | string | Storage option configuration. e.g. `eu-east-1`. |
| `anonymous_result_dst.s3.aws_access_key_id` | string | Storage option configuration. |
| `anonymous_result_dst.s3.aws_secret_access_key` | string | Storage option configuration. |

Example:

```toml
[scheduler]
enabled = true
allow_shared_disk = false
n_workers = 4
anonymous_result_dst.s3.url = "s3://bucket/path/to/key"
anonymous_result_dst.s3.aws_secret_access_key = "YOURSECRETKEY"
anonymous_result_dst.s3.aws_access_key_id = "YOURACCESSKEY"
```

Example with a locally mounted disk as the anonymous result destination:

```toml
[scheduler]
enabled = true
allow_shared_disk = true
anonymous_result_dst = "file:///mnt/storage/polars/results"
```
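
When the anonymous result destination is a non-AWS, S3-compatible service such as MinIO, the endpoint can be pointed at that service via `aws_endpoint_url`. A minimal sketch, assuming a MinIO deployment at `http://minio.internal:9000` (the endpoint, bucket, and credentials are placeholders):

```toml
[scheduler]
enabled = true
allow_shared_disk = false
n_workers = 4
anonymous_result_dst.s3.url = "s3://bucket/path/to/key"
# Placeholder endpoint of an S3-compatible service (e.g. MinIO).
anonymous_result_dst.s3.aws_endpoint_url = "http://minio.internal:9000"
anonymous_result_dst.s3.aws_access_key_id = "YOURACCESSKEY"
anonymous_result_dst.s3.aws_secret_access_key = "YOURSECRETKEY"
```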

## `[worker]` section

During distributed query execution, data may be shuffled between workers. A local path can be provided, but shuffles can also be configured to use S3-compatible storage (accessible from all worker nodes). You may configure the credentials using the options listed below; the key names correspond to the `storage_options` parameter from the `scan_parquet()` method (e.g. `aws_access_key_id`, `aws_secret_access_key`, `aws_session_token`, `aws_region`).

| Key | Type | Description |
|-----|------|-------------|
| `enabled` | boolean | Whether the worker component runs in this process. `true` on worker nodes, `false` on the dedicated scheduler. |
| `worker_ip` | string | Public or routable IP address other workers/scheduler use to reach this worker. e.g. `192.168.1.2`. |
| `flight_port` | integer | Port for shuffle traffic between workers. e.g. `5052`. |
| `service_port` | integer | Port on which the worker receives task instructions from the scheduler. e.g. `5053`. |
| `heartbeat_interval_secs` | integer | Interval for worker heartbeats towards the scheduler, used for liveness and load reporting. e.g. `5`. |
| `shuffle_location.local.path` | path | Local path where shuffle/intermediate data is stored; fast local SSD is recommended. e.g. `/mnt/storage/polars/shuffle`. |
| `shuffle_location.s3.url` | path | Destination for shuffle/intermediate data. e.g. `s3://bucket/path/to/key`. |
| `shuffle_location.s3.aws_endpoint_url` | string | Storage option configuration, see `scan_parquet()`. |
| `shuffle_location.s3.aws_region` | string | Storage option configuration. e.g. `eu-east-1`. |
| `shuffle_location.s3.aws_access_key_id` | string | Storage option configuration. |
| `shuffle_location.s3.aws_secret_access_key` | string | Storage option configuration. |

Example:

```toml
[worker]
enabled = true
worker_ip = "192.168.1.2"
flight_port = 5052
service_port = 5053
heartbeat_interval_secs = 5
shuffle_location.local.path = "/mnt/storage/polars/shuffle"
```
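
Shuffle data can also be written to S3-compatible storage instead of a local path. A minimal sketch using the `shuffle_location.s3.*` keys from the table above (bucket, region, and credentials are placeholders):

```toml
[worker]
enabled = true
worker_ip = "192.168.1.2"
flight_port = 5052
service_port = 5053
heartbeat_interval_secs = 5
# S3-backed shuffle location; all values below are placeholders.
shuffle_location.s3.url = "s3://bucket/path/to/key"
shuffle_location.s3.aws_region = "eu-east-1"
shuffle_location.s3.aws_access_key_id = "YOURACCESSKEY"
shuffle_location.s3.aws_secret_access_key = "YOURSECRETKEY"
```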

## `[observatory]` section

| Key | Type | Description |
|-----|------|-------------|
| `enabled` | boolean | Enable sending/receiving profiling data so clients can call `result.await_profile()`. `true` on both scheduler and workers if you want profiles on queries; `false` to disable. |
| `max_metrics_bytes_total` | integer | How many bytes all the worker host metrics may consume in total. If a system-wide memory limit is specified, this is added to the share that the scheduler takes. Note that worker host metrics are not yet available, so this option can be set to `0`. |

Example:

```toml
[observatory]
enabled = true
max_metrics_bytes_total = 0
```

## `[static_leader]` section

| Key | Type | Description |
|-----|------|-------------|
| `leader_key` | string | ID of the leader service; should match the scheduler's `cublet_id`. Typically `scheduler` to match your scheduler node. |
| `public_leader_addr` | string | Host/IP at which the leader is reachable from this node. e.g. `192.168.1.1`. |

Example:

```toml
[static_leader]
leader_key = "scheduler"
public_leader_addr = "192.168.1.1"
```
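
Putting the sections together, a config file for a dedicated scheduler node could look like the sketch below. It simply combines the per-section examples above; the `[worker]` block is disabled on the assumption that this node runs only the scheduler, and all IDs, paths, and addresses are placeholders:

```toml
cluster_id = "polars-cluster-dev"
cublet_id = "scheduler"
license = "/etc/polars/license.json"
memory_limit = 1073741824 # 1 GiB

[scheduler]
enabled = true
allow_shared_disk = true
n_workers = 4
anonymous_result_dst = "file:///mnt/storage/polars/results"

# Worker component disabled on the dedicated scheduler node.
[worker]
enabled = false

[observatory]
enabled = true
max_metrics_bytes_total = 0

[static_leader]
leader_key = "scheduler"
public_leader_addr = "192.168.1.1"
```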