GitHub Repository: pola-rs/polars
Path: blob/main/docs/source/polars-on-premises/bare-metal/configuration/reference.md

Config file reference

This page describes the configuration options for Polars on-premises. The config file is a standard TOML file divided into sections. Any configuration option can be overridden using an environment variable in the following format: `PC_CUBLET__section_name__key`.
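
As a minimal sketch of the override format above, the fragment below sets one key in the config file and shows, in a comment, the environment variable that would correspond to it under the `PC_CUBLET__section_name__key` pattern. The casing of the section and key parts in the variable name is an assumption; this page does not spell it out, nor how nested keys are encoded.

```toml
# Config file value:
[scheduler]
n_workers = 4

# Assumed equivalent environment variable (pattern: PC_CUBLET__section_name__key):
#   PC_CUBLET__scheduler__n_workers=4
```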

Example configuration files can be found at Example Configurations.

See the sidebar for detailed documentation on the main components and how to configure them together.

Top-level configuration

| Key | Type | Description |
| --- | --- | --- |
| `cluster_id` | string | Logical ID for the cluster; workers and scheduler that share this ID will form a single cluster. E.g. `prod-eu-1`; must be unique among all clusters. |
| `instance_id` | string | Unique ID for this node within the cluster, used for addressing and leader selection. E.g. `scheduler`, `worker_0`; must be unique per cluster. |
| `license` | path | Absolute path to the Polars on-premises license file required to start the process. E.g. `/etc/polars/license.json`. |
| `memory_limit` | integer | Hard memory budget for all components in this node; enforced via cgroups when delegated. E.g. `1073741824` (1 GiB), `10737418240` (10 GiB). |
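
A top-level block built from the example values in the table above might look like the following sketch. The values are illustrative, and the license path is the table's example rather than a required location.

```toml
# Top-level keys go before any [section] header.
cluster_id = "prod-eu-1"              # shared by every node in the cluster
instance_id = "scheduler"             # unique per node within the cluster
license = "/etc/polars/license.json"  # absolute path to the license file
memory_limit = 10737418240            # 10 GiB, expressed in bytes
```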

[scheduler] section

| Key | Type | Description |
| --- | --- | --- |
| `enabled` | boolean | Whether the scheduler component runs in this process. `true` for the leader node, `false` on pure workers. |
| `allow_local_sinks` | boolean | Whether workers are allowed to write to a shared/local disk visible to the scheduler. `false` for fully remote/storage-only setups, `true` if you have a shared filesystem. |
| `n_workers` | integer | Expected number of workers in this cluster; the scheduler waits for this many workers to be online before running queries. E.g. `4`. |
| `anonymous_result_location` | object | Destination for results of queries that do not have an explicit sink. Two kinds of destination are currently supported: a locally mounted path (must be reachable on the exact same path on every node, and requires `allow_local_sinks` to be enabled) and an S3-based location. Either option must be reachable over the network by the scheduler, the workers, and the client. E.g. `/mnt/storage/polars/results` or `s3://bucket/path/to/key`. |
| `anonymous_result_location.local` | object | Object used for local disk-backed anonymous results. |
| `anonymous_result_location.local.path` | path | Local path where anonymous results are stored. E.g. `/mnt/storage/polars/results`. |
| `anonymous_result_location.s3` | object | Object used for S3-backed anonymous results. |
| `anonymous_result_location.s3.url` | string | S3 bucket URL. E.g. `s3://bucket/path/to/key`. |
| `anonymous_result_location.s3.aws_endpoint_url` | string | Storage option configuration, see `scan_parquet()`. |
| `anonymous_result_location.s3.aws_region` | string | Storage option configuration. E.g. `eu-east-1`. |
| `anonymous_result_location.s3.aws_access_key_id` | string | Storage option configuration. |
| `anonymous_result_location.s3.aws_secret_access_key` | string | Storage option configuration. |
| `client_service` | object | Object used for configuring the bind address of the client service. This is the service used by the polars-cloud Python client. Defaults to `0.0.0.0:5051`. |
| `client_service.bind_addr` | string | Bind address for the client service. E.g. `0.0.0.0:5051`. |
| `client_service.bind_addr.ip` | string | IP address for the client service bind address. E.g. `192.168.1.1`. |
| `client_service.bind_addr.port` | integer | Port for the client service bind address. E.g. `5051`. |
| `client_service.bind_addr.hostname` | string | Alternative to `ip`, resolved once at startup. E.g. `my-host-1`. |
| `worker_service` | object | Object used for configuring the bind address of the worker service. This is an internal service used by the workers. Defaults to `0.0.0.0:5050`. |
| `worker_service.bind_addr` | string | Bind address for the worker service. E.g. `0.0.0.0:5050`. |
| `worker_service.bind_addr.ip` | string | IP address for the worker service bind address. E.g. `192.168.1.1`. |
| `worker_service.bind_addr.port` | integer | Port for the worker service bind address. E.g. `5050`. |
| `worker_service.bind_addr.hostname` | string | Alternative to `ip`, resolved once at startup. E.g. `my-host-2`. |
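
The keys above could be combined into a scheduler block along the following lines. This is a sketch under assumptions: it writes nested keys with TOML dotted-key syntax and uses the string form of `bind_addr`. Both follow from the key names and types listed above, but the exact layout accepted by Polars on-premises should be checked against the example configurations.

```toml
[scheduler]
enabled = true
allow_local_sinks = true   # needed for the local result location below
n_workers = 4

# Local, shared-path result location (same path on every node).
anonymous_result_location.local.path = "/mnt/storage/polars/results"

# S3-backed alternative, shown commented out:
# anonymous_result_location.s3.url = "s3://bucket/path/to/key"
# anonymous_result_location.s3.aws_region = "eu-east-1"

# Bind addresses, written out with their documented defaults.
client_service.bind_addr = "0.0.0.0:5051"
worker_service.bind_addr = "0.0.0.0:5050"
```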

[worker] section

| Key | Type | Description |
| --- | --- | --- |
| `enabled` | boolean | Whether the worker component runs in this process. `true` on worker nodes, `false` on the dedicated scheduler. |
| `heartbeat_period` | string | Interval for worker heartbeats towards the scheduler, used for liveness and load reporting. Either an ISO 8601 duration or a jiff "friendly" duration (see https://docs.rs/jiff/0.2.18/jiff/fmt/friendly/). E.g. `5 secs` or `PT5S`. |
| `shuffle_location` | object | Object used for shuffle data storage. |
| `shuffle_location.local` | object | Object used for local disk-backed shuffle data storage. |
| `shuffle_location.local.path` | path | Local path where shuffle/intermediate data is stored; fast local SSD is recommended. E.g. `/mnt/storage/polars/shuffle`. |
| `shuffle_location.shared_filesystem` | object | Object used for shared filesystem-backed shuffle data storage. |
| `shuffle_location.shared_filesystem.path` | path | Shared filesystem path where shuffle/intermediate data is stored. Must be accessible by all workers on the same path. E.g. `/mnt/storage/polars/shuffle`. |
| `shuffle_location.s3` | object | Object used for S3-backed shuffle data storage. |
| `shuffle_location.s3.url` | path | Destination for shuffle/intermediate data. E.g. `s3://bucket/path/to/key`. |
| `shuffle_location.s3.aws_endpoint_url` | string | Storage option configuration, see `scan_parquet()`. |
| `shuffle_location.s3.aws_region` | string | Storage option configuration. E.g. `eu-east-1`. |
| `shuffle_location.s3.aws_access_key_id` | string | Storage option configuration. |
| `shuffle_location.s3.aws_secret_access_key` | string | Storage option configuration. |
| `task_service` | object | Object used for configuring the bind address of the task service. This is an internal service in the worker for receiving tasks from the scheduler. Defaults to `0.0.0.0:5052`. |
| `task_service.bind_addr` | string | Bind address for the task service. E.g. `0.0.0.0:5052`. |
| `task_service.bind_addr.ip` | string | IP address for the task service bind address. E.g. `192.168.1.1`. |
| `task_service.bind_addr.port` | integer | Port for the task service bind address. E.g. `5052`. |
| `task_service.bind_addr.hostname` | string | Alternative to `ip`, resolved once at startup. E.g. `my-host-2`. |
| `task_service.public_addr` | string | Address at which this service is reachable by the scheduler. Defaults to the bind address if not set. This field is required when the bind address is `0.0.0.0`. E.g. `192.168.1.1`. |
| `task_service.public_addr.ip` | string | IP address for the task service public address. E.g. `192.168.1.2`. |
| `task_service.public_addr.port` | integer | Port for the task service public address. E.g. `5052`. |
| `task_service.public_addr.hostname` | string | Alternative to `ip`, resolved once at startup. E.g. `my-host-2`. |
| `shuffle_service` | object | Object used for configuring the bind address of the shuffle service. This is an internal service in the worker for shuffle data. Defaults to `0.0.0.0:5053`. |
| `shuffle_service.bind_addr` | string | Bind address for the shuffle service. E.g. `0.0.0.0:5053`. |
| `shuffle_service.bind_addr.ip` | string | IP address for the shuffle service bind address. E.g. `192.168.1.1`. |
| `shuffle_service.bind_addr.port` | integer | Port for the shuffle service bind address. E.g. `5053`. |
| `shuffle_service.bind_addr.hostname` | string | Alternative to `ip`, resolved once at startup. E.g. `my-host-2`. |
| `shuffle_service.public_addr` | string | Address at which this service is reachable by the scheduler. Defaults to the bind address if not set. This field is required when the bind address is `0.0.0.0`. E.g. `192.168.1.1`. |
| `shuffle_service.public_addr.ip` | string | IP address for the shuffle service public address. E.g. `192.168.1.2`. |
| `shuffle_service.public_addr.port` | integer | Port for the shuffle service public address. E.g. `5053`. |
| `shuffle_service.public_addr.hostname` | string | Alternative to `ip`, resolved once at startup. E.g. `my-host-2`. |
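
Because `public_addr` is required whenever a service binds to `0.0.0.0`, a worker block typically sets both addresses. The sketch below uses the example values from the table. Writing `public_addr` as an inline table with `ip` and `port` is an assumption based on the sub-keys listed above, as is the dotted-key layout.

```toml
[worker]
enabled = true
heartbeat_period = "5 secs"   # ISO 8601 form would be "PT5S"

# Local shuffle storage; fast local SSD is recommended.
shuffle_location.local.path = "/mnt/storage/polars/shuffle"

# Task service: listen on all interfaces, advertise a routable address.
task_service.bind_addr = "0.0.0.0:5052"
task_service.public_addr = { ip = "192.168.1.2", port = 5052 }

# Shuffle service: same pattern on its own port.
shuffle_service.bind_addr = "0.0.0.0:5053"
shuffle_service.public_addr = { ip = "192.168.1.2", port = 5053 }
```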

[observatory] section

| Key | Type | Description |
| --- | --- | --- |
| `enabled` | boolean | Enable sending/receiving profiling data so clients can call `result.await_profile()`. `true` on both scheduler and workers if you want profiles on queries; `false` to disable. |
| `max_metrics_bytes_total` | integer | How many bytes all the worker host metrics will consume in total. If a system-wide memory limit is specified, this amount is added to the share that the scheduler takes. For every worker, about 50 bytes of metrics are stored per second. |
| `database_path` | string | Location to use for storing profiling data. An SQLite database file will be created here, or opened if a file already exists. If this points to a directory, a file will be created in that directory. Polars on-premises automatically adds the `cluster_id` to this file name to ensure uniqueness within the directory. |
| `service` | object | Object used for configuring the bind address of the observatory service. This is an internal service in the scheduler for receiving profiling data from all nodes. Defaults to `0.0.0.0:5049`. |
| `service.bind_addr` | string | Bind address for the observatory service. E.g. `0.0.0.0:5049`. |
| `service.bind_addr.ip` | string | IP address for the observatory service bind address. E.g. `192.168.1.1`. |
| `service.bind_addr.port` | integer | Port for the observatory service bind address. E.g. `5049`. |
| `service.bind_addr.hostname` | string | Alternative to `ip`, resolved once at startup. E.g. `my-host-2`. |
| `rest_api.enabled` | boolean | Whether to expose the observatory REST API; enabled by default. This is a public service for accessing the profiling data and host metrics data through a web interface. |
| `rest_api.service` | object | Object used for configuring the bind address of the observatory REST API service. Defaults to `0.0.0.0:3001`. |
| `rest_api.service.bind_addr` | string | Bind address for the observatory REST API service. E.g. `0.0.0.0:3001`. |
| `rest_api.service.bind_addr.ip` | string | IP address for the observatory REST API service bind address. E.g. `192.168.1.1`. |
| `rest_api.service.bind_addr.port` | integer | Port for the observatory REST API service bind address. E.g. `3001`. |
| `rest_api.service.bind_addr.hostname` | string | Alternative to `ip`, resolved once at startup. E.g. `my-host-2`. |
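
An observatory block on the scheduler node might look like the sketch below. The metrics budget and the database directory are hypothetical values chosen for illustration, and the dotted-key layout is an assumption. As a rough sizing aid, at about 50 bytes per worker per second, 4 workers produce roughly 17 MB of metrics per day (50 × 4 × 86400 bytes).

```toml
[observatory]
enabled = true
max_metrics_bytes_total = 104857600            # 100 MiB, illustrative budget
database_path = "/var/lib/polars/observatory"  # hypothetical directory

# Internal service receiving profiling data (documented default).
service.bind_addr = "0.0.0.0:5049"

# Public REST API for profiles and host metrics (documented default).
rest_api.enabled = true
rest_api.service.bind_addr = "0.0.0.0:3001"
```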

[monitoring] section

| Key | Type | Description |
| --- | --- | --- |
| `enabled` | boolean | Enable sending/receiving monitoring data to the observatory service. If enabled, it uses the address specified in `observatory_service.public_addr` (see the `[static_leader]` section). |
| `host_metrics` | object | Object used for configuring the host metrics exporter. |
| `host_metrics.enabled` | boolean | Enable/disable exporting host metrics from this node. |
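
A minimal monitoring block, assuming dotted-key syntax for the nested `host_metrics` object, could be:

```toml
[monitoring]
enabled = true               # send monitoring data to the observatory service
host_metrics.enabled = true  # also export host metrics from this node
```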

[static_leader] section

| Key | Type | Description |
| --- | --- | --- |
| `leader_instance_id` | string | ID of the leader node; should match the scheduler’s `instance_id`. Typically `scheduler` to match your scheduler node. |
| `scheduler_service.public_addr` | string | Address at which the scheduler client service is reachable from this node. E.g. `192.168.1.1`. |
| `scheduler_service.public_addr.ip` | string | IP address for the scheduler client service public address. E.g. `192.168.1.1`. |
| `scheduler_service.public_addr.port` | integer | Port for the scheduler client service public address. E.g. `5051`. |
| `scheduler_service.public_addr.hostname` | string | Alternative to `ip`, resolved once at startup. E.g. `my-host-2`. |
| `observatory_service.public_addr` | string | Address at which the observatory service is reachable from this node. E.g. `192.168.1.1`. |
| `observatory_service.public_addr.ip` | string | IP address for the observatory service public address. E.g. `192.168.1.1`. |
| `observatory_service.public_addr.port` | integer | Port for the observatory service public address. E.g. `5049`. |
| `observatory_service.public_addr.hostname` | string | Alternative to `ip`, resolved once at startup. E.g. `my-host-2`. |
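
On a worker node, the static leader block points back at the scheduler. The sketch below uses the example addresses and ports from the table. The inline-table form with `ip` and `port` is an assumption derived from the sub-keys listed above.

```toml
[static_leader]
leader_instance_id = "scheduler"  # must match the scheduler's instance_id

# Where this node can reach the scheduler and the observatory service.
scheduler_service.public_addr = { ip = "192.168.1.1", port = 5051 }
observatory_service.public_addr = { ip = "192.168.1.1", port = 5049 }
```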