GitHub Repository: pola-rs/polars
Path: blob/main/docs/source/polars-cloud/public-datasets.md

Public datasets

Start experimenting with Polars Cloud immediately using our curated public datasets. These datasets span different scale factors, letting you test performance across various data sizes—from small exploratory queries to large-scale processing workloads.

Available datasets

PDSH (derived from the TPC-H benchmark): standard analytical queries for testing joins, aggregations, and filtering operations. The queries are available in the Polars benchmark repository.

PDSDS (derived from the TPC-DS benchmark): a decision support dataset designed for complex analytical workloads.

NYC Taxi (source: NYC.gov): real-world transportation data with temporal patterns and geospatial dimensions.

Usage

Access any dataset directly from your Polars code and execute in Polars Cloud:

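    import polars as pl

    # Fill in {dataset} and {scale_factor/year} from the URL patterns under "Dataset URLs".
    data = pl.scan_parquet(
        "s3://polars-cloud-samples-us-east-2-prd/{dataset}/{scale_factor/year}/",
        storage_options={"request_payer": "true"},
    )

    # `ctx` is a Polars Cloud compute context that you have already created.
    query = data.select().remote(ctx).execute()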

Note: These buckets use AWS Requester Pays, meaning you pay only for the requests and for the data you download from the bucket. The storage costs are covered.

Dataset URLs

All datasets are hosted in AWS region us-east-2 and use Requester Pays buckets.

PDSH (TPC-H derived)

| Scale Factor | Size | URL Pattern | Format |
| --- | --- | --- | --- |
| SF10 | ~10 GB | `s3://polars-cloud-samples-us-east-2-prd/pdsh/sf10/{filename}.parquet` | Single files |
| SF100 | ~100 GB | `s3://polars-cloud-samples-us-east-2-prd/pdsh/sf100/{table}/*.parquet` | Partitioned |
| SF1000 | ~1 TB | `s3://polars-cloud-samples-us-east-2-prd/pdsh/sf1000/{table}/*.parquet` | Partitioned |

Example

    # SF10: one Parquet file per table
    data = pl.scan_parquet(
        "s3://polars-cloud-samples-us-east-2-prd/pdsh/sf10/lineitem.parquet",
        storage_options={"request_payer": "true"},
    )

    # SF100: each table is a directory of Parquet files, matched with a glob
    partitioned_data = pl.scan_parquet(
        "s3://polars-cloud-samples-us-east-2-prd/pdsh/sf100/lineitem/*.parquet",
        storage_options={"request_payer": "true"},
    )
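Because the PDSH layout differs between scale factors, a small helper can keep the URL logic in one place. This is a minimal sketch: `pdsh_url` and the constants are hypothetical names, and it assumes that `{filename}` in the SF10 pattern is the table name, as in the `lineitem.parquet` example.

```python
BUCKET = "s3://polars-cloud-samples-us-east-2-prd"

# Layout per scale factor, following the PDSH URL patterns.
SINGLE_FILE_SCALE_FACTORS = {10}
PARTITIONED_SCALE_FACTORS = {100, 1000}


def pdsh_url(scale_factor: int, table: str) -> str:
    """Return the S3 path (or glob) for one PDSH table at a given scale factor."""
    if scale_factor in SINGLE_FILE_SCALE_FACTORS:
        # One Parquet file per table.
        return f"{BUCKET}/pdsh/sf{scale_factor}/{table}.parquet"
    if scale_factor in PARTITIONED_SCALE_FACTORS:
        # A directory of Parquet files per table, matched with a glob.
        return f"{BUCKET}/pdsh/sf{scale_factor}/{table}/*.parquet"
    raise ValueError(f"unsupported PDSH scale factor: {scale_factor}")


print(pdsh_url(10, "lineitem"))
print(pdsh_url(100, "lineitem"))
```

The returned string can be passed straight to `pl.scan_parquet` together with `storage_options={"request_payer": "true"}`.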

PDSDS (TPC-DS derived)

| Scale Factor | Size | URL Pattern |
| --- | --- | --- |
| SF1 | ~1 GB | `s3://polars-cloud-samples-us-east-2-prd/pdsds/sf1/{filename}.parquet` |
| SF10 | ~10 GB | `s3://polars-cloud-samples-us-east-2-prd/pdsds/sf10/{filename}.parquet` |
| SF100 | ~100 GB | `s3://polars-cloud-samples-us-east-2-prd/pdsds/sf100/{filename}.parquet` |
| SF300 | ~300 GB | `s3://polars-cloud-samples-us-east-2-prd/pdsds/sf300/{filename}.parquet` |

Example

    # PDSDS files live under the pdsds/ prefix, as in the URL patterns for this dataset
    data = pl.scan_parquet(
        "s3://polars-cloud-samples-us-east-2-prd/pdsds/sf10/store_sales.parquet",
        storage_options={"request_payer": "true"},
    )

NYC Taxi

| Year | URL Pattern |
| --- | --- |
| 2023 | `s3://polars-cloud-samples-us-east-2-prd/taxi/2023/{filename}.parquet` |
| 2024 | `s3://polars-cloud-samples-us-east-2-prd/taxi/2024/{filename}.parquet` |

Example

    data = pl.scan_parquet(
        "s3://polars-cloud-samples-us-east-2-prd/taxi/2024/yellow_tripdata_2024-01.parquet",
        storage_options={"request_payer": "true"},
    )
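To scan several months at once, the per-month file names can be generated rather than typed out. The sketch below assumes every month follows the `yellow_tripdata_{year}-{month}.parquet` naming seen in the example above; only the January 2024 file name is confirmed by this page, and `taxi_urls` is a hypothetical helper name.

```python
BUCKET = "s3://polars-cloud-samples-us-east-2-prd"
AVAILABLE_YEARS = {2023, 2024}


def taxi_urls(year: int, months=range(1, 13)) -> list[str]:
    """Build yellow-taxi file URLs for the given year and months.

    Assumes one file per month, named like the January 2024 example.
    """
    if year not in AVAILABLE_YEARS:
        raise ValueError(f"no taxi data listed for year {year}")
    return [
        f"{BUCKET}/taxi/{year}/yellow_tripdata_{year}-{month:02d}.parquet"
        for month in months
    ]


first_quarter = taxi_urls(2024, months=[1, 2, 3])
for url in first_quarter:
    print(url)
```

`pl.scan_parquet` accepts a list of paths, so `first_quarter` can be passed to it directly along with `storage_options={"request_payer": "true"}`.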