GitHub Repository: pola-rs/polars
Path: blob/main/docs/source/user-guide/io/hugging-face.md

Hugging Face

Scanning datasets from Hugging Face

All cloud-enabled scan functions and their read_ counterparts transparently support scanning from Hugging Face. The examples on this page use scan_csv, scan_ndjson, scan_parquet and scan_ipc.

Path format

To scan from Hugging Face, pass an hf:// path to the scan functions. The hf:// path format is hf://BUCKET/REPOSITORY@REVISION/PATH, where:

  • BUCKET is one of datasets or spaces

  • REPOSITORY is the location of the repository, usually in the format username/repo_name. A branch can optionally be specified by appending @branch (see REVISION).

  • REVISION is the name of the branch (or commit) to use. This is optional and defaults to main if not given.

  • PATH is a file or directory path, or a glob pattern from the repository root.
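The components above can be pulled apart with a small regular expression. This is a minimal sketch for illustration only, not how Polars parses these paths internally. The names `HfPath` and `parse_hf_path` are hypothetical, and the pattern assumes that the repository always has the username/repo_name form and that the revision contains no `/`.

```python
import re
from typing import NamedTuple


class HfPath(NamedTuple):
    """Components of an hf:// path (hypothetical helper type)."""

    bucket: str
    repository: str
    revision: str
    path: str


# hf://BUCKET/REPOSITORY[@REVISION]/PATH
# Assumption: REPOSITORY is "username/repo_name" and REVISION has no "/".
_HF_PATH = re.compile(
    r"^hf://(?P<bucket>datasets|spaces)"
    r"/(?P<repository>[^/@]+/[^/@]+)"
    r"(?:@(?P<revision>[^/]+))?"
    r"/(?P<path>.*)$"
)


def parse_hf_path(uri: str) -> HfPath:
    """Split an hf:// path into bucket, repository, revision and path."""
    match = _HF_PATH.match(uri)
    if match is None:
        raise ValueError(f"not a recognised hf:// path: {uri!r}")
    return HfPath(
        bucket=match["bucket"],
        repository=match["repository"],
        # REVISION is optional and defaults to main.
        revision=match["revision"] or "main",
        path=match["path"],
    )


if __name__ == "__main__":
    print(parse_hf_path("hf://datasets/nameexhaustion/polars-docs@foods/*.csv"))
```

Returning a named tuple keeps the four components addressable by name while still allowing tuple unpacking.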

Example hf:// paths:

| Path | Bucket | Repository | Branch | Path in repository |
| --- | --- | --- | --- | --- |
| hf://datasets/nameexhaustion/polars-docs/iris.csv | datasets | nameexhaustion/polars-docs | main | iris.csv |
| hf://datasets/nameexhaustion/polars-docs@foods/*.csv | datasets | nameexhaustion/polars-docs | foods | *.csv |
| hf://datasets/nameexhaustion/polars-docs/hive_dates/ | datasets | nameexhaustion/polars-docs | main | hive_dates/ |
| hf://spaces/nameexhaustion/polars-docs/orders.feather | spaces | nameexhaustion/polars-docs | main | orders.feather |
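Each hf:// path corresponds to a page on the Hugging Face website. The sketch below maps one to the other. The URL layout (`/blob/` for files, `/tree/` for directories) is an assumption inferred from the links in the Examples section of this page, and `hf_path_to_web_url` is a hypothetical helper, not part of Polars.

```python
def hf_path_to_web_url(uri: str) -> str:
    """Map an hf:// path to its assumed browser URL on huggingface.co."""
    prefix = "hf://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected an hf:// path, got {uri!r}")

    # BUCKET / username / repo_name[@REVISION] / PATH
    parts = uri[len(prefix):].split("/", 3)
    if len(parts) < 3:
        raise ValueError(f"incomplete hf:// path: {uri!r}")
    bucket, owner, repo_and_revision = parts[:3]
    path = parts[3] if len(parts) == 4 else ""

    # REVISION is optional and defaults to main.
    repo, _, revision = repo_and_revision.partition("@")
    revision = revision or "main"

    # A glob pattern matches many files, so it has no single web page.
    if any(ch in path for ch in "*?["):
        raise ValueError(f"glob patterns have no single web URL: {path!r}")

    # Assumption: directories live under /tree/, single files under /blob/.
    view = "tree" if path == "" or path.endswith("/") else "blob"
    return f"https://huggingface.co/{bucket}/{owner}/{repo}/{view}/{revision}/{path}"


if __name__ == "__main__":
    print(hf_path_to_web_url("hf://datasets/nameexhaustion/polars-docs/iris.csv"))
```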

Authentication

A Hugging Face API key can be passed to Polars to access private locations using either of the following methods:

  • Passing a token in storage_options to the scan function, e.g. scan_parquet(..., storage_options={'token': '<your HF token>'})

  • Setting the HF_TOKEN environment variable, e.g. export HF_TOKEN=<your HF token>
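The two methods can be combined in one small wrapper: pass an explicit token when one is given, otherwise pass nothing and leave the lookup of HF_TOKEN to Polars. This is a sketch under assumptions. `hf_storage_options` and `scan_private_parquet` are hypothetical helper names, and the repository path in the usage comment is a placeholder.

```python
from typing import Optional


def hf_storage_options(token: Optional[str] = None) -> Optional[dict]:
    """Build storage_options for an explicit token.

    Returns None when no token is given, so that the HF_TOKEN
    environment variable (if set) is the only credential in play.
    """
    if token:
        return {"token": token}
    return None


def scan_private_parquet(path: str, token: Optional[str] = None):
    """Lazily scan a Parquet file from a (possibly private) hf:// location."""
    # Imported here so that hf_storage_options has no third-party dependency.
    import polars as pl

    return pl.scan_parquet(path, storage_options=hf_storage_options(token))


# Usage (placeholder path and token, not a real repository):
# lf = scan_private_parquet(
#     "hf://datasets/your-username/your-private-repo/data.parquet",
#     token="<your HF token>",
# )
# print(lf.collect())
```

Passing the token explicitly is convenient when several accounts are used in one process; the environment variable keeps credentials out of the source code.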

Examples

CSV

--8<-- "python/user-guide/io/hugging-face.py:setup"

{{code_block('user-guide/io/hugging-face','scan_iris_csv',['scan_csv'])}}

--8<-- "python/user-guide/io/hugging-face.py:scan_iris_repr"

See this file at https://huggingface.co/datasets/nameexhaustion/polars-docs/blob/main/iris.csv

NDJSON

{{code_block('user-guide/io/hugging-face','scan_iris_ndjson',['scan_ndjson'])}}

--8<-- "python/user-guide/io/hugging-face.py:scan_iris_repr"

See this file at https://huggingface.co/datasets/nameexhaustion/polars-docs/blob/main/iris.jsonl

Parquet

{{code_block('user-guide/io/hugging-face','scan_parquet_hive_repr',['scan_parquet'])}}

--8<-- "python/user-guide/io/hugging-face.py:scan_parquet_hive_repr"

See this folder at https://huggingface.co/datasets/nameexhaustion/polars-docs/tree/main/hive_dates/

IPC

{{code_block('user-guide/io/hugging-face','scan_ipc',['scan_ipc'])}}

--8<-- "python/user-guide/io/hugging-face.py:scan_ipc_repr"

See this file at https://huggingface.co/spaces/nameexhaustion/polars-docs/blob/main/orders.feather
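The examples above pair each file type with one scan function. As a rough sketch, that pairing can be written as a lookup on the file extension. `scan_function_name` and `scan_hf` are hypothetical helpers, and the `.parquet` entry is an assumption, since the Parquet example on this page scans a directory rather than a single file. Paths without an extension, such as hive_dates/, are rejected so that the caller picks the scan function explicitly.

```python
from pathlib import PurePosixPath

# Extension -> Polars scan function, following the examples on this page.
# Assumption: ".parquet" is added here; the page only scans a Parquet directory.
_SCAN_BY_SUFFIX = {
    ".csv": "scan_csv",
    ".jsonl": "scan_ndjson",
    ".parquet": "scan_parquet",
    ".feather": "scan_ipc",
}


def scan_function_name(hf_path: str) -> str:
    """Return the name of the scan function matching the path's extension."""
    suffix = PurePosixPath(hf_path).suffix.lower()
    try:
        return _SCAN_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(
            f"cannot infer a scan function from {hf_path!r}; choose one explicitly"
        ) from None


def scan_hf(hf_path: str, **kwargs):
    """Lazily scan an hf:// path with the scan function inferred from its extension."""
    # Imported here so that scan_function_name works without Polars installed.
    import polars as pl

    return getattr(pl, scan_function_name(hf_path))(hf_path, **kwargs)


# Usage (requires Polars and network access):
# lf = scan_hf("hf://datasets/nameexhaustion/polars-docs/iris.csv")
# print(lf.head(5).collect())
```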