GitHub Repository: huggingface/notebooks
Path: blob/main/course/videos/datasets_and_dataframes.ipynb

This notebook collects the code samples from the video below, which is part of the Hugging Face course.

#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/tfcY1067A5Q?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

Install the Transformers and Datasets libraries to run this notebook.

! pip install datasets transformers[sentencepiece]
from datasets import load_dataset

dataset = load_dataset("swiss_judgment_prediction", "all_languages", split="train")
dataset[0]

# Convert the output format to pandas.DataFrame
dataset.set_format("pandas")
dataset[0]

dataset.__getitem__(0)
dataset.set_format("pandas")
dataset.__getitem__(0)

df = dataset.to_pandas()
df.head()

# How are languages distributed across regions?
df.groupby("region")["language"].value_counts()

# Which legal area is most common?
df["legal area"].value_counts()

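To see what the two queries above return, here is a small sketch on a hand-made DataFrame. The rows in `toy_df` are invented for illustration and do not come from the dataset; only the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the real dataset
toy_df = pd.DataFrame({
    "region": ["Zurich", "Zurich", "Zurich", "Ticino", "Ticino"],
    "language": ["de", "de", "fr", "it", "it"],
    "legal area": ["penal law", "civil law", "penal law", "penal law", "public law"],
})

# Series indexed by (region, language) pairs, holding the count of each pair
lang_by_region = toy_df.groupby("region")["language"].value_counts()

# value_counts() sorts by frequency, so idxmax() gives the most common label
top_area = toy_df["legal area"].value_counts().idxmax()

print(lang_by_region)
print(top_area)
```
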
from transformers import AutoTokenizer

# Load a pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the `text` column
# (expected to fail here, because the output format is still set to "pandas")
dataset.map(lambda x: tokenizer(x["text"]))

# Reset back to Arrow format
dataset.reset_format()

# Now we can tokenize!
dataset.map(lambda x: tokenizer(x["text"]))