CoCalc -- datasets_and_dataframes.ipynb

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

GitHub Repository: huggingface/notebooks
Path: blob/main/course/videos/datasets_and_dataframes.ipynb

This notebook collects the code samples from the video below, which is part of the Hugging Face course.

In [ ]:

#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/tfcY1067A5Q?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

Install the Transformers and Datasets libraries to run this notebook.

In [ ]:

! pip install datasets transformers[sentencepiece]

In [ ]:

from datasets import load_dataset

dataset = load_dataset("swiss_judgment_prediction", "all_languages", split="train")
dataset[0]

In [ ]:

# Convert the output format to pandas.DataFrame
dataset.set_format("pandas")
dataset[0]

In [ ]:

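# `set_format` changes what `__getitem__` returns, not the underlying data.
# Start from the default format so the difference is visible.
dataset.reset_format()
dataset.__getitem__(0)  # plain Python dict

dataset.set_format("pandas")

dataset.__getitem__(0)  # pandas.DataFrame with one row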
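
The cell above relies on the idea that switching the format only changes what `__getitem__` returns, while the stored data stays the same. A minimal sketch of that idea follows. `ToyDataset` and its `"tuple"` format are hypothetical names invented for illustration; this is not how the `datasets` library is implemented.

```python
class ToyDataset:
    """Hypothetical stand-in for illustration; not the real datasets.Dataset."""

    def __init__(self, columns):
        self._columns = columns  # column name -> list of values
        self._format = None      # None means "return a plain dict"

    def set_format(self, fmt):
        # Only the output format changes; the stored columns are untouched.
        self._format = fmt

    def reset_format(self):
        self._format = None

    def __getitem__(self, idx):
        row = {name: values[idx] for name, values in self._columns.items()}
        if self._format == "tuple":  # made-up format, used only in this sketch
            return tuple(row.values())
        return row


toy = ToyDataset({"text": ["hello", "world"], "label": [0, 1]})

before = toy[0]          # dict, the default format
toy.set_format("tuple")
after = toy[0]           # same row, different container
toy.reset_format()
restored = toy[0]        # dict again
```

The same row comes back in a different container each time, which is why resetting the format is enough to undo the change.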

In [ ]:

df = dataset.to_pandas()
df.head()

In [ ]:

# How are languages distributed across regions?
df.groupby("region")["language"].value_counts()

# Which legal area is most common?
df["legal area"].value_counts()
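
To make the shape of the `groupby(...).value_counts()` result easier to read, here is a small sketch on an invented DataFrame. The rows below are made up and are not taken from the dataset; only the column names `region` and `language` come from the cell above.

```python
import pandas as pd

# Invented toy rows; the real dataset has many more regions and examples.
toy_df = pd.DataFrame(
    {
        "region": ["Zurich", "Zurich", "Zurich", "Ticino", "Ticino"],
        "language": ["de", "de", "fr", "it", "it"],
    }
)

# The result is a Series indexed by (region, language) pairs.
counts = toy_df.groupby("region")["language"].value_counts()
zurich_de = counts.loc[("Zurich", "de")]

# Pivot the inner index level into columns for a region-by-language table.
table = counts.unstack(fill_value=0)
```

`unstack` is optional, but a wide table is often easier to scan than a two-level index when there are only a few languages.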

In [ ]:

from transformers import AutoTokenizer

# Load a pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
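# Try to tokenize the `text` column. While the format is still "pandas",
# each example reaches the tokenizer as a pandas object, so this call is expected to fail.
dataset.map(lambda x: tokenizer(x["text"]))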

In [ ]:

# Reset back to Arrow format
dataset.reset_format()
# Now we can tokenize!
dataset.map(lambda x: tokenizer(x["text"]))
