GitHub Repository: huggingface/notebooks
Path: blob/main/course/videos/slice_and_dice.ipynb

This notebook collects the code samples from the video below, which is part of the Hugging Face course.

#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/tqfSFcPMgOI?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

Install the Transformers and Datasets libraries to run this notebook.

! pip install datasets transformers[sentencepiece]
# Load the SQuAD training split and look at the first example
from datasets import load_dataset

squad = load_dataset("squad", split="train")
squad[0]
# Shuffle with a fixed seed so the order is reproducible
squad_shuffled = squad.shuffle(seed=666)
squad_shuffled[0]
# Hold out 10% of the rows as a test split
dataset = squad.train_test_split(test_size=0.1)
dataset
# Pick specific rows by position
indices = [0, 10, 20, 40, 80]
examples = squad.select(indices)
examples
# Draw a random sample of 5 rows
sample = squad.shuffle().select(range(5))
sample
# Keep only the rows whose title starts with "L"
squad_filtered = squad.filter(lambda x: x["title"].startswith("L"))
squad_filtered[0]
squad.rename_column("context", "passages")
squad.remove_columns(["id", "title"])
squad
squad.flatten()
def lowercase_title(example):
    return {"title": example["title"].lower()}

squad_lowercase = squad.map(lowercase_title)
# Peek at random sample
squad_lowercase.shuffle(seed=42)["title"][:5]
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_title(example):
    # With batched=True, example["title"] is a list of up to 500 titles
    return tokenizer(example["title"])

squad.map(tokenize_title, batched=True, batch_size=500)