CoCalc -- fast_tokenizers.ipynb

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

GitHub Repository: huggingface/notebooks
Path: blob/main/course/videos/fast_tokenizers.ipynb

This notebook collects the code samples from the video below, which is part of the Hugging Face course.

In [ ]:

#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/g8quOxoqhHQ?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

Install the Transformers and Datasets libraries to run this notebook.

In [ ]:

! pip install datasets transformers[sentencepiece]

In [ ]:

from datasets import load_dataset

raw_datasets = load_dataset("glue", "mnli")
raw_datasets

In [ ]:

from transformers import AutoTokenizer

# AutoTokenizer loads the fast (Rust-backed) tokenizer by default when one is available.
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_with_fast(examples):
    return fast_tokenizer(
        examples["premise"], examples["hypothesis"], truncation=True
    )

In [ ]:

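# use_fast=False forces the pure-Python ("slow") tokenizer.
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)

def tokenize_with_slow(examples):
    # Call the slow tokenizer here so the timings below compare the two implementations.
    return slow_tokenizer(
        examples["premise"], examples["hypothesis"], truncation=True
    )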

In [ ]:

%time tokenized_datasets = raw_datasets.map(tokenize_with_fast)

In [ ]:

%time tokenized_datasets = raw_datasets.map(tokenize_with_slow)
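
The `%time` magic used in these cells only works inside IPython or Jupyter. To reproduce the comparison in a plain Python script, a small helper along the following lines could be used. This is a minimal sketch: the name `timed` is hypothetical and not part of any library, and unlike `%time` it reports wall-clock time only.

```python
import time

def timed(label, fn, *args, **kwargs):
    """Call fn(*args, **kwargs), print the elapsed wall-clock time,
    and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result, elapsed

# Stand-in workload so the sketch runs on its own. With the notebook's objects
# this would be, for example:
#   timed("fast", raw_datasets.map, tokenize_with_fast)
total, seconds = timed(
    "sum of squares", lambda n: sum(i * i for i in range(n)), 100_000
)
```

Returning the elapsed time as well as the result makes it easy to collect several timings and compare them afterwards.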

In [ ]:

%time tokenized_datasets = raw_datasets.map(tokenize_with_fast, batched=True)

In [ ]:

%time tokenized_datasets = raw_datasets.map(tokenize_with_slow, batched=True)
