CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutSign UpSign In
huggingface

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

GitHub Repository: huggingface/notebooks
Path: blob/main/course/videos/building_tokenizer.ipynb
Views: 2542
Kernel: Unknown Kernel

This notebook regroups the code sample of the video below, which is a part of the Hugging Face course.

#@title from IPython.display import HTML HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/MR8tZm5ViWU?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

Install the Transformers and Datasets libraries to run this notebook.

! pip install datasets transformers[sentencepiece]
from datasets import load_dataset dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train") def get_training_corpus(): for i in range(0, len(dataset), 1000): yield dataset[i : i + 1000]["text"]
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers, processors, decoders
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence( [ normalizers.Replace(Regex(r"[\p{Other}&&[^\n\t\r]]"), ""), normalizers.Replace(Regex(r"[\s]"), " "), normalizers.Lowercase(), normalizers.NFD(), normalizers.StripAccents()] )
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"] trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
cls_token_id = tokenizer.token_to_id("[CLS]") sep_token_id = tokenizer.token_to_id("[SEP]") tokenizer.post_processor = processors.TemplateProcessing( single=f"[CLS]:0 $A:0 [SEP]:0", pair=f"[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1", special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)], )
tokenizer.decoder = decoders.WordPiece(prefix="##")