
GitHub Repository: huggingface/notebooks
Path: blob/main/course/fr/chapter7/section4_tf.ipynb
Kernel: Python 3

Translation (TensorFlow)

Install the 🤗 Datasets and 🤗 Transformers libraries to run this notebook.

!pip install datasets transformers[sentencepiece]
!apt install git-lfs

You will need to configure git; set your email and name in the following cell.

!git config --global user.email "you@example.com"
!git config --global user.name "Your Name"

You will also need to be logged in to the Hugging Face Hub. Run the following and enter your credentials.

from huggingface_hub import notebook_login

notebook_login()
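If the notebook_login() widget does not render in your environment, a token-based login is an alternative. This is a minimal sketch assuming you have created an access token on the Hub; the token value below is a placeholder.

# Alternative (sketch): log in with an access token created at https://huggingface.co/settings/tokens
# "hf_xxx" is a placeholder, not a real credential.
from huggingface_hub import login

login(token="hf_xxx")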
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("kde4", lang1="en", lang2="fr")
raw_datasets
split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
split_datasets
split_datasets["validation"] = split_datasets.pop("test")
split_datasets["train"][1]["translation"]
from transformers import pipeline

model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"
translator = pipeline("translation", model=model_checkpoint)
translator("Default to expanded threads")
split_datasets["train"][172]["translation"]
translator(
    "Unable to import %1 using the OFX importer plugin. This file is not the correct format."
)
from transformers import AutoTokenizer

model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, return_tensors="tf")
en_sentence = split_datasets["train"][1]["translation"]["en"]
fr_sentence = split_datasets["train"][1]["translation"]["fr"]

inputs = tokenizer(en_sentence)
with tokenizer.as_target_tokenizer():
    targets = tokenizer(fr_sentence)
wrong_targets = tokenizer(fr_sentence)
print(tokenizer.convert_ids_to_tokens(wrong_targets["input_ids"]))
print(tokenizer.convert_ids_to_tokens(targets["input_ids"]))
max_input_length = 128
max_target_length = 128


def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["fr"] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Set up the tokenizer for the targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
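Before mapping over the whole corpus, a quick check on a small slice can confirm the output format. This is an optional sketch, not part of the original notebook.

# Optional check: preprocess the first two training examples and inspect the result.
sample = preprocess_function(split_datasets["train"][:2])
print(sample.keys())
print(sample["labels"][0])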
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_pt=True)
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
batch.keys()
batch["labels"]
batch["decoder_input_ids"]
for i in range(1, 3):
    print(tokenized_datasets["train"][i]["labels"])
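To make the relationship between the labels and the decoder inputs concrete, you can decode both sequences for one example. This is an optional sketch; it assumes, as is the case for Marian models, that the decoder start token is the pad token.

# Optional check: decode the first label sequence and the corresponding decoder inputs.
# The -100 padding in the labels is replaced by the pad token before decoding.
import numpy as np

label_ids = np.where(batch["labels"][0] != -100, batch["labels"][0], tokenizer.pad_token_id)
print(tokenizer.decode(label_ids))
print(tokenizer.decode(batch["decoder_input_ids"][0]))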
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_datasets["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)
tf_eval_dataset = model.prepare_tf_dataset(
    tokenized_datasets["validation"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=16,
)
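Optionally, you can pull a single batch to confirm the keys and shapes before training. This sketch assumes, as in the evaluation loop further below, that prepare_tf_dataset yields (features, labels) pairs.

# Optional check: inspect one training batch.
for features, labels in tf_train_dataset.take(1):
    print({name: tensor.shape for name, tensor in features.items()})
    print("labels:", labels.shape)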
!pip install sacrebleu
from datasets import load_metric

metric = load_metric("sacrebleu")
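Note that load_metric is deprecated in recent versions of 🤗 Datasets; if it is unavailable in your environment, the same metric can be loaded through the 🤗 Evaluate library. This is an alternative sketch and assumes evaluate has been installed with pip.

# Alternative (sketch): load SacreBLEU through the evaluate library instead of datasets.load_metric.
import evaluate

metric = evaluate.load("sacrebleu")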
predictions = [
    "This plugin lets you translate web pages between several languages automatically."
]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)
predictions = ["This This This This"]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)
predictions = ["This plugin"]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)
import numpy as np
import tensorflow as tf
from tqdm import tqdm

generation_data_collator = DataCollatorForSeq2Seq(
    tokenizer, model=model, return_tensors="tf", pad_to_multiple_of=128
)

tf_generate_dataset = model.prepare_tf_dataset(
    tokenized_datasets["validation"],
    collate_fn=generation_data_collator,
    shuffle=False,
    batch_size=8,
)


@tf.function(jit_compile=True)
def generate_with_xla(batch):
    return model.generate(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        max_new_tokens=128,
    )


def compute_metrics():
    all_preds = []
    all_labels = []

    for batch, labels in tqdm(tf_generate_dataset):
        predictions = generate_with_xla(batch)
        decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
        labels = labels.numpy()
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
        decoded_preds = [pred.strip() for pred in decoded_preds]
        decoded_labels = [[label.strip()] for label in decoded_labels]
        all_preds.extend(decoded_preds)
        all_labels.extend(decoded_labels)

    result = metric.compute(predictions=all_preds, references=all_labels)
    return {"bleu": result["score"]}
from huggingface_hub import notebook_login

notebook_login()
print(compute_metrics())
from transformers import create_optimizer
from transformers.keras_callbacks import PushToHubCallback
import tensorflow as tf

# The number of training steps is the number of samples in the dataset, divided by the batch size,
# then multiplied by the total number of epochs. Note that tf_train_dataset here is a batched
# tf.data.Dataset, not the original Hugging Face dataset, so its len() is already num_samples // batch_size.
num_epochs = 3
num_train_steps = len(tf_train_dataset) * num_epochs

optimizer, schedule = create_optimizer(
    init_lr=5e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)
model.compile(optimizer=optimizer)

# Train in mixed-precision float16
tf.keras.mixed_precision.set_global_policy("mixed_float16")
from transformers.keras_callbacks import PushToHubCallback

callback = PushToHubCallback(
    output_dir="marian-finetuned-kde4-en-to-fr", tokenizer=tokenizer
)

model.fit(
    tf_train_dataset,
    validation_data=tf_eval_dataset,
    callbacks=[callback],
    epochs=num_epochs,
)
print(compute_metrics())
from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "huggingface-course/marian-finetuned-kde4-en-to-fr"
translator = pipeline("translation", model=model_checkpoint)
translator("Default to expanded threads")
translator(
    "Unable to import %1 using the OFX importer plugin. This file is not the correct format."
)