GitHub Repository: rasbt/machine-learning-book
Path: blob/main/ch16/bonus-distilbert-lightning-trainer/distilbert_finetuning-full.ipynb

Finetuning a DistilBERT Classifier Using the Lightning Trainer

# pip install transformers
# pip install datasets
# pip install lightning
# pip install watermark
%load_ext watermark
%watermark -p torch,transformers,datasets,lightning
torch       : 2.0.1+cu118
transformers: 4.33.2
datasets    : 2.14.5
lightning   : 2.0.9

1 Loading the Dataset

The IMDB movie review dataset consists of 50k movie reviews with sentiment labels (0: negative, 1: positive).

1a) Load from datasets Hub

from datasets import list_datasets, load_dataset
# list_datasets()
imdb_data = load_dataset("imdb")
print(imdb_data)
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
imdb_data["train"][99]
{'text': "This film is terrible. You don't really need to read this review further. If you are planning on watching it, suffice to say - don't (unless you are studying how not to make a good movie).<br /><br />The acting is horrendous... serious amateur hour. Throughout the movie I thought that it was interesting that they found someone who speaks and looks like Michael Madsen, only to find out that it is actually him! A new low even for him!!<br /><br />The plot is terrible. People who claim that it is original or good have probably never seen a decent movie before. Even by the standard of Hollywood action flicks, this is a terrible movie.<br /><br />Don't watch it!!! Go for a jog instead - at least you won't feel like killing yourself.", 'label': 0}

1b) Load from local directory

The IMDB movie review set can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/. After downloading the dataset, decompress the files.

A) If you are working with Linux or macOS, open a new terminal window, cd into the download directory, and execute

tar -zxf aclImdb_v1.tar.gz

B) If you are working with Windows, download an archiver such as 7Zip to extract the files from the download archive.

C) Use the following code to download and unzip the dataset via Python

Download the movie reviews

import os
import sys
import tarfile
import time
import urllib.request

source = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
target = "aclImdb_v1.tar.gz"

if os.path.exists(target):
    os.remove(target)


def reporthook(count, block_size, total_size):
    global start_time
    if count == 0:
        start_time = time.time()
        return
    duration = time.time() - start_time
    progress_size = int(count * block_size)
    speed = progress_size / (1024.0**2 * duration)
    percent = count * block_size * 100.0 / total_size
    sys.stdout.write(
        f"\r{int(percent)}% | {progress_size / (1024.**2):.2f} MB "
        f"| {speed:.2f} MB/s | {duration:.2f} sec elapsed"
    )
    sys.stdout.flush()


if not os.path.isdir("aclImdb") and not os.path.isfile("aclImdb_v1.tar.gz"):
    urllib.request.urlretrieve(source, target, reporthook)
100% | 80.23 MB | 5.11 MB/s | 15.71 sec elapsed
if not os.path.isdir("aclImdb"):
    with tarfile.open(target, "r:gz") as tar:
        tar.extractall()

Convert them to a pandas DataFrame and save them as CSV

import os
import sys
import numpy as np
import pandas as pd
from packaging import version
from tqdm import tqdm

# change the `basepath` to the directory of the
# unzipped movie dataset
basepath = "aclImdb"

labels = {"pos": 1, "neg": 0}

df = pd.DataFrame()
with tqdm(total=50000) as pbar:
    for s in ("test", "train"):
        for l in ("pos", "neg"):
            path = os.path.join(basepath, s, l)
            for file in sorted(os.listdir(path)):
                with open(os.path.join(path, file), "r", encoding="utf-8") as infile:
                    txt = infile.read()
                if version.parse(pd.__version__) >= version.parse("1.3.2"):
                    x = pd.DataFrame(
                        [[txt, labels[l]]], columns=["review", "sentiment"]
                    )
                    df = pd.concat([df, x], ignore_index=False)
                else:
                    df = df.append([[txt, labels[l]]], ignore_index=True)
                pbar.update()
df.columns = ["text", "label"]
100%|██████████| 50000/50000 [01:08<00:00, 725.46it/s]
import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

Basic dataset analysis and sanity checks

print("Class distribution:") np.bincount(df["label"].values)
Class distribution:
array([25000, 25000])
text_len = df["text"].apply(lambda x: len(x.split()))
text_len.min(), text_len.median(), text_len.max()
(4, 173.0, 2470)

Split data into training, validation, and test sets

df_shuffled = df.sample(frac=1, random_state=1).reset_index()

df_train = df_shuffled.iloc[:35_000]
df_val = df_shuffled.iloc[35_000:40_000]
df_test = df_shuffled.iloc[40_000:]

df_train.to_csv("train.csv", index=False, encoding="utf-8")
df_val.to_csv("validation.csv", index=False, encoding="utf-8")
df_test.to_csv("test.csv", index=False, encoding="utf-8")

2 Tokenization and Numericalization

Load the dataset via load_dataset

imdb_dataset = load_dataset(
    "csv",
    data_files={
        "train": "train.csv",
        "validation": "validation.csv",
        "test": "test.csv",
    },
)

print(imdb_dataset)
DatasetDict({
    train: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 35000
    })
    validation: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 10000
    })
})

Tokenize the dataset

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print("Tokenizer input max length:", tokenizer.model_max_length)
print("Tokenizer vocabulary size:", tokenizer.vocab_size)
Tokenizer input max length: 512
Tokenizer vocabulary size: 30522
def tokenize_text(batch):
    return tokenizer(batch["text"], truncation=True, padding=True)
imdb_tokenized = imdb_dataset.map(tokenize_text, batched=True, batch_size=None)
del imdb_dataset
imdb_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])
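
As a quick sanity check (this cell is not part of the original notebook), you can peek at a single tokenized training example to confirm that input_ids, attention_mask, and label are now PyTorch tensors; a minimal sketch:

# Optional sanity check: inspect one tokenized training example
example = imdb_tokenized["train"][0]

print(example["input_ids"].shape)       # sequence length after padding/truncation (at most 512)
print(example["attention_mask"].shape)  # same length as input_ids
print(example["label"])                 # tensor(0) or tensor(1)

# map the first few token IDs back to their subword strings
print(tokenizer.convert_ids_to_tokens(example["input_ids"][:10].tolist()))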
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

3 Set Up DataLoaders

from torch.utils.data import DataLoader, Dataset


class IMDBDataset(Dataset):
    def __init__(self, dataset_dict, partition_key="train"):
        self.partition = dataset_dict[partition_key]

    def __getitem__(self, index):
        return self.partition[index]

    def __len__(self):
        return self.partition.num_rows
train_dataset = IMDBDataset(imdb_tokenized, partition_key="train")
val_dataset = IMDBDataset(imdb_tokenized, partition_key="validation")
test_dataset = IMDBDataset(imdb_tokenized, partition_key="test")

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=12,
    shuffle=True,
    num_workers=4
)

val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=12,
    num_workers=4
)

test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=12,
    num_workers=4
)
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary. warnings.warn(_create_warning_msg(
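
Before wiring these loaders into the Trainer, it can help to pull one batch from train_loader and check the tensor shapes the model will receive (an optional check that is not in the original notebook):

# Optional: fetch one batch to verify what the default collate function produces
batch = next(iter(train_loader))

print(batch["input_ids"].shape)       # (batch_size, sequence_length), here batch_size=12
print(batch["attention_mask"].shape)  # same shape as input_ids
print(batch["label"].shape)           # (batch_size,)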

4 Initializing DistilBERT

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
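
Since the classification head is freshly initialized, an optional forward pass on one batch (not in the original notebook) shows that the model already returns a cross-entropy loss and one logit per class, which is exactly what the LightningModule in the next section relies on:

# Optional: run one batch through the still-untrained model (on CPU)
import torch

batch = next(iter(train_loader))  # reuse the DataLoader from above
with torch.no_grad():
    outputs = model(
        batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["label"],
    )

print(outputs["loss"])          # roughly ln(2) ≈ 0.69 before any finetuning
print(outputs["logits"].shape)  # torch.Size([12, 2]) -- one logit per class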

5 Finetuning

Wrap in LightningModule for Training

import lightning as L
import torch
import torchmetrics


class LightningModel(L.LightningModule):
    def __init__(self, model, learning_rate=5e-5):
        super().__init__()

        self.learning_rate = learning_rate
        self.model = model

        self.val_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2)
        self.test_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2)

    def forward(self, input_ids, attention_mask, labels):
        return self.model(input_ids, attention_mask=attention_mask, labels=labels)

    def training_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], attention_mask=batch["attention_mask"],
                       labels=batch["label"])
        self.log("train_loss", outputs["loss"])
        return outputs["loss"]  # this is passed to the optimizer for training

    def validation_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], attention_mask=batch["attention_mask"],
                       labels=batch["label"])
        self.log("val_loss", outputs["loss"], prog_bar=True)

        logits = outputs["logits"]
        predicted_labels = torch.argmax(logits, 1)
        self.val_acc(predicted_labels, batch["label"])
        self.log("val_acc", self.val_acc, prog_bar=True)

    def test_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], attention_mask=batch["attention_mask"],
                       labels=batch["label"])

        logits = outputs["logits"]
        predicted_labels = torch.argmax(logits, 1)
        self.test_acc(predicted_labels, batch["label"])
        self.log("accuracy", self.test_acc, prog_bar=True)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        return optimizer


lightning_model = LightningModel(model)
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.loggers import CSVLogger

callbacks = [
    ModelCheckpoint(
        save_top_k=1, mode="max", monitor="val_acc"
    )  # save top 1 model
]
logger = CSVLogger(save_dir="logs/", name="my-model")
trainer = L.Trainer(
    max_epochs=3,
    callbacks=callbacks,
    accelerator="gpu",
    devices=1,
    logger=logger,
    log_every_n_steps=10,
)

trainer.fit(model=lightning_model,
            train_dataloaders=train_loader,
            val_dataloaders=val_loader)
INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name     | Type                                 | Params
-----------------------------------------------------------------
0 | model    | DistilBertForSequenceClassification | 67.0 M
1 | val_acc  | MulticlassAccuracy                   | 0
2 | test_acc | MulticlassAccuracy                   | 0
-----------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
INFO: `Trainer.fit` stopped: `max_epochs=3` reached.
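
Because metrics were logged with a CSVLogger, the run's loss and accuracy values end up in a metrics.csv file inside the logger's directory (logs/my-model/version_<n>/). An optional snippet (not part of the original notebook) to inspect them:

# Optional: load the logged metrics; trainer.logger.log_dir points at the
# current run's directory, e.g. logs/my-model/version_0
import pandas as pd

metrics = pd.read_csv(f"{trainer.logger.log_dir}/metrics.csv")
print(metrics.tail())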
trainer.test(lightning_model, dataloaders=train_loader, ckpt_path="best")
INFO: Restoring states from the checkpoint path at logs/my-model/version_0/checkpoints/epoch=1-step=5834.ckpt
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: Loaded model weights from the checkpoint at logs/my-model/version_0/checkpoints/epoch=1-step=5834.ckpt
/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/data_connector.py:490: PossibleUserWarning: Your `test_dataloader`'s sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test dataloaders.
[{'accuracy': 0.9881714582443237}]
trainer.test(lightning_model, dataloaders=val_loader, ckpt_path="best")
INFO: Restoring states from the checkpoint path at logs/my-model/version_0/checkpoints/epoch=1-step=5834.ckpt
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: Loaded model weights from the checkpoint at logs/my-model/version_0/checkpoints/epoch=1-step=5834.ckpt
[{'accuracy': 0.9308000206947327}]
trainer.test(lightning_model, dataloaders=test_loader, ckpt_path="best")
INFO: Restoring states from the checkpoint path at logs/my-model/version_0/checkpoints/epoch=1-step=5834.ckpt
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: Loaded model weights from the checkpoint at logs/my-model/version_0/checkpoints/epoch=1-step=5834.ckpt
[{'accuracy': 0.9222999811172485}]
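
As a final, optional step (not in the original notebook), the finetuned model can be used to classify a new review. The review text below is made up for illustration, and the snippet assumes the best checkpoint weights are the ones currently loaded in lightning_model:

# Optional: score a new (made-up) review with the finetuned model
import torch

text = "This movie was a pleasant surprise with wonderful acting."
inputs = tokenizer(text, truncation=True, return_tensors="pt")
inputs = {k: v.to(lightning_model.device) for k, v in inputs.items()}

lightning_model.eval()
with torch.no_grad():
    logits = lightning_model.model(**inputs).logits

prediction = torch.argmax(logits, dim=1).item()
print("positive" if prediction == 1 else "negative")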