GitHub Repository: rasbt/machine-learning-book
Path: blob/main/ch16/bonus-distilbert-lightning-trainer/distilbert_finetuning-full.ipynb

Finetuning a DistilBERT Classifier Using the Lightning Trainer

# pip install transformers
# pip install datasets
# pip install lightning
# pip install watermark
%load_ext watermark
%watermark -p torch,transformers,datasets,lightning
torch       : 2.0.1+cu118
transformers: 4.33.2
datasets    : 2.14.5
lightning   : 2.0.9

1 Loading the Dataset

The IMDB movie review dataset consists of 50k movie reviews with sentiment labels (0: negative, 1: positive).

1a) Load from datasets Hub

from datasets import list_datasets, load_dataset
# list_datasets()
imdb_data = load_dataset("imdb")
print(imdb_data)
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
imdb_data["train"][99]
{'text': "This film is terrible. You don't really need to read this review further. If you are planning on watching it, suffice to say - don't (unless you are studying how not to make a good movie).<br /><br />The acting is horrendous... serious amateur hour. Throughout the movie I thought that it was interesting that they found someone who speaks and looks like Michael Madsen, only to find out that it is actually him! A new low even for him!!<br /><br />The plot is terrible. People who claim that it is original or good have probably never seen a decent movie before. Even by the standard of Hollywood action flicks, this is a terrible movie.<br /><br />Don't watch it!!! Go for a jog instead - at least you won't feel like killing yourself.", 'label': 0}

1b) Load from local directory

The IMDB movie review set can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/. After downloading the dataset, decompress the files.

A) If you are working with Linux or macOS, open a new terminal window, cd into the download directory, and execute

tar -zxf aclImdb_v1.tar.gz

B) If you are working with Windows, download an archiver such as 7Zip to extract the files from the download archive.

C) Use the following code to download and unzip the dataset via Python

Download the movie reviews

import os
import sys
import tarfile
import time
import urllib.request

source = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
target = "aclImdb_v1.tar.gz"

if os.path.exists(target):
    os.remove(target)


def reporthook(count, block_size, total_size):
    global start_time
    if count == 0:
        start_time = time.time()
        return
    duration = time.time() - start_time
    progress_size = int(count * block_size)
    speed = progress_size / (1024.0**2 * duration)
    percent = count * block_size * 100.0 / total_size
    sys.stdout.write(
        f"\r{int(percent)}% | {progress_size / (1024.**2):.2f} MB "
        f"| {speed:.2f} MB/s | {duration:.2f} sec elapsed"
    )
    sys.stdout.flush()


if not os.path.isdir("aclImdb") and not os.path.isfile("aclImdb_v1.tar.gz"):
    urllib.request.urlretrieve(source, target, reporthook)
100% | 80.23 MB | 5.11 MB/s | 15.71 sec elapsed
if not os.path.isdir("aclImdb"):
    with tarfile.open(target, "r:gz") as tar:
        tar.extractall()

Convert them to a pandas DataFrame and save them as CSV

import os
import sys
import numpy as np
import pandas as pd
from packaging import version
from tqdm import tqdm

# change the `basepath` to the directory of the
# unzipped movie dataset
basepath = "aclImdb"

labels = {"pos": 1, "neg": 0}

df = pd.DataFrame()
with tqdm(total=50000) as pbar:
    for s in ("test", "train"):
        for l in ("pos", "neg"):
            path = os.path.join(basepath, s, l)
            for file in sorted(os.listdir(path)):
                with open(os.path.join(path, file), "r", encoding="utf-8") as infile:
                    txt = infile.read()
                if version.parse(pd.__version__) >= version.parse("1.3.2"):
                    x = pd.DataFrame(
                        [[txt, labels[l]]], columns=["review", "sentiment"]
                    )
                    df = pd.concat([df, x], ignore_index=False)
                else:
                    df = df.append([[txt, labels[l]]], ignore_index=True)
                pbar.update()
df.columns = ["text", "label"]
100%|██████████| 50000/50000 [01:08<00:00, 725.46it/s]
import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

Basic dataset analysis and sanity checks

print("Class distribution:") np.bincount(df["label"].values)
Class distribution:
array([25000, 25000])
text_len = df["text"].apply(lambda x: len(x.split()))
text_len.min(), text_len.median(), text_len.max()
(4, 173.0, 2470)

Split data into training, validation, and test sets

df_shuffled = df.sample(frac=1, random_state=1).reset_index()

df_train = df_shuffled.iloc[:35_000]
df_val = df_shuffled.iloc[35_000:40_000]
df_test = df_shuffled.iloc[40_000:]

df_train.to_csv("train.csv", index=False, encoding="utf-8")
df_val.to_csv("validation.csv", index=False, encoding="utf-8")
df_test.to_csv("test.csv", index=False, encoding="utf-8")

2 Tokenization and Numericalization

Load the dataset via load_dataset

imdb_dataset = load_dataset(
    "csv",
    data_files={
        "train": "train.csv",
        "validation": "validation.csv",
        "test": "test.csv",
    },
)

print(imdb_dataset)
DatasetDict({
    train: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 35000
    })
    validation: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 10000
    })
})

Tokenize the dataset

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print("Tokenizer input max length:", tokenizer.model_max_length)
print("Tokenizer vocabulary size:", tokenizer.vocab_size)
Tokenizer input max length: 512
Tokenizer vocabulary size: 30522
def tokenize_text(batch):
    return tokenizer(batch["text"], truncation=True, padding=True)
imdb_tokenized = imdb_dataset.map(tokenize_text, batched=True, batch_size=None)
del imdb_dataset
imdb_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])
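
As a quick sanity check (this cell is not part of the original notebook), you can peek at a single tokenized training example to confirm that input_ids, attention_mask, and label are now PyTorch tensors; a minimal sketch:

# Optional sanity check: inspect one tokenized training example
example = imdb_tokenized["train"][0]

print(example["input_ids"].shape)       # sequence length after padding/truncation (at most 512)
print(example["attention_mask"].shape)  # same length as input_ids
print(example["label"])                 # tensor(0) or tensor(1)

# map the first few token IDs back to their subword strings
print(tokenizer.convert_ids_to_tokens(example["input_ids"][:10].tolist()))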
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

3 Set Up DataLoaders

from torch.utils.data import DataLoader, Dataset


class IMDBDataset(Dataset):
    def __init__(self, dataset_dict, partition_key="train"):
        self.partition = dataset_dict[partition_key]

    def __getitem__(self, index):
        return self.partition[index]

    def __len__(self):
        return self.partition.num_rows
train_dataset = IMDBDataset(imdb_tokenized, partition_key="train")
val_dataset = IMDBDataset(imdb_tokenized, partition_key="validation")
test_dataset = IMDBDataset(imdb_tokenized, partition_key="test")

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=12,
    shuffle=True,
    num_workers=4
)

val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=12,
    num_workers=4
)

test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=12,
    num_workers=4
)
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary. warnings.warn(_create_warning_msg(
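
Before wiring these loaders into the Trainer, it can help to pull one batch from train_loader and check the tensor shapes the model will receive (an optional check that is not in the original notebook):

# Optional: fetch one batch to verify what the default collate function produces
batch = next(iter(train_loader))

print(batch["input_ids"].shape)       # (batch_size, sequence_length), here batch_size=12
print(batch["attention_mask"].shape)  # same shape as input_ids
print(batch["label"].shape)           # (batch_size,)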

4 Initializing DistilBERT

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
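
Since the classification head is freshly initialized, an optional forward pass on one batch (not in the original notebook) shows that the model already returns a cross-entropy loss and one logit per class, which is exactly what the LightningModule in the next section relies on:

# Optional: run one batch through the still-untrained model (on CPU)
import torch

batch = next(iter(train_loader))  # reuse the DataLoader from above
with torch.no_grad():
    outputs = model(
        batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["label"],
    )

print(outputs["loss"])          # roughly ln(2) ≈ 0.69 before any finetuning
print(outputs["logits"].shape)  # torch.Size([12, 2]) -- one logit per class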

5 Finetuning

Wrap in LightningModule for Training

import lightning as L
import torch
import torchmetrics


class LightningModel(L.LightningModule):
    def __init__(self, model, learning_rate=5e-5):
        super().__init__()

        self.learning_rate = learning_rate
        self.model = model

        self.val_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2)
        self.test_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2)

    def forward(self, input_ids, attention_mask, labels):
        return self.model(input_ids, attention_mask=attention_mask, labels=labels)

    def training_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], attention_mask=batch["attention_mask"],
                       labels=batch["label"])
        self.log("train_loss", outputs["loss"])
        return outputs["loss"]  # this is passed to the optimizer for training

    def validation_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], attention_mask=batch["attention_mask"],
                       labels=batch["label"])
        self.log("val_loss", outputs["loss"], prog_bar=True)

        logits = outputs["logits"]
        predicted_labels = torch.argmax(logits, 1)
        self.val_acc(predicted_labels, batch["label"])
        self.log("val_acc", self.val_acc, prog_bar=True)

    def test_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], attention_mask=batch["attention_mask"],
                       labels=batch["label"])

        logits = outputs["logits"]
        predicted_labels = torch.argmax(logits, 1)
        self.test_acc(predicted_labels, batch["label"])
        self.log("accuracy", self.test_acc, prog_bar=True)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        return optimizer


lightning_model = LightningModel(model)
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.loggers import CSVLogger

callbacks = [
    ModelCheckpoint(
        save_top_k=1, mode="max", monitor="val_acc"
    )  # save top 1 model
]
logger = CSVLogger(save_dir="logs/", name="my-model")
trainer = L.Trainer(
    max_epochs=3,
    callbacks=callbacks,
    accelerator="gpu",
    devices=1,
    logger=logger,
    log_every_n_steps=10,
)

trainer.fit(model=lightning_model,
            train_dataloaders=train_loader,
            val_dataloaders=val_loader)
INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name     | Type                                 | Params
-----------------------------------------------------------------
0 | model    | DistilBertForSequenceClassification | 67.0 M
1 | val_acc  | MulticlassAccuracy                   | 0
2 | test_acc | MulticlassAccuracy                   | 0
-----------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
INFO: `Trainer.fit` stopped: `max_epochs=3` reached.
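
Because metrics were logged with a CSVLogger, the run's loss and accuracy values end up in a metrics.csv file inside the logger's directory (logs/my-model/version_<n>/). An optional snippet (not part of the original notebook) to inspect them:

# Optional: load the logged metrics; trainer.logger.log_dir points at the
# current run's directory, e.g. logs/my-model/version_0
import pandas as pd

metrics = pd.read_csv(f"{trainer.logger.log_dir}/metrics.csv")
print(metrics.tail())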
trainer.test(lightning_model, dataloaders=train_loader, ckpt_path="best")
INFO: Restoring states from the checkpoint path at logs/my-model/version_0/checkpoints/epoch=1-step=5834.ckpt
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: Loaded model weights from the checkpoint at logs/my-model/version_0/checkpoints/epoch=1-step=5834.ckpt
/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/data_connector.py:490: PossibleUserWarning: Your `test_dataloader`'s sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test dataloaders.
[{'accuracy': 0.9881714582443237}]
trainer.test(lightning_model, dataloaders=val_loader, ckpt_path="best")
INFO: Restoring states from the checkpoint path at logs/my-model/version_0/checkpoints/epoch=1-step=5834.ckpt
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: Loaded model weights from the checkpoint at logs/my-model/version_0/checkpoints/epoch=1-step=5834.ckpt
[{'accuracy': 0.9308000206947327}]
trainer.test(lightning_model, dataloaders=test_loader, ckpt_path="best")
INFO: Restoring states from the checkpoint path at logs/my-model/version_0/checkpoints/epoch=1-step=5834.ckpt
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: Loaded model weights from the checkpoint at logs/my-model/version_0/checkpoints/epoch=1-step=5834.ckpt
[{'accuracy': 0.9222999811172485}]
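
As a final, optional step (not in the original notebook), the finetuned model can be used to classify a new review. The review text below is made up for illustration, and the snippet assumes the best checkpoint weights are the ones currently loaded in lightning_model:

# Optional: score a new (made-up) review with the finetuned model
import torch

text = "This movie was a pleasant surprise with wonderful acting."
inputs = tokenizer(text, truncation=True, return_tensors="pt")
inputs = {k: v.to(lightning_model.device) for k, v in inputs.items()}

lightning_model.eval()
with torch.no_grad():
    logits = lightning_model.model(**inputs).logits

prediction = torch.argmax(logits, dim=1).item()
print("positive" if prediction == 1 else "negative")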