
Machine Learning with PyTorch and Scikit-Learn

-- Code Examples

Package version checks

Add the parent folder to the Python path so that the check_packages.py script can be imported:

import sys
sys.path.insert(0, '..')

Check recommended package versions:

from python_environment_check import check_packages

d = {
    'torch': '1.9.0',
    'transformers': '4.9.1',
}
check_packages(d)
[OK] Your Python version is 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) [GCC 9.3.0]
[OK] torch 1.10.0
[OK] transformers 4.9.1

Chapter 16: Transformers – Improving Natural Language Processing with Attention Mechanisms (Part 2/3)

from IPython.display import Image

Building large-scale language models by leveraging unlabeled data

Pre-training and fine-tuning transformer models

Image(filename='figures/16_10.png', width=800)
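The figure contrasts pre-training on unlabeled text with fine-tuning on a labeled downstream task. As a minimal sketch of the fine-tuning idea (not from the book; the example sentence and the two-class head are made up for illustration), one can load a pre-trained GPT-2 body and attach a new, randomly initialized classification head:

import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
body = GPT2Model.from_pretrained('gpt2')            # pre-trained weights
head = torch.nn.Linear(body.config.hidden_size, 2)  # new head for a hypothetical 2-class task

inputs = tokenizer("This movie was surprisingly good", return_tensors='pt')
hidden = body(**inputs).last_hidden_state           # [1, seq_len, 768]
logits = head(hidden[:, -1, :])                     # classify from the last token's representation

During fine-tuning, the head (and optionally the pre-trained body) would then be updated on the labeled task data.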

Leveraging unlabeled data with GPT

Image(filename='figures/16_11.png', width=800)
Image(filename='figures/16_12.png', width=800)

Using GPT-2 to generate new text

from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='gpt2')
set_seed(123)
generator("Hey readers, today is",
          max_length=20,
          num_return_sequences=3)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': "Hey readers, today is not the last time we'll be seeing one of our favorite indie rock bands"},
 {'generated_text': 'Hey readers, today is Christmas. This is not Christmas, because Christmas is so long and I hope'},
 {'generated_text': "Hey readers, today is CTA Day!\n\nWe're proud to be hosting a special event"}]
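The pipeline forwards standard decoding arguments to the underlying generate() method, so the sampling behavior can be tuned; the values below are illustrative, not from the book:

generator("Hey readers, today is",
          max_length=20,
          num_return_sequences=1,
          do_sample=True,   # sample instead of greedy decoding
          top_k=50,         # restrict sampling to the 50 most likely tokens
          top_p=0.95,       # nucleus sampling threshold
          temperature=0.8)  # soften/sharpen the predicted distribution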
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
text = "Let us encode this sentence"
encoded_input = tokenizer(text, return_tensors='pt')
encoded_input
{'input_ids': tensor([[ 5756, 514, 37773, 428, 6827]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
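The integer ids can be mapped back to BPE token strings ('Ġ' marks a preceding space in GPT-2's byte-level BPE) and decoded to plain text; a quick round-trip check:

tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0])
tokenizer.decode(encoded_input['input_ids'][0])  # 'Let us encode this sentence'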
from transformers import GPT2Model

model = GPT2Model.from_pretrained('gpt2')
output = model(**encoded_input)
output['last_hidden_state'].shape
torch.Size([1, 5, 768])
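Note that GPT2Model returns only the hidden states. As a sketch (not in the original notebook) of what text generation uses under the hood, the language-modeling variant projects these hidden states onto the 50,257-token vocabulary to predict the next token:

from transformers import GPT2LMHeadModel

lm_model = GPT2LMHeadModel.from_pretrained('gpt2')
logits = lm_model(**encoded_input).logits    # shape: [1, 5, 50257]
next_token_id = int(logits[0, -1].argmax())  # greedy choice for the next token
tokenizer.decode([next_token_id])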

Bidirectional pre-training with BERT

Image(filename='figures/16_13.png', width=700)
Image(filename='figures/16_14.png', width=600)
Image(filename='figures/16_15.png', width=800)
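BERT's masked-language-modeling pre-training objective can be probed directly via the fill-mask pipeline; a small example of our own (the input sentence is made up):

from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("Transformers are a powerful [MASK] architecture.")  # returns top candidates for [MASK]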

The best of both worlds: BART

Image(filename='figures/16_16.png', width=500)
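Since BART pairs a bidirectional encoder (as in BERT) with an autoregressive decoder (as in GPT), it lends itself to sequence-to-sequence tasks such as summarization. A hedged example, not from the book; the model choice and input text are illustrative:

from transformers import pipeline

summarizer = pipeline('summarization', model='facebook/bart-large-cnn')
text = ("BART combines a bidirectional encoder, as in BERT, with an "
        "autoregressive decoder, as in GPT. It is pre-trained by corrupting "
        "input text and learning to reconstruct the original.")
summarizer(text, max_length=30, min_length=10)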

Readers may ignore the next cell.

! python ../.convert_notebook_to_script.py --input ch16-part2-gpt2.ipynb --output ch16-part2-gpt2.py
[NbConvertApp] WARNING | Config option `kernel_spec_manager_class` not recognized by `NbConvertApp`.
[NbConvertApp] Converting notebook ch16-part2-gpt2.ipynb to script
[NbConvertApp] Writing 2690 bytes to ch16-part2-gpt2.py