GitHub Repository: rasbt/machine-learning-book
Path: blob/main/ch16/ch16-part2-gpt2.py
# coding: utf-8


import sys
from python_environment_check import check_packages
from transformers import pipeline, set_seed
from transformers import GPT2Tokenizer
from transformers import GPT2Model

# # Machine Learning with PyTorch and Scikit-Learn
# # -- Code Examples

# ## Package version checks

# Add the parent folder to the Python path so that the check_packages.py script can be imported:


sys.path.insert(0, '..')


# Check recommended package versions:

d = {
    'torch': '1.9.0',
    'transformers': '4.9.1',
}
check_packages(d)
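# check_packages compares the installed versions of the packages listed above with the
# recommended minimum versions and reports any that are missing or outdated.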


# # Chapter 16: Transformers – Improving Natural Language Processing with Attention Mechanisms (Part 2/3)
# **Outline**
#
# - [Building large-scale language models by leveraging unlabeled data](#Building-large-scale-language-models-by-leveraging-unlabeled-data)
# - [Pre-training and fine-tuning transformer models](#Pre-training-and-fine-tuning-transformer-models)
# - [Leveraging unlabeled data with GPT](#Leveraging-unlabeled-data-with-GPT)
# - [Using GPT-2 to generate new text](#Using-GPT-2-to-generate-new-text)
# - [Bidirectional pre-training with BERT](#Bidirectional-pre-training-with-BERT)
# - [The best of both worlds: BART](#The-best-of-both-worlds-BART)


# ## Building large-scale language models by leveraging unlabeled data
# ## Pre-training and fine-tuning transformer models
#
#


# ## Leveraging unlabeled data with GPT


# ### Using GPT-2 to generate new text


generator = pipeline('text-generation', model='gpt2')
set_seed(123)
generator("Hey readers, today is",
78
max_length=20,
79
num_return_sequences=3)
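
# The text-generation pipeline returns a list of dictionaries, one per generated
# sequence, each with a 'generated_text' entry. As a small illustration (an addition to
# the book's code, not part of it), the three continuations could be printed like this:

results = generator("Hey readers, today is",
                    max_length=20,
                    num_return_sequences=3)
for i, res in enumerate(results):
    print(f"{i}: {res['generated_text']}")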


tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
text = "Let us encode this sentence"
encoded_input = tokenizer(text, return_tensors='pt')
encoded_input
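
# encoded_input is a BatchEncoding holding the token IDs and the attention mask as
# PyTorch tensors. As a quick sanity check (an illustrative addition, not part of the
# original book code), the IDs can be decoded back into the input string:

tokenizer.decode(encoded_input['input_ids'][0])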


model = GPT2Model.from_pretrained('gpt2')
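# GPT2Model is the bare GPT-2 transformer without the language-modeling head, so it
# returns hidden states rather than next-token logits; GPT2LMHeadModel would be the
# class to use for text generation.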


output = model(**encoded_input)
output['last_hidden_state'].shape
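
# The shape above is [batch_size, number_of_tokens, hidden_size]; the base 'gpt2'
# checkpoint uses a hidden size of 768. As an illustrative addition (not part of the
# original book code), these dimensions can be read off the model configuration:

print(model.config.n_embd)   # hidden (embedding) dimension, 768 for 'gpt2'
print(model.config.n_layer)  # number of transformer blocks, 12 for 'gpt2'
print(model.config.n_head)   # attention heads per block, 12 for 'gpt2'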


# ### Bidirectional pre-training with BERT
#
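
# BERT is pre-trained with a masked language modeling objective, so it predicts masked
# tokens from both the left and the right context. The cell below is a small
# illustrative sketch (not part of the original book code) that uses the Hugging Face
# fill-mask pipeline with the 'bert-base-uncased' checkpoint:

unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("The goal of this chapter is to understand [MASK] mechanisms.")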


# ### The best of both worlds: BART
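
# BART combines a bidirectional (BERT-style) encoder with an autoregressive (GPT-style)
# decoder and is commonly fine-tuned for sequence-to-sequence tasks such as
# summarization. A brief illustrative sketch (not part of the original book code),
# assuming the 'facebook/bart-large-cnn' summarization checkpoint:

summarizer = pipeline('summarization', model='facebook/bart-large-cnn')
summarizer("Transformers are deep learning architectures that rely on self-attention "
           "to model long-range dependencies in sequences. They are pre-trained on "
           "large unlabeled corpora and then fine-tuned on smaller labeled datasets "
           "for downstream tasks.",
           max_length=30, min_length=10)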


# ---
#
# Readers may ignore the next cell.