GitHub Repository: rasbt/machine-learning-book
Path: blob/main/ch16/ch16-part2-gpt2.py
# coding: utf-8


import sys
from python_environment_check import check_packages
from transformers import pipeline, set_seed
from transformers import GPT2Tokenizer
from transformers import GPT2Model

# # Machine Learning with PyTorch and Scikit-Learn
# # -- Code Examples

# ## Package version checks

# Add the parent folder to the Python path so that the check_packages.py script can be imported:


sys.path.insert(0, '..')


# Check recommended package versions:

d = {
    'torch': '1.9.0',
    'transformers': '4.9.1',
}
check_packages(d)
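# check_packages compares the installed versions of the packages listed above with the
# recommended minimum versions and reports any that are missing or outdated.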


# # Chapter 16: Transformers – Improving Natural Language Processing with Attention Mechanisms (Part 2/3)
# **Outline**
#
# - [Building large-scale language models by leveraging unlabeled data](#Building-large-scale-language-models-by-leveraging-unlabeled-data)
# - [Pre-training and fine-tuning transformer models](#Pre-training-and-fine-tuning-transformer-models)
# - [Leveraging unlabeled data with GPT](#Leveraging-unlabeled-data-with-GPT)
# - [Using GPT-2 to generate new text](#Using-GPT-2-to-generate-new-text)
# - [Bidirectional pre-training with BERT](#Bidirectional-pre-training-with-BERT)
# - [The best of both worlds: BART](#The-best-of-both-worlds-BART)


# ## Building large-scale language models by leveraging unlabeled data
# ## Pre-training and fine-tuning transformer models
#
#


# ## Leveraging unlabeled data with GPT


# ### Using GPT-2 to generate new text


generator = pipeline('text-generation', model='gpt2')
set_seed(123)
generator("Hey readers, today is",
78
max_length=20,
79
num_return_sequences=3)
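
# The text-generation pipeline returns a list of dictionaries, one per generated
# sequence, each with a 'generated_text' entry. As a small illustration (an addition to
# the book's code, not part of it), the three continuations could be printed like this:

results = generator("Hey readers, today is",
                    max_length=20,
                    num_return_sequences=3)
for i, res in enumerate(results):
    print(f"{i}: {res['generated_text']}")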


tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
text = "Let us encode this sentence"
encoded_input = tokenizer(text, return_tensors='pt')
encoded_input
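
# encoded_input is a BatchEncoding holding the token IDs and the attention mask as
# PyTorch tensors. As a quick sanity check (an illustrative addition, not part of the
# original book code), the IDs can be decoded back into the input string:

tokenizer.decode(encoded_input['input_ids'][0])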


model = GPT2Model.from_pretrained('gpt2')
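# GPT2Model is the bare GPT-2 transformer without the language-modeling head, so it
# returns hidden states rather than next-token logits; GPT2LMHeadModel would be the
# class to use for text generation.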


output = model(**encoded_input)
output['last_hidden_state'].shape
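
# The shape above is [batch_size, number_of_tokens, hidden_size]; the base 'gpt2'
# checkpoint uses a hidden size of 768. As an illustrative addition (not part of the
# original book code), these dimensions can be read off the model configuration:

print(model.config.n_embd)   # hidden (embedding) dimension, 768 for 'gpt2'
print(model.config.n_layer)  # number of transformer blocks, 12 for 'gpt2'
print(model.config.n_head)   # attention heads per block, 12 for 'gpt2'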


# ### Bidirectional pre-training with BERT
#
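
# BERT is pre-trained with a masked language modeling objective, so it predicts masked
# tokens from both the left and the right context. The cell below is a small
# illustrative sketch (not part of the original book code) that uses the Hugging Face
# fill-mask pipeline with the 'bert-base-uncased' checkpoint:

unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("The goal of this chapter is to understand [MASK] mechanisms.")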


# ### The best of both worlds: BART
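
# BART combines a bidirectional (BERT-style) encoder with an autoregressive (GPT-style)
# decoder and is commonly fine-tuned for sequence-to-sequence tasks such as
# summarization. A brief illustrative sketch (not part of the original book code),
# assuming the 'facebook/bart-large-cnn' summarization checkpoint:

summarizer = pipeline('summarization', model='facebook/bart-large-cnn')
summarizer("Transformers are deep learning architectures that rely on self-attention "
           "to model long-range dependencies in sequences. They are pre-trained on "
           "large unlabeled corpora and then fine-tuned on smaller labeled datasets "
           "for downstream tasks.",
           max_length=30, min_length=10)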


# ---
#
# Readers may ignore the next cell.