# -*- coding: utf-8 -*-12"""3Chatbot Tutorial4================5**Author:** `Matthew Inkawhich <https://github.com/MatthewInkawhich>`_6"""789######################################################################10# In this tutorial, we explore a fun and interesting use-case of recurrent11# sequence-to-sequence models. We will train a simple chatbot using movie12# scripts from the `Cornell Movie-Dialogs13# Corpus <https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html>`__.14#15# Conversational models are a hot topic in artificial intelligence16# research. Chatbots can be found in a variety of settings, including17# customer service applications and online helpdesks. These bots are often18# powered by retrieval-based models, which output predefined responses to19# questions of certain forms. In a highly restricted domain like a20# company’s IT helpdesk, these models may be sufficient, however, they are21# not robust enough for more general use-cases. Teaching a machine to22# carry out a meaningful conversation with a human in multiple domains is23# a research question that is far from solved. Recently, the deep learning24# boom has allowed for powerful generative models like Google’s `Neural25# Conversational Model <https://arxiv.org/abs/1506.05869>`__, which marks26# a large step towards multi-domain generative conversational models. In27# this tutorial, we will implement this kind of model in PyTorch.28#29# .. figure:: /_static/img/chatbot/bot.png30# :align: center31# :alt: bot32#33# .. code-block:: python34#35# > hello?36# Bot: hello .37# > where am I?38# Bot: you re in a hospital .39# > who are you?40# Bot: i m a lawyer .41# > how are you doing?42# Bot: i m fine .43# > are you my friend?44# Bot: no .45# > you're under arrest46# Bot: i m trying to help you !47# > i'm just kidding48# Bot: i m sorry .49# > where are you from?50# Bot: san francisco .51# > it's time for me to leave52# Bot: i know .53# > goodbye54# Bot: goodbye .55#56# **Tutorial Highlights**57#58# - Handle loading and preprocessing of `Cornell Movie-Dialogs59# Corpus <https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html>`__60# dataset61# - Implement a sequence-to-sequence model with `Luong attention62# mechanism(s) <https://arxiv.org/abs/1508.04025>`__63# - Jointly train encoder and decoder models using mini-batches64# - Implement greedy-search decoding module65# - Interact with trained chatbot66#67# **Acknowledgments**68#69# This tutorial borrows code from the following sources:70#71# 1) Yuan-Kuei Wu’s pytorch-chatbot implementation:72# https://github.com/ywk991112/pytorch-chatbot73#74# 2) Sean Robertson’s practical-pytorch seq2seq-translation example:75# https://github.com/spro/practical-pytorch/tree/master/seq2seq-translation76#77# 3) FloydHub Cornell Movie Corpus preprocessing code:78# https://github.com/floydhub/textutil-preprocess-cornell-movie-corpus79#808182######################################################################83# Preparations84# ------------85#86# To get started, `download <https://zissou.infosci.cornell.edu/convokit/datasets/movie-corpus/movie-corpus.zip>`__ the Movie-Dialogs Corpus zip file.8788# and put in a ``data/`` directory under the current directory.89#90# After that, let’s import some necessities.91#9293import torch94from torch.jit import script, trace95import torch.nn as nn96from torch import optim97import torch.nn.functional as F98import csv99import random100import re101import os102import unicodedata103import codecs104from io import open105import itertools106import math107import json108109110# If the current `accelerator <https://pytorch.org/docs/stable/torch.html#accelerators>`__ is available,111# we will use it. Otherwise, we use the CPU.112device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"113print(f"Using {device} device")114115116######################################################################117# Load & Preprocess Data118# ----------------------119#120# The next step is to reformat our data file and load the data into121# structures that we can work with.122#123# The `Cornell Movie-Dialogs124# Corpus <https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html>`__125# is a rich dataset of movie character dialog:126#127# - 220,579 conversational exchanges between 10,292 pairs of movie128# characters129# - 9,035 characters from 617 movies130# - 304,713 total utterances131#132# This dataset is large and diverse, and there is a great variation of133# language formality, time periods, sentiment, etc. Our hope is that this134# diversity makes our model robust to many forms of inputs and queries.135#136# First, we’ll take a look at some lines of our datafile to see the137# original format.138#139140corpus_name = "movie-corpus"141corpus = os.path.join("data", corpus_name)142143def printLines(file, n=10):144with open(file, 'rb') as datafile:145lines = datafile.readlines()146for line in lines[:n]:147print(line)148149printLines(os.path.join(corpus, "utterances.jsonl"))150151152######################################################################153# Create formatted data file154# ~~~~~~~~~~~~~~~~~~~~~~~~~~155#156# For convenience, we'll create a nicely formatted data file in which each line157# contains a tab-separated *query sentence* and a *response sentence* pair.158#159# The following functions facilitate the parsing of the raw160# ``utterances.jsonl`` data file.161#162# - ``loadLinesAndConversations`` splits each line of the file into a dictionary of163# lines with fields: ``lineID``, ``characterID``, and text and then groups them164# into conversations with fields: ``conversationID``, ``movieID``, and lines.165# - ``extractSentencePairs`` extracts pairs of sentences from166# conversations167#168169# Splits each line of the file to create lines and conversations170def loadLinesAndConversations(fileName):171lines = {}172conversations = {}173with open(fileName, 'r', encoding='iso-8859-1') as f:174for line in f:175lineJson = json.loads(line)176# Extract fields for line object177lineObj = {}178lineObj["lineID"] = lineJson["id"]179lineObj["characterID"] = lineJson["speaker"]180lineObj["text"] = lineJson["text"]181lines[lineObj['lineID']] = lineObj182183# Extract fields for conversation object184if lineJson["conversation_id"] not in conversations:185convObj = {}186convObj["conversationID"] = lineJson["conversation_id"]187convObj["movieID"] = lineJson["meta"]["movie_id"]188convObj["lines"] = [lineObj]189else:190convObj = conversations[lineJson["conversation_id"]]191convObj["lines"].insert(0, lineObj)192conversations[convObj["conversationID"]] = convObj193194return lines, conversations195196197# Extracts pairs of sentences from conversations198def extractSentencePairs(conversations):199qa_pairs = []200for conversation in conversations.values():201# Iterate over all the lines of the conversation202for i in range(len(conversation["lines"]) - 1): # We ignore the last line (no answer for it)203inputLine = conversation["lines"][i]["text"].strip()204targetLine = conversation["lines"][i+1]["text"].strip()205# Filter wrong samples (if one of the lists is empty)206if inputLine and targetLine:207qa_pairs.append([inputLine, targetLine])208return qa_pairs209210211######################################################################212# Now we’ll call these functions and create the file. We’ll call it213# ``formatted_movie_lines.txt``.214#215216# Define path to new file217datafile = os.path.join(corpus, "formatted_movie_lines.txt")218219delimiter = '\t'220# Unescape the delimiter221delimiter = str(codecs.decode(delimiter, "unicode_escape"))222223# Initialize lines dict and conversations dict224lines = {}225conversations = {}226# Load lines and conversations227print("\nProcessing corpus into lines and conversations...")228lines, conversations = loadLinesAndConversations(os.path.join(corpus, "utterances.jsonl"))229230# Write new csv file231print("\nWriting newly formatted file...")232with open(datafile, 'w', encoding='utf-8') as outputfile:233writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')234for pair in extractSentencePairs(conversations):235writer.writerow(pair)236237# Print a sample of lines238print("\nSample lines from file:")239printLines(datafile)240241242######################################################################243# Load and trim data244# ~~~~~~~~~~~~~~~~~~245#246# Our next order of business is to create a vocabulary and load247# query/response sentence pairs into memory.248#249# Note that we are dealing with sequences of **words**, which do not have250# an implicit mapping to a discrete numerical space. Thus, we must create251# one by mapping each unique word that we encounter in our dataset to an252# index value.253#254# For this we define a ``Voc`` class, which keeps a mapping from words to255# indexes, a reverse mapping of indexes to words, a count of each word and256# a total word count. The class provides methods for adding a word to the257# vocabulary (``addWord``), adding all words in a sentence258# (``addSentence``) and trimming infrequently seen words (``trim``). More259# on trimming later.260#261262# Default word tokens263PAD_token = 0 # Used for padding short sentences264SOS_token = 1 # Start-of-sentence token265EOS_token = 2 # End-of-sentence token266267class Voc:268def __init__(self, name):269self.name = name270self.trimmed = False271self.word2index = {}272self.word2count = {}273self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}274self.num_words = 3 # Count SOS, EOS, PAD275276def addSentence(self, sentence):277for word in sentence.split(' '):278self.addWord(word)279280def addWord(self, word):281if word not in self.word2index:282self.word2index[word] = self.num_words283self.word2count[word] = 1284self.index2word[self.num_words] = word285self.num_words += 1286else:287self.word2count[word] += 1288289# Remove words below a certain count threshold290def trim(self, min_count):291if self.trimmed:292return293self.trimmed = True294295keep_words = []296297for k, v in self.word2count.items():298if v >= min_count:299keep_words.append(k)300301print('keep_words {} / {} = {:.4f}'.format(302len(keep_words), len(self.word2index), len(keep_words) / len(self.word2index)303))304305# Reinitialize dictionaries306self.word2index = {}307self.word2count = {}308self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}309self.num_words = 3 # Count default tokens310311for word in keep_words:312self.addWord(word)313314315######################################################################316# Now we can assemble our vocabulary and query/response sentence pairs.317# Before we are ready to use this data, we must perform some318# preprocessing.319#320# First, we must convert the Unicode strings to ASCII using321# ``unicodeToAscii``. Next, we should convert all letters to lowercase and322# trim all non-letter characters except for basic punctuation323# (``normalizeString``). Finally, to aid in training convergence, we will324# filter out sentences with length greater than the ``MAX_LENGTH``325# threshold (``filterPairs``).326#327328MAX_LENGTH = 10 # Maximum sentence length to consider329330# Turn a Unicode string to plain ASCII, thanks to331# https://stackoverflow.com/a/518232/2809427332def unicodeToAscii(s):333return ''.join(334c for c in unicodedata.normalize('NFD', s)335if unicodedata.category(c) != 'Mn'336)337338# Lowercase, trim, and remove non-letter characters339def normalizeString(s):340s = unicodeToAscii(s.lower().strip())341s = re.sub(r"([.!?])", r" \1", s)342s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)343s = re.sub(r"\s+", r" ", s).strip()344return s345346# Read query/response pairs and return a voc object347def readVocs(datafile, corpus_name):348print("Reading lines...")349# Read the file and split into lines350lines = open(datafile, encoding='utf-8').\351read().strip().split('\n')352# Split every line into pairs and normalize353pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]354voc = Voc(corpus_name)355return voc, pairs356357# Returns True if both sentences in a pair 'p' are under the MAX_LENGTH threshold358def filterPair(p):359# Input sequences need to preserve the last word for EOS token360return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH361362# Filter pairs using the ``filterPair`` condition363def filterPairs(pairs):364return [pair for pair in pairs if filterPair(pair)]365366# Using the functions defined above, return a populated voc object and pairs list367def loadPrepareData(corpus, corpus_name, datafile, save_dir):368print("Start preparing training data ...")369voc, pairs = readVocs(datafile, corpus_name)370print("Read {!s} sentence pairs".format(len(pairs)))371pairs = filterPairs(pairs)372print("Trimmed to {!s} sentence pairs".format(len(pairs)))373print("Counting words...")374for pair in pairs:375voc.addSentence(pair[0])376voc.addSentence(pair[1])377print("Counted words:", voc.num_words)378return voc, pairs379380381# Load/Assemble voc and pairs382save_dir = os.path.join("data", "save")383voc, pairs = loadPrepareData(corpus, corpus_name, datafile, save_dir)384# Print some pairs to validate385print("\npairs:")386for pair in pairs[:10]:387print(pair)388389390######################################################################391# Another tactic that is beneficial to achieving faster convergence during392# training is trimming rarely used words out of our vocabulary. Decreasing393# the feature space will also soften the difficulty of the function that394# the model must learn to approximate. We will do this as a two-step395# process:396#397# 1) Trim words used under ``MIN_COUNT`` threshold using the ``voc.trim``398# function.399#400# 2) Filter out pairs with trimmed words.401#402403MIN_COUNT = 3 # Minimum word count threshold for trimming404405def trimRareWords(voc, pairs, MIN_COUNT):406# Trim words used under the MIN_COUNT from the voc407voc.trim(MIN_COUNT)408# Filter out pairs with trimmed words409keep_pairs = []410for pair in pairs:411input_sentence = pair[0]412output_sentence = pair[1]413keep_input = True414keep_output = True415# Check input sentence416for word in input_sentence.split(' '):417if word not in voc.word2index:418keep_input = False419break420# Check output sentence421for word in output_sentence.split(' '):422if word not in voc.word2index:423keep_output = False424break425426# Only keep pairs that do not contain trimmed word(s) in their input or output sentence427if keep_input and keep_output:428keep_pairs.append(pair)429430print("Trimmed from {} pairs to {}, {:.4f} of total".format(len(pairs), len(keep_pairs), len(keep_pairs) / len(pairs)))431return keep_pairs432433434# Trim voc and pairs435pairs = trimRareWords(voc, pairs, MIN_COUNT)436437438######################################################################439# Prepare Data for Models440# -----------------------441#442# Although we have put a great deal of effort into preparing and massaging our443# data into a nice vocabulary object and list of sentence pairs, our models444# will ultimately expect numerical torch tensors as inputs. One way to445# prepare the processed data for the models can be found in the `seq2seq446# translation447# tutorial <https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html>`__.448# In that tutorial, we use a batch size of 1, meaning that all we have to449# do is convert the words in our sentence pairs to their corresponding450# indexes from the vocabulary and feed this to the models.451#452# However, if you’re interested in speeding up training and/or would like453# to leverage GPU parallelization capabilities, you will need to train454# with mini-batches.455#456# Using mini-batches also means that we must be mindful of the variation457# of sentence length in our batches. To accommodate sentences of different458# sizes in the same batch, we will make our batched input tensor of shape459# *(max_length, batch_size)*, where sentences shorter than the460# *max_length* are zero padded after an *EOS_token*.461#462# If we simply convert our English sentences to tensors by converting463# words to their indexes(\ ``indexesFromSentence``) and zero-pad, our464# tensor would have shape *(batch_size, max_length)* and indexing the465# first dimension would return a full sequence across all time-steps.466# However, we need to be able to index our batch along time, and across467# all sequences in the batch. Therefore, we transpose our input batch468# shape to *(max_length, batch_size)*, so that indexing across the first469# dimension returns a time step across all sentences in the batch. We470# handle this transpose implicitly in the ``zeroPadding`` function.471#472# .. figure:: /_static/img/chatbot/seq2seq_batches.png473# :align: center474# :alt: batches475#476# The ``inputVar`` function handles the process of converting sentences to477# tensor, ultimately creating a correctly shaped zero-padded tensor. It478# also returns a tensor of ``lengths`` for each of the sequences in the479# batch which will be passed to our decoder later.480#481# The ``outputVar`` function performs a similar function to ``inputVar``,482# but instead of returning a ``lengths`` tensor, it returns a binary mask483# tensor and a maximum target sentence length. The binary mask tensor has484# the same shape as the output target tensor, but every element that is a485# *PAD_token* is 0 and all others are 1.486#487# ``batch2TrainData`` simply takes a bunch of pairs and returns the input488# and target tensors using the aforementioned functions.489#490491def indexesFromSentence(voc, sentence):492return [voc.word2index[word] for word in sentence.split(' ')] + [EOS_token]493494495def zeroPadding(l, fillvalue=PAD_token):496return list(itertools.zip_longest(*l, fillvalue=fillvalue))497498def binaryMatrix(l, value=PAD_token):499m = []500for i, seq in enumerate(l):501m.append([])502for token in seq:503if token == PAD_token:504m[i].append(0)505else:506m[i].append(1)507return m508509# Returns padded input sequence tensor and lengths510def inputVar(l, voc):511indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]512lengths = torch.tensor([len(indexes) for indexes in indexes_batch])513padList = zeroPadding(indexes_batch)514padVar = torch.LongTensor(padList)515return padVar, lengths516517# Returns padded target sequence tensor, padding mask, and max target length518def outputVar(l, voc):519indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]520max_target_len = max([len(indexes) for indexes in indexes_batch])521padList = zeroPadding(indexes_batch)522mask = binaryMatrix(padList)523mask = torch.BoolTensor(mask)524padVar = torch.LongTensor(padList)525return padVar, mask, max_target_len526527# Returns all items for a given batch of pairs528def batch2TrainData(voc, pair_batch):529pair_batch.sort(key=lambda x: len(x[0].split(" ")), reverse=True)530input_batch, output_batch = [], []531for pair in pair_batch:532input_batch.append(pair[0])533output_batch.append(pair[1])534inp, lengths = inputVar(input_batch, voc)535output, mask, max_target_len = outputVar(output_batch, voc)536return inp, lengths, output, mask, max_target_len537538539# Example for validation540small_batch_size = 5541batches = batch2TrainData(voc, [random.choice(pairs) for _ in range(small_batch_size)])542input_variable, lengths, target_variable, mask, max_target_len = batches543544print("input_variable:", input_variable)545print("lengths:", lengths)546print("target_variable:", target_variable)547print("mask:", mask)548print("max_target_len:", max_target_len)549550551######################################################################552# Define Models553# -------------554#555# Seq2Seq Model556# ~~~~~~~~~~~~~557#558# The brains of our chatbot is a sequence-to-sequence (seq2seq) model. The559# goal of a seq2seq model is to take a variable-length sequence as an560# input, and return a variable-length sequence as an output using a561# fixed-sized model.562#563# `Sutskever et al. <https://arxiv.org/abs/1409.3215>`__ discovered that564# by using two separate recurrent neural nets together, we can accomplish565# this task. One RNN acts as an **encoder**, which encodes a variable566# length input sequence to a fixed-length context vector. In theory, this567# context vector (the final hidden layer of the RNN) will contain semantic568# information about the query sentence that is input to the bot. The569# second RNN is a **decoder**, which takes an input word and the context570# vector, and returns a guess for the next word in the sequence and a571# hidden state to use in the next iteration.572#573# .. figure:: /_static/img/chatbot/seq2seq_ts.png574# :align: center575# :alt: model576#577# Image source:578# https://jeddy92.github.io/JEddy92.github.io/ts_seq2seq_intro/579#580581582######################################################################583# Encoder584# ~~~~~~~585#586# The encoder RNN iterates through the input sentence one token587# (e.g. word) at a time, at each time step outputting an “output” vector588# and a “hidden state” vector. The hidden state vector is then passed to589# the next time step, while the output vector is recorded. The encoder590# transforms the context it saw at each point in the sequence into a set591# of points in a high-dimensional space, which the decoder will use to592# generate a meaningful output for the given task.593#594# At the heart of our encoder is a multi-layered Gated Recurrent Unit,595# invented by `Cho et al. <https://arxiv.org/pdf/1406.1078v3.pdf>`__ in596# 2014. We will use a bidirectional variant of the GRU, meaning that there597# are essentially two independent RNNs: one that is fed the input sequence598# in normal sequential order, and one that is fed the input sequence in599# reverse order. The outputs of each network are summed at each time step.600# Using a bidirectional GRU will give us the advantage of encoding both601# past and future contexts.602#603# Bidirectional RNN:604#605# .. figure:: /_static/img/chatbot/RNN-bidirectional.png606# :width: 70%607# :align: center608# :alt: rnn_bidir609#610# Image source: https://colah.github.io/posts/2015-09-NN-Types-FP/611#612# Note that an ``embedding`` layer is used to encode our word indices in613# an arbitrarily sized feature space. For our models, this layer will map614# each word to a feature space of size *hidden_size*. When trained, these615# values should encode semantic similarity between similar meaning words.616#617# Finally, if passing a padded batch of sequences to an RNN module, we618# must pack and unpack padding around the RNN pass using619# ``nn.utils.rnn.pack_padded_sequence`` and620# ``nn.utils.rnn.pad_packed_sequence`` respectively.621#622# **Computation Graph:**623#624# 1) Convert word indexes to embeddings.625# 2) Pack padded batch of sequences for RNN module.626# 3) Forward pass through GRU.627# 4) Unpack padding.628# 5) Sum bidirectional GRU outputs.629# 6) Return output and final hidden state.630#631# **Inputs:**632#633# - ``input_seq``: batch of input sentences; shape=\ *(max_length,634# batch_size)*635# - ``input_lengths``: list of sentence lengths corresponding to each636# sentence in the batch; shape=\ *(batch_size)*637# - ``hidden``: hidden state; shape=\ *(n_layers x num_directions,638# batch_size, hidden_size)*639#640# **Outputs:**641#642# - ``outputs``: output features from the last hidden layer of the GRU643# (sum of bidirectional outputs); shape=\ *(max_length, batch_size,644# hidden_size)*645# - ``hidden``: updated hidden state from GRU; shape=\ *(n_layers x646# num_directions, batch_size, hidden_size)*647#648#649650class EncoderRNN(nn.Module):651def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):652super(EncoderRNN, self).__init__()653self.n_layers = n_layers654self.hidden_size = hidden_size655self.embedding = embedding656657# Initialize GRU; the input_size and hidden_size parameters are both set to 'hidden_size'658# because our input size is a word embedding with number of features == hidden_size659self.gru = nn.GRU(hidden_size, hidden_size, n_layers,660dropout=(0 if n_layers == 1 else dropout), bidirectional=True)661662def forward(self, input_seq, input_lengths, hidden=None):663# Convert word indexes to embeddings664embedded = self.embedding(input_seq)665# Pack padded batch of sequences for RNN module666packed = nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)667# Forward pass through GRU668outputs, hidden = self.gru(packed, hidden)669# Unpack padding670outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs)671# Sum bidirectional GRU outputs672outputs = outputs[:, :, :self.hidden_size] + outputs[:, : ,self.hidden_size:]673# Return output and final hidden state674return outputs, hidden675676677######################################################################678# Decoder679# ~~~~~~~680#681# The decoder RNN generates the response sentence in a token-by-token682# fashion. It uses the encoder’s context vectors, and internal hidden683# states to generate the next word in the sequence. It continues684# generating words until it outputs an *EOS_token*, representing the end685# of the sentence. A common problem with a vanilla seq2seq decoder is that686# if we rely solely on the context vector to encode the entire input687# sequence’s meaning, it is likely that we will have information loss.688# This is especially the case when dealing with long input sequences,689# greatly limiting the capability of our decoder.690#691# To combat this, `Bahdanau et al. <https://arxiv.org/abs/1409.0473>`__692# created an “attention mechanism” that allows the decoder to pay693# attention to certain parts of the input sequence, rather than using the694# entire fixed context at every step.695#696# At a high level, attention is calculated using the decoder’s current697# hidden state and the encoder’s outputs. The output attention weights698# have the same shape as the input sequence, allowing us to multiply them699# by the encoder outputs, giving us a weighted sum which indicates the700# parts of encoder output to pay attention to. `Sean701# Robertson’s <https://github.com/spro>`__ figure describes this very702# well:703#704# .. figure:: /_static/img/chatbot/attn2.png705# :align: center706# :alt: attn2707#708# `Luong et al. <https://arxiv.org/abs/1508.04025>`__ improved upon709# Bahdanau et al.’s groundwork by creating “Global attention”. The key710# difference is that with “Global attention”, we consider all of the711# encoder’s hidden states, as opposed to Bahdanau et al.’s “Local712# attention”, which only considers the encoder’s hidden state from the713# current time step. Another difference is that with “Global attention”,714# we calculate attention weights, or energies, using the hidden state of715# the decoder from the current time step only. Bahdanau et al.’s attention716# calculation requires knowledge of the decoder’s state from the previous717# time step. Also, Luong et al. provides various methods to calculate the718# attention energies between the encoder output and decoder output which719# are called “score functions”:720#721# .. figure:: /_static/img/chatbot/scores.png722# :width: 60%723# :align: center724# :alt: scores725#726# where :math:`h_t` = current target decoder state and :math:`\bar{h}_s` =727# all encoder states.728#729# Overall, the Global attention mechanism can be summarized by the730# following figure. Note that we will implement the “Attention Layer” as a731# separate ``nn.Module`` called ``Attn``. The output of this module is a732# softmax normalized weights tensor of shape *(batch_size, 1,733# max_length)*.734#735# .. figure:: /_static/img/chatbot/global_attn.png736# :align: center737# :width: 60%738# :alt: global_attn739#740741# Luong attention layer742class Attn(nn.Module):743def __init__(self, method, hidden_size):744super(Attn, self).__init__()745self.method = method746if self.method not in ['dot', 'general', 'concat']:747raise ValueError(self.method, "is not an appropriate attention method.")748self.hidden_size = hidden_size749if self.method == 'general':750self.attn = nn.Linear(self.hidden_size, hidden_size)751elif self.method == 'concat':752self.attn = nn.Linear(self.hidden_size * 2, hidden_size)753self.v = nn.Parameter(torch.FloatTensor(hidden_size))754755def dot_score(self, hidden, encoder_output):756return torch.sum(hidden * encoder_output, dim=2)757758def general_score(self, hidden, encoder_output):759energy = self.attn(encoder_output)760return torch.sum(hidden * energy, dim=2)761762def concat_score(self, hidden, encoder_output):763energy = self.attn(torch.cat((hidden.expand(encoder_output.size(0), -1, -1), encoder_output), 2)).tanh()764return torch.sum(self.v * energy, dim=2)765766def forward(self, hidden, encoder_outputs):767# Calculate the attention weights (energies) based on the given method768if self.method == 'general':769attn_energies = self.general_score(hidden, encoder_outputs)770elif self.method == 'concat':771attn_energies = self.concat_score(hidden, encoder_outputs)772elif self.method == 'dot':773attn_energies = self.dot_score(hidden, encoder_outputs)774775# Transpose max_length and batch_size dimensions776attn_energies = attn_energies.t()777778# Return the softmax normalized probability scores (with added dimension)779return F.softmax(attn_energies, dim=1).unsqueeze(1)780781782######################################################################783# Now that we have defined our attention submodule, we can implement the784# actual decoder model. For the decoder, we will manually feed our batch785# one time step at a time. This means that our embedded word tensor and786# GRU output will both have shape *(1, batch_size, hidden_size)*.787#788# **Computation Graph:**789#790# 1) Get embedding of current input word.791# 2) Forward through unidirectional GRU.792# 3) Calculate attention weights from the current GRU output from (2).793# 4) Multiply attention weights to encoder outputs to get new "weighted sum" context vector.794# 5) Concatenate weighted context vector and GRU output using Luong eq. 5.795# 6) Predict next word using Luong eq. 6 (without softmax).796# 7) Return output and final hidden state.797#798# **Inputs:**799#800# - ``input_step``: one time step (one word) of input sequence batch;801# shape=\ *(1, batch_size)*802# - ``last_hidden``: final hidden layer of GRU; shape=\ *(n_layers x803# num_directions, batch_size, hidden_size)*804# - ``encoder_outputs``: encoder model’s output; shape=\ *(max_length,805# batch_size, hidden_size)*806#807# **Outputs:**808#809# - ``output``: softmax normalized tensor giving probabilities of each810# word being the correct next word in the decoded sequence;811# shape=\ *(batch_size, voc.num_words)*812# - ``hidden``: final hidden state of GRU; shape=\ *(n_layers x813# num_directions, batch_size, hidden_size)*814#815816class LuongAttnDecoderRNN(nn.Module):817def __init__(self, attn_model, embedding, hidden_size, output_size, n_layers=1, dropout=0.1):818super(LuongAttnDecoderRNN, self).__init__()819820# Keep for reference821self.attn_model = attn_model822self.hidden_size = hidden_size823self.output_size = output_size824self.n_layers = n_layers825self.dropout = dropout826827# Define layers828self.embedding = embedding829self.embedding_dropout = nn.Dropout(dropout)830self.gru = nn.GRU(hidden_size, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout))831self.concat = nn.Linear(hidden_size * 2, hidden_size)832self.out = nn.Linear(hidden_size, output_size)833834self.attn = Attn(attn_model, hidden_size)835836def forward(self, input_step, last_hidden, encoder_outputs):837# Note: we run this one step (word) at a time838# Get embedding of current input word839embedded = self.embedding(input_step)840embedded = self.embedding_dropout(embedded)841# Forward through unidirectional GRU842rnn_output, hidden = self.gru(embedded, last_hidden)843# Calculate attention weights from the current GRU output844attn_weights = self.attn(rnn_output, encoder_outputs)845# Multiply attention weights to encoder outputs to get new "weighted sum" context vector846context = attn_weights.bmm(encoder_outputs.transpose(0, 1))847# Concatenate weighted context vector and GRU output using Luong eq. 5848rnn_output = rnn_output.squeeze(0)849context = context.squeeze(1)850concat_input = torch.cat((rnn_output, context), 1)851concat_output = torch.tanh(self.concat(concat_input))852# Predict next word using Luong eq. 6853output = self.out(concat_output)854output = F.softmax(output, dim=1)855# Return output and final hidden state856return output, hidden857858859######################################################################860# Define Training Procedure861# -------------------------862#863# Masked loss864# ~~~~~~~~~~~865#866# Since we are dealing with batches of padded sequences, we cannot simply867# consider all elements of the tensor when calculating loss. We define868# ``maskNLLLoss`` to calculate our loss based on our decoder’s output869# tensor, the target tensor, and a binary mask tensor describing the870# padding of the target tensor. This loss function calculates the average871# negative log likelihood of the elements that correspond to a *1* in the872# mask tensor.873#874875def maskNLLLoss(inp, target, mask):876nTotal = mask.sum()877crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))878loss = crossEntropy.masked_select(mask).mean()879loss = loss.to(device)880return loss, nTotal.item()881882883######################################################################884# Single training iteration885# ~~~~~~~~~~~~~~~~~~~~~~~~~886#887# The ``train`` function contains the algorithm for a single training888# iteration (a single batch of inputs).889#890# We will use a couple of clever tricks to aid in convergence:891#892# - The first trick is using **teacher forcing**. This means that at some893# probability, set by ``teacher_forcing_ratio``, we use the current894# target word as the decoder’s next input rather than using the895# decoder’s current guess. This technique acts as training wheels for896# the decoder, aiding in more efficient training. However, teacher897# forcing can lead to model instability during inference, as the898# decoder may not have a sufficient chance to truly craft its own899# output sequences during training. Thus, we must be mindful of how we900# are setting the ``teacher_forcing_ratio``, and not be fooled by fast901# convergence.902#903# - The second trick that we implement is **gradient clipping**. This is904# a commonly used technique for countering the “exploding gradient”905# problem. In essence, by clipping or thresholding gradients to a906# maximum value, we prevent the gradients from growing exponentially907# and either overflow (NaN), or overshoot steep cliffs in the cost908# function.909#910# .. figure:: /_static/img/chatbot/grad_clip.png911# :align: center912# :width: 60%913# :alt: grad_clip914#915# Image source: Goodfellow et al. *Deep Learning*. 2016. https://www.deeplearningbook.org/916#917# **Sequence of Operations:**918#919# 1) Forward pass entire input batch through encoder.920# 2) Initialize decoder inputs as SOS_token, and hidden state as the encoder's final hidden state.921# 3) Forward input batch sequence through decoder one time step at a time.922# 4) If teacher forcing: set next decoder input as the current target; else: set next decoder input as current decoder output.923# 5) Calculate and accumulate loss.924# 6) Perform backpropagation.925# 7) Clip gradients.926# 8) Update encoder and decoder model parameters.927#928#929# .. Note ::930#931# PyTorch’s RNN modules (``RNN``, ``LSTM``, ``GRU``) can be used like any932# other non-recurrent layers by simply passing them the entire input933# sequence (or batch of sequences). We use the ``GRU`` layer like this in934# the ``encoder``. The reality is that under the hood, there is an935# iterative process looping over each time step calculating hidden states.936# Alternatively, you can run these modules one time-step at a time. In937# this case, we manually loop over the sequences during the training938# process like we must do for the ``decoder`` model. As long as you939# maintain the correct conceptual model of these modules, implementing940# sequential models can be very straightforward.941#942#943944945def train(input_variable, lengths, target_variable, mask, max_target_len, encoder, decoder, embedding,946encoder_optimizer, decoder_optimizer, batch_size, clip, max_length=MAX_LENGTH):947948# Zero gradients949encoder_optimizer.zero_grad()950decoder_optimizer.zero_grad()951952# Set device options953input_variable = input_variable.to(device)954target_variable = target_variable.to(device)955mask = mask.to(device)956# Lengths for RNN packing should always be on the CPU957lengths = lengths.to("cpu")958959# Initialize variables960loss = 0961print_losses = []962n_totals = 0963964# Forward pass through encoder965encoder_outputs, encoder_hidden = encoder(input_variable, lengths)966967# Create initial decoder input (start with SOS tokens for each sentence)968decoder_input = torch.LongTensor([[SOS_token for _ in range(batch_size)]])969decoder_input = decoder_input.to(device)970971# Set initial decoder hidden state to the encoder's final hidden state972decoder_hidden = encoder_hidden[:decoder.n_layers]973974# Determine if we are using teacher forcing this iteration975use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False976977# Forward batch of sequences through decoder one time step at a time978if use_teacher_forcing:979for t in range(max_target_len):980decoder_output, decoder_hidden = decoder(981decoder_input, decoder_hidden, encoder_outputs982)983# Teacher forcing: next input is current target984decoder_input = target_variable[t].view(1, -1)985# Calculate and accumulate loss986mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])987loss += mask_loss988print_losses.append(mask_loss.item() * nTotal)989n_totals += nTotal990else:991for t in range(max_target_len):992decoder_output, decoder_hidden = decoder(993decoder_input, decoder_hidden, encoder_outputs994)995# No teacher forcing: next input is decoder's own current output996_, topi = decoder_output.topk(1)997decoder_input = torch.LongTensor([[topi[i][0] for i in range(batch_size)]])998decoder_input = decoder_input.to(device)999# Calculate and accumulate loss1000mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])1001loss += mask_loss1002print_losses.append(mask_loss.item() * nTotal)1003n_totals += nTotal10041005# Perform backpropagation1006loss.backward()10071008# Clip gradients: gradients are modified in place1009_ = nn.utils.clip_grad_norm_(encoder.parameters(), clip)1010_ = nn.utils.clip_grad_norm_(decoder.parameters(), clip)10111012# Adjust model weights1013encoder_optimizer.step()1014decoder_optimizer.step()10151016return sum(print_losses) / n_totals101710181019######################################################################1020# Training iterations1021# ~~~~~~~~~~~~~~~~~~~1022#1023# It is finally time to tie the full training procedure together with the1024# data. The ``trainIters`` function is responsible for running1025# ``n_iterations`` of training given the passed models, optimizers, data,1026# etc. This function is quite self explanatory, as we have done the heavy1027# lifting with the ``train`` function.1028#1029# One thing to note is that when we save our model, we save a tarball1030# containing the encoder and decoder ``state_dicts`` (parameters), the1031# optimizers’ ``state_dicts``, the loss, the iteration, etc. Saving the model1032# in this way will give us the ultimate flexibility with the checkpoint.1033# After loading a checkpoint, we will be able to use the model parameters1034# to run inference, or we can continue training right where we left off.1035#10361037def trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer, embedding, encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size, print_every, save_every, clip, corpus_name, loadFilename):10381039# Load batches for each iteration1040training_batches = [batch2TrainData(voc, [random.choice(pairs) for _ in range(batch_size)])1041for _ in range(n_iteration)]10421043# Initializations1044print('Initializing ...')1045start_iteration = 11046print_loss = 01047if loadFilename:1048start_iteration = checkpoint['iteration'] + 110491050# Training loop1051print("Training...")1052for iteration in range(start_iteration, n_iteration + 1):1053training_batch = training_batches[iteration - 1]1054# Extract fields from batch1055input_variable, lengths, target_variable, mask, max_target_len = training_batch10561057# Run a training iteration with batch1058loss = train(input_variable, lengths, target_variable, mask, max_target_len, encoder,1059decoder, embedding, encoder_optimizer, decoder_optimizer, batch_size, clip)1060print_loss += loss10611062# Print progress1063if iteration % print_every == 0:1064print_loss_avg = print_loss / print_every1065print("Iteration: {}; Percent complete: {:.1f}%; Average loss: {:.4f}".format(iteration, iteration / n_iteration * 100, print_loss_avg))1066print_loss = 010671068# Save checkpoint1069if (iteration % save_every == 0):1070directory = os.path.join(save_dir, model_name, corpus_name, '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size))1071if not os.path.exists(directory):1072os.makedirs(directory)1073torch.save({1074'iteration': iteration,1075'en': encoder.state_dict(),1076'de': decoder.state_dict(),1077'en_opt': encoder_optimizer.state_dict(),1078'de_opt': decoder_optimizer.state_dict(),1079'loss': loss,1080'voc_dict': voc.__dict__,1081'embedding': embedding.state_dict()1082}, os.path.join(directory, '{}_{}.tar'.format(iteration, 'checkpoint')))108310841085######################################################################1086# Define Evaluation1087# -----------------1088#1089# After training a model, we want to be able to talk to the bot ourselves.1090# First, we must define how we want the model to decode the encoded input.1091#1092# Greedy decoding1093# ~~~~~~~~~~~~~~~1094#1095# Greedy decoding is the decoding method that we use during training when1096# we are **NOT** using teacher forcing. In other words, for each time1097# step, we simply choose the word from ``decoder_output`` with the highest1098# softmax value. This decoding method is optimal on a single time-step1099# level.1100#1101# To facilitate the greedy decoding operation, we define a1102# ``GreedySearchDecoder`` class. When run, an object of this class takes1103# an input sequence (``input_seq``) of shape *(input_seq length, 1)*, a1104# scalar input length (``input_length``) tensor, and a ``max_length`` to1105# bound the response sentence length. The input sentence is evaluated1106# using the following computational graph:1107#1108# **Computation Graph:**1109#1110# 1) Forward input through encoder model.1111# 2) Prepare encoder's final hidden layer to be first hidden input to the decoder.1112# 3) Initialize decoder's first input as SOS_token.1113# 4) Initialize tensors to append decoded words to.1114# 5) Iteratively decode one word token at a time:1115# a) Forward pass through decoder.1116# b) Obtain most likely word token and its softmax score.1117# c) Record token and score.1118# d) Prepare current token to be next decoder input.1119# 6) Return collections of word tokens and scores.1120#11211122class GreedySearchDecoder(nn.Module):1123def __init__(self, encoder, decoder):1124super(GreedySearchDecoder, self).__init__()1125self.encoder = encoder1126self.decoder = decoder11271128def forward(self, input_seq, input_length, max_length):1129# Forward input through encoder model1130encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)1131# Prepare encoder's final hidden layer to be first hidden input to the decoder1132decoder_hidden = encoder_hidden[:self.decoder.n_layers]1133# Initialize decoder input with SOS_token1134decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token1135# Initialize tensors to append decoded words to1136all_tokens = torch.zeros([0], device=device, dtype=torch.long)1137all_scores = torch.zeros([0], device=device)1138# Iteratively decode one word token at a time1139for _ in range(max_length):1140# Forward pass through decoder1141decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)1142# Obtain most likely word token and its softmax score1143decoder_scores, decoder_input = torch.max(decoder_output, dim=1)1144# Record token and score1145all_tokens = torch.cat((all_tokens, decoder_input), dim=0)1146all_scores = torch.cat((all_scores, decoder_scores), dim=0)1147# Prepare current token to be next decoder input (add a dimension)1148decoder_input = torch.unsqueeze(decoder_input, 0)1149# Return collections of word tokens and scores1150return all_tokens, all_scores115111521153######################################################################1154# Evaluate my text1155# ~~~~~~~~~~~~~~~~1156#1157# Now that we have our decoding method defined, we can write functions for1158# evaluating a string input sentence. The ``evaluate`` function manages1159# the low-level process of handling the input sentence. We first format1160# the sentence as an input batch of word indexes with *batch_size==1*. We1161# do this by converting the words of the sentence to their corresponding1162# indexes, and transposing the dimensions to prepare the tensor for our1163# models. We also create a ``lengths`` tensor which contains the length of1164# our input sentence. In this case, ``lengths`` is scalar because we are1165# only evaluating one sentence at a time (batch_size==1). Next, we obtain1166# the decoded response sentence tensor using our ``GreedySearchDecoder``1167# object (``searcher``). Finally, we convert the response’s indexes to1168# words and return the list of decoded words.1169#1170# ``evaluateInput`` acts as the user interface for our chatbot. When1171# called, an input text field will spawn in which we can enter our query1172# sentence. After typing our input sentence and pressing *Enter*, our text1173# is normalized in the same way as our training data, and is ultimately1174# fed to the ``evaluate`` function to obtain a decoded output sentence. We1175# loop this process, so we can keep chatting with our bot until we enter1176# either “q” or “quit”.1177#1178# Finally, if a sentence is entered that contains a word that is not in1179# the vocabulary, we handle this gracefully by printing an error message1180# and prompting the user to enter another sentence.1181#11821183def evaluate(encoder, decoder, searcher, voc, sentence, max_length=MAX_LENGTH):1184### Format input sentence as a batch1185# words -> indexes1186indexes_batch = [indexesFromSentence(voc, sentence)]1187# Create lengths tensor1188lengths = torch.tensor([len(indexes) for indexes in indexes_batch])1189# Transpose dimensions of batch to match models' expectations1190input_batch = torch.LongTensor(indexes_batch).transpose(0, 1)1191# Use appropriate device1192input_batch = input_batch.to(device)1193lengths = lengths.to("cpu")1194# Decode sentence with searcher1195tokens, scores = searcher(input_batch, lengths, max_length)1196# indexes -> words1197decoded_words = [voc.index2word[token.item()] for token in tokens]1198return decoded_words119912001201def evaluateInput(encoder, decoder, searcher, voc):1202input_sentence = ''1203while(1):1204try:1205# Get input sentence1206input_sentence = input('> ')1207# Check if it is quit case1208if input_sentence == 'q' or input_sentence == 'quit': break1209# Normalize sentence1210input_sentence = normalizeString(input_sentence)1211# Evaluate sentence1212output_words = evaluate(encoder, decoder, searcher, voc, input_sentence)1213# Format and print response sentence1214output_words[:] = [x for x in output_words if not (x == 'EOS' or x == 'PAD')]1215print('Bot:', ' '.join(output_words))12161217except KeyError:1218print("Error: Encountered unknown word.")121912201221######################################################################1222# Run Model1223# ---------1224#1225# Finally, it is time to run our model!1226#1227# Regardless of whether we want to train or test the chatbot model, we1228# must initialize the individual encoder and decoder models. In the1229# following block, we set our desired configurations, choose to start from1230# scratch or set a checkpoint to load from, and build and initialize the1231# models. Feel free to play with different model configurations to1232# optimize performance.1233#12341235# Configure models1236model_name = 'cb_model'1237attn_model = 'dot'1238#``attn_model = 'general'``1239#``attn_model = 'concat'``1240hidden_size = 5001241encoder_n_layers = 21242decoder_n_layers = 21243dropout = 0.11244batch_size = 6412451246# Set checkpoint to load from; set to None if starting from scratch1247loadFilename = None1248checkpoint_iter = 400012491250#############################################################1251# Sample code to load from a checkpoint:1252#1253# .. code-block:: python1254#1255# loadFilename = os.path.join(save_dir, model_name, corpus_name,1256# '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size),1257# '{}_checkpoint.tar'.format(checkpoint_iter))12581259# Load model if a ``loadFilename`` is provided1260if loadFilename:1261# If loading on same machine the model was trained on1262checkpoint = torch.load(loadFilename)1263# If loading a model trained on GPU to CPU1264#checkpoint = torch.load(loadFilename, map_location=torch.device('cpu'))1265encoder_sd = checkpoint['en']1266decoder_sd = checkpoint['de']1267encoder_optimizer_sd = checkpoint['en_opt']1268decoder_optimizer_sd = checkpoint['de_opt']1269embedding_sd = checkpoint['embedding']1270voc.__dict__ = checkpoint['voc_dict']127112721273print('Building encoder and decoder ...')1274# Initialize word embeddings1275embedding = nn.Embedding(voc.num_words, hidden_size)1276if loadFilename:1277embedding.load_state_dict(embedding_sd)1278# Initialize encoder & decoder models1279encoder = EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout)1280decoder = LuongAttnDecoderRNN(attn_model, embedding, hidden_size, voc.num_words, decoder_n_layers, dropout)1281if loadFilename:1282encoder.load_state_dict(encoder_sd)1283decoder.load_state_dict(decoder_sd)1284# Use appropriate device1285encoder = encoder.to(device)1286decoder = decoder.to(device)1287print('Models built and ready to go!')128812891290######################################################################1291# Run Training1292# ~~~~~~~~~~~~1293#1294# Run the following block if you want to train the model.1295#1296# First we set training parameters, then we initialize our optimizers, and1297# finally we call the ``trainIters`` function to run our training1298# iterations.1299#13001301# Configure training/optimization1302clip = 50.01303teacher_forcing_ratio = 1.01304learning_rate = 0.00011305decoder_learning_ratio = 5.01306n_iteration = 40001307print_every = 11308save_every = 50013091310# Ensure dropout layers are in train mode1311encoder.train()1312decoder.train()13131314# Initialize optimizers1315print('Building optimizers ...')1316encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)1317decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate * decoder_learning_ratio)1318if loadFilename:1319encoder_optimizer.load_state_dict(encoder_optimizer_sd)1320decoder_optimizer.load_state_dict(decoder_optimizer_sd)13211322# If you have an accelerator, configure it to call1323for state in encoder_optimizer.state.values():1324for k, v in state.items():1325if isinstance(v, torch.Tensor):1326state[k] = v.to(device)13271328for state in decoder_optimizer.state.values():1329for k, v in state.items():1330if isinstance(v, torch.Tensor):1331state[k] = v.to(device)13321333# Run training iterations1334print("Starting Training!")1335trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer,1336embedding, encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size,1337print_every, save_every, clip, corpus_name, loadFilename)133813391340######################################################################1341# Run Evaluation1342# ~~~~~~~~~~~~~~1343#1344# To chat with your model, run the following block.1345#13461347# Set dropout layers to ``eval`` mode1348encoder.eval()1349decoder.eval()13501351# Initialize search module1352searcher = GreedySearchDecoder(encoder, decoder)13531354# Begin chatting (uncomment and run the following line to begin)1355# evaluateInput(encoder, decoder, searcher, voc)135613571358######################################################################1359# Conclusion1360# ----------1361#1362# That’s all for this one, folks. Congratulations, you now know the1363# fundamentals to building a generative chatbot model! If you’re1364# interested, you can try tailoring the chatbot’s behavior by tweaking the1365# model and training parameters and customizing the data that you train1366# the model on.1367#1368# Check out the other tutorials for more cool deep learning applications1369# in PyTorch!1370#137113721373