You can find the PyTorch implementation of this notebook here: https://colab.research.google.com/github/probml/pyprobml/blob/master/notebooks/book1/01/text_preproc_torch.ipynb
Text preprocessing
We discuss how to convert a sequence of words or characters into numeric form, which can then be fed into an ML model.
Basics
This section is based on sec 8.2 of http://d2l.ai/chapter_recurrent-neural-networks/text-preprocessing.html
Data
As a simple example, we use the book "The Time Machine" by H. G. Wells, since it is short (30k words) and in the public domain.
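A minimal loading sketch, assuming the raw text has been downloaded to a local file `timemachine.txt` (the notebook's actual download helper may differ):

```python
import re

def read_time_machine(path='timemachine.txt'):
    """Load 'The Time Machine' and reduce each line to lowercase letters."""
    with open(path) as f:
        lines = f.readlines()
    # Strip anything that is not a letter, and lowercase the rest.
    return [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines]

lines = read_time_machine()
print(f'# text lines: {len(lines)}')
print(lines[0])
```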
Tokenization
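A simple tokenizer in the spirit of the d2l.ai helper splits each line into either words or characters. A sketch, reusing `lines` from above:

```python
def tokenize(lines, token='word'):
    """Split each text line into word or character tokens."""
    if token == 'word':
        return [line.split() for line in lines]
    elif token == 'char':
        return [list(line) for line in lines]
    raise ValueError(f'unknown token type: {token}')

tokens = tokenize(lines, token='word')
for line in tokens[:3]:
    print(line)
```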
Vocabulary
We map each word to a unique integer id, sorted by decreasing frequency. We reserve the special id of 0 for the "unknown word". We also allow for a list of reserved tokens, such as "pad" for padding, "bos" to represent the beginning of a sequence, and "eos" for the end of a sequence.
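A sketch of such a `Vocab` class, modeled on the d2l.ai design (details may differ from the notebook's actual implementation):

```python
import collections

def count_corpus(tokens):
    """Count token frequencies over a flat or nested list of tokens."""
    if tokens and isinstance(tokens[0], list):
        tokens = [tok for line in tokens for tok in line]
    return collections.Counter(tokens)

class Vocab:
    """Map tokens to integer ids, sorted by decreasing frequency."""
    def __init__(self, tokens=None, min_freq=0, reserved_tokens=None):
        tokens = tokens or []
        reserved_tokens = reserved_tokens or []
        counter = count_corpus(tokens)
        self.token_freqs = sorted(counter.items(), key=lambda x: x[1], reverse=True)
        # Id 0 is reserved for the unknown token; reserved tokens come next.
        self.idx_to_token = ['<unk>'] + reserved_tokens
        self.token_to_idx = {tok: i for i, tok in enumerate(self.idx_to_token)}
        for token, freq in self.token_freqs:
            if freq < min_freq:
                break
            if token not in self.token_to_idx:
                self.idx_to_token.append(token)
                self.token_to_idx[token] = len(self.idx_to_token) - 1

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(tok) for tok in tokens]

    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[i] for i in indices]

    @property
    def unk(self):
        return 0
```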
Here are the top 10 words (and their codes) in our corpus.
Here is a tokenization of a few sentences.
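With the `tokenize` and `Vocab` sketches above, code along these lines would produce such output:

```python
vocab = Vocab(tokens)

# Top 10 most frequent words and their integer codes.
print([(tok, vocab[tok]) for tok, _ in vocab.token_freqs[:10]])

# Encode a couple of lines of the corpus.
for i in [0, 10]:
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])
```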
Putting it all together
We tokenize the corpus at the character level, and return the sequence of integers, as well as the corresponding Vocab object.
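A sketch following the d2l.ai `load_corpus_time_machine` helper, combining the loading, tokenization, and vocabulary steps above:

```python
def load_corpus_time_machine(max_tokens=-1):
    """Return character-level token ids of 'The Time Machine' and its Vocab."""
    lines = read_time_machine()
    tokens = tokenize(lines, token='char')
    vocab = Vocab(tokens)
    # Flatten into a single stream of ids, since the book is one long sequence.
    corpus = [vocab[tok] for line in tokens for tok in line]
    if max_tokens > 0:
        corpus = corpus[:max_tokens]
    return corpus, vocab

corpus, vocab = load_corpus_time_machine()
len(corpus), len(vocab)
```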
One-hot encodings
We can convert a sequence of $N$ integers into an $N \times V$ one-hot matrix, where $V$ is the vocabulary size.
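In JAX this can be done with `jax.nn.one_hot`. A small sketch, reusing `corpus` and `vocab` from above:

```python
import jax.numpy as jnp
from jax import nn

V = len(vocab)                  # vocabulary size
seq = jnp.array(corpus[:5])     # first 5 token ids
onehot = nn.one_hot(seq, V)     # shape (N, V)
print(onehot.shape)
```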
Language modeling
When fitting language models, we often need to chop up a long sequence into a set of short sequences, which may be overlapping, as shown below, where we extract subsequences of length $n$.
Below we show how to do this.
This section is based on sec 8.3.4 of http://d2l.ai/chapter_recurrent-neural-networks/language-models-and-dataset.html#reading-long-sequence-data
Random ordering
To increase the variety of the data, we can start the extraction at a random offset. We can thus create a random sequence data iterator, as follows.
For example, let us generate the sequence 0, 1, ..., 34, and then extract subsequences of length 5. Each minibatch will have 2 such subsequences, starting at random offsets. There is no ordering between the subsequences, either within or across minibatches. There are 6 such subsequences, so the iterator will generate 3 minibatches, each of size 2.
For language modeling tasks, we define $\mathbf{x}$ to be the first $n$ tokens of a subsequence, and $\mathbf{y}$ to be the input shifted by one position, so that $y_t$ is the $(t+1)$'th token, which is the one to be predicted.
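A sketch of such an iterator, modeled on the d2l.ai `seq_data_iter_random` (the exact code in the notebook may differ):

```python
import random
import jax.numpy as jnp

def seq_data_iter_random(corpus, batch_size, num_steps):
    """Generate minibatches of subsequences using random sampling."""
    # Start at a random offset so different epochs see different subsequences.
    corpus = corpus[random.randint(0, num_steps - 1):]
    # Subtract 1 because the targets are the inputs shifted by one position.
    num_subseqs = (len(corpus) - 1) // num_steps
    initial_indices = list(range(0, num_subseqs * num_steps, num_steps))
    random.shuffle(initial_indices)

    def data(pos):
        return corpus[pos: pos + num_steps]

    num_batches = num_subseqs // batch_size
    for i in range(0, batch_size * num_batches, batch_size):
        batch_indices = initial_indices[i: i + batch_size]
        X = [data(j) for j in batch_indices]
        Y = [data(j + 1) for j in batch_indices]
        yield jnp.array(X), jnp.array(Y)

my_seq = list(range(35))
for X, Y in seq_data_iter_random(my_seq, batch_size=2, num_steps=5):
    print('X:', X, '\nY:', Y)
```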
Sequential ordering
We can also require that the $i$'th subsequence in minibatch $b$ follows the $i$'th subsequence in minibatch $b-1$. This is useful when training RNNs, since when the model encounters batch $b$, the hidden state of the model will already be initialized by the last token of sequence $i$ in batch $b-1$.
Below we give an example. We see that the first subsequence in batch 1 is [0,1,2,3,4], and the first subsequence in batch 2 is [5,6,7,8,9], as desired.
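A sketch modeled on the d2l.ai `seq_data_iter_sequential`; note that because of the random starting offset, the printed subsequences only start at 0 when the offset happens to be 0:

```python
import random
import jax.numpy as jnp

def seq_data_iter_sequential(corpus, batch_size, num_steps):
    """Generate minibatches of subsequences that are contiguous across batches."""
    offset = random.randint(0, num_steps)
    num_tokens = ((len(corpus) - offset - 1) // batch_size) * batch_size
    Xs = jnp.array(corpus[offset: offset + num_tokens]).reshape(batch_size, -1)
    Ys = jnp.array(corpus[offset + 1: offset + 1 + num_tokens]).reshape(batch_size, -1)
    num_batches = Xs.shape[1] // num_steps
    for i in range(0, num_steps * num_batches, num_steps):
        yield Xs[:, i: i + num_steps], Ys[:, i: i + num_steps]

for X, Y in seq_data_iter_sequential(my_seq, batch_size=2, num_steps=5):
    print('X:', X, '\nY:', Y)
```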
Data iterator
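A minimal wrapper, modeled on the d2l.ai `SeqDataLoader`, that selects between the two sampling schemes (reusing the sketches above):

```python
class SeqDataLoader:
    """Iterate over the Time Machine corpus with random or sequential sampling."""
    def __init__(self, batch_size, num_steps, use_random_iter, max_tokens):
        self.data_iter_fn = (seq_data_iter_random if use_random_iter
                             else seq_data_iter_sequential)
        self.corpus, self.vocab = load_corpus_time_machine(max_tokens)
        self.batch_size, self.num_steps = batch_size, num_steps

    def __iter__(self):
        return self.data_iter_fn(self.corpus, self.batch_size, self.num_steps)

def load_data_time_machine(batch_size, num_steps, use_random_iter=False, max_tokens=10000):
    """Return the data iterator and the vocabulary of the Time Machine corpus."""
    data_iter = SeqDataLoader(batch_size, num_steps, use_random_iter, max_tokens)
    return data_iter, data_iter.vocab
```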
Machine translation
When dealing with sequence-to-sequence tasks, such as neural machine translation (NMT), we need to create a vocabulary for the source and target languages. In addition, the input and output sequences may have different lengths, so we need to use padding to ensure that we can create fixed-size minibatches. We show how to do this below.
This is based on sec 9.5 of http://d2l.ai/chapter_recurrent-modern/machine-translation-and-dataset.html
Data
We use an English-French dataset that consists of bilingual sentence pairs from the Tatoeba Project. Each line in the dataset is a tab-delimited pair of an English text sequence (source) and the translated French text sequence (target).
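A loading sketch, assuming the extracted dataset file `fra.txt` (tab-separated English/French pairs) is available locally:

```python
def read_data_nmt(path='fra.txt'):
    """Load the English-French dataset as one raw string."""
    with open(path, encoding='utf-8') as f:
        return f.read()

raw_text = read_data_nmt()
print(raw_text[:75])
```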
Preprocessing
We apply several preprocessing steps: we replace non-breaking spaces with regular spaces, convert uppercase letters to lowercase, and insert a space between words and punctuation marks.
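A sketch of such a preprocessing function, in the style of the d2l.ai `preprocess_nmt`:

```python
def preprocess_nmt(text):
    """Lowercase, normalize spaces, and put a space before punctuation."""
    def no_space(char, prev_char):
        return char in set(',.!?') and prev_char != ' '

    # Replace non-breaking spaces with ordinary spaces and lowercase everything.
    text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()
    # Insert a space between words and the punctuation marks , . ! ?
    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
           for i, char in enumerate(text)]
    return ''.join(out)

text = preprocess_nmt(raw_text)
print(text[:80])
```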
We tokenize at the word level. The following tokenize_nmt function tokenizes the first num_examples text sequence pairs, where each token is either a word or a punctuation mark.
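A sketch of `tokenize_nmt` along those lines:

```python
def tokenize_nmt(text, num_examples=None):
    """Tokenize the first num_examples tab-separated (source, target) pairs."""
    source, target = [], []
    for i, line in enumerate(text.split('\n')):
        if num_examples and i > num_examples:
            break
        parts = line.split('\t')
        if len(parts) == 2:
            source.append(parts[0].split(' '))
            target.append(parts[1].split(' '))
    return source, target

source, target = tokenize_nmt(text)
source[:3], target[:3]
```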
Vocabulary
We can build a source and a target vocabulary. To avoid having too many unique tokens, we specify a minimum frequency of 2; any rarer token gets replaced by "unk". We also add special tokens for padding, beginning of sentence, and end of sentence.
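With the `Vocab` sketch from earlier and the tokenized `source`/`target` lists above, this might look like:

```python
src_vocab = Vocab(source, min_freq=2, reserved_tokens=['<pad>', '<bos>', '<eos>'])
tgt_vocab = Vocab(target, min_freq=2, reserved_tokens=['<pad>', '<bos>', '<eos>'])
len(src_vocab), len(tgt_vocab)
```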
Truncation and padding
To create minibatches of sequences, all of the same length, we truncate sentences that are too long, and pad ones that are too short.
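A `truncate_pad` sketch (reusing `src_vocab` and `source` from above):

```python
def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad a token-id sequence to exactly num_steps entries."""
    if len(line) > num_steps:
        return line[:num_steps]                                 # truncate
    return line + [padding_token] * (num_steps - len(line))     # pad

truncate_pad(src_vocab[source[0]], 10, src_vocab['<pad>'])
```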
Data iterator
Below we combine all of the above pieces into a handy function.
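A sketch that chains the pieces above; unlike the notebook's version it uses a simple non-shuffling iterator, purely for illustration:

```python
import jax.numpy as jnp

def build_array_nmt(lines, vocab, num_steps):
    """Convert token lists into a padded id array plus the valid lengths."""
    lines = [vocab[l] for l in lines]
    lines = [l + [vocab['<eos>']] for l in lines]
    array = jnp.array([truncate_pad(l, num_steps, vocab['<pad>']) for l in lines])
    valid_len = (array != vocab['<pad>']).astype(jnp.int32).sum(axis=1)
    return array, valid_len

def load_data_nmt(batch_size, num_steps, num_examples=600):
    """Return a minibatch iterator and the source/target vocabularies."""
    text = preprocess_nmt(read_data_nmt())
    source, target = tokenize_nmt(text, num_examples)
    src_vocab = Vocab(source, min_freq=2, reserved_tokens=['<pad>', '<bos>', '<eos>'])
    tgt_vocab = Vocab(target, min_freq=2, reserved_tokens=['<pad>', '<bos>', '<eos>'])
    src_array, src_valid_len = build_array_nmt(source, src_vocab, num_steps)
    tgt_array, tgt_valid_len = build_array_nmt(target, tgt_vocab, num_steps)

    def data_iter():
        n = src_array.shape[0]
        for i in range(0, n - batch_size + 1, batch_size):
            yield (src_array[i:i + batch_size], src_valid_len[i:i + batch_size],
                   tgt_array[i:i + batch_size], tgt_valid_len[i:i + batch_size])

    return data_iter(), src_vocab, tgt_vocab
```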
Show the first minibatch.
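For instance, using the sketch above:

```python
train_iter, src_vocab, tgt_vocab = load_data_nmt(batch_size=2, num_steps=8)
for X, X_valid_len, Y, Y_valid_len in train_iter:
    print('X:', X)
    print('valid lengths for X:', X_valid_len)
    print('Y:', Y)
    print('valid lengths for Y:', Y_valid_len)
    break
```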