Path: blob/master/keras/text_classification/keras_subword_tokenization.ipynb
Subword Tokenization for Text Classification
In this notebook, we will be experimenting with subword tokenization. Tokenization is often one of the first mandatory steps in an NLP task, where we break a piece of text down into meaningful individual units/tokens.
There're three major ways of performing tokenization.
Character Level
Treats each character (or unicode) as one individual token.
Pros: This approach requires the least amount of preprocessing.
Cons: The downstream model needs to learn the relative positions of characters, long-range dependencies, spelling, etc., making it harder to achieve good performance.
Word Level
Performs word segmentation on top of our text data.
Pros: Words are how we as humans process textual information.
Cons: The correctness of the segmentation is highly dependent on the software we're using. e.g. spaCy's tokenizer applies language-specific rules to segment the original text into words. Also, word-level tokenization can't handle unseen words (a.k.a. out-of-vocabulary words) and performs poorly on rare words.
The blog post Language modeling a billion words also shares some thoughts comparing character-based versus word-based tokenization. Taken directly from the post:
Word-level models have an important advantage over char-level models. Take the following sequence as an example (a quote from Robert A. Heinlein):
Progress isn't made by early risers. It's made by lazy men trying to find easier ways to do something.
After tokenization, the word-level model might view this sequence as containing 22 tokens. On the other hand, the char-level will view this sequence as containing 102 tokens. This longer sequence makes the task of the character model harder than the word model, as it must take into account dependencies between more tokens over more time-steps. Another issue with character language models is that they need to learn spelling in addition to syntax, semantics, etc. In any case, word language models will typically have lower error than character models.
The main advantage of character over word language models is that they have a really small vocabulary. For example, the GBW dataset will contain approximately 800 characters compared to 800,000 words (after pruning low-frequency tokens). In practice this means that character models will require less memory and have faster inference than their word counterparts. Another advantage is that they do not require tokenization as a preprocessing step.
Subword Level
As we can probably imagine, subword level is somewhere between character level and word level, hence it tries to bring in the pros (being able to handle out-of-vocabulary or rare words better) and mitigate the drawbacks (being too fine-grained for downstream tasks) of both approaches. With subword level, what we are aiming for is to represent an open vocabulary through a fixed-sized vocabulary of variable-length character sequences. e.g. the word highest might be segmented into the subwords high and est.
There're many different methods for generating these subwords. e.g.
A naive way is to brute-force generate the subwords by sliding a fixed-size window through the word, e.g. highest -> hig, igh, ghe, etc. (see the sketch after this list).
More clever approaches include Byte Pair Encoding and unigram language models. We won't be covering the internals of these approaches here. There's another document that goes more in-depth into Byte Pair Encoding and sentencepiece, the open-sourced package that we'll be using here to experiment with subword tokenization.
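To make the naive sliding-window idea concrete, here's a minimal sketch (the helper name char_ngrams and the window size of 3 are just for illustration):

```python
# A minimal sketch of the naive fixed-size sliding window approach.
# The helper name char_ngrams is made up for illustration.
def char_ngrams(word, n=3):
    """Generate all contiguous character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams('highest'))  # ['hig', 'igh', 'ghe', 'hes', 'est']
```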
Data Preprocessing
We'll use the movie review sentiment analysis dataset from Kaggle for this example. It's a binary classification problem with AUC as the ultimate evaluation metric. The next few code chunks perform the usual text preprocessing, build up the word vocabulary, and perform a train/test split.
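As a rough sketch of what that flow looks like (the file name, column names 'review' / 'sentiment', and split parameters below are assumptions, not the exact code used):

```python
# A minimal sketch of loading the data and creating the train/test split.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('labeledTrainData.tsv', sep='\t')  # assumed file/column names
texts = df['review'].str.lower().tolist()           # basic text normalization
labels = df['sentiment'].values

train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=1234, stratify=labels)
```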
Model
To train our text classifier, we specify a 1D convolutional network. The comparison we'll be running is whether a subword-level model gives better performance than a word-level model.
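A minimal sketch of what such a 1D convolutional classifier might look like in keras (the vocabulary size, embedding dimension, sequence length, and layer sizes below are assumed placeholders):

```python
# A minimal sketch of a 1D convolutional text classifier.
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

vocab_size = 20000   # assumed vocabulary size
embed_dim = 100      # assumed embedding dimension
max_len = 400        # assumed padded sequence length

model = Sequential([
    Embedding(vocab_size, embed_dim, input_length=max_len),
    Conv1D(filters=128, kernel_size=5, activation='relu'),
    GlobalMaxPooling1D(),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')   # binary sentiment output
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```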
Subword-Level Tokenizer
The next couple of code chunks train the subword vocabulary, encode our original text into these subwords, and pad the sequences to a fixed length.
Note that the pad_sequences function from keras assumes that index 0 is reserved for padding, hence when learning the subword vocabulary with sentencepiece, we make sure to keep the index consistent.
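A minimal sketch of that flow with sentencepiece, where we explicitly set pad_id=0 so the subword indices stay consistent with pad_sequences (the input file name and vocabulary size are assumptions):

```python
# A minimal sketch of learning a subword vocabulary with sentencepiece and
# encoding the text. pad_id=0 keeps the padding index consistent with keras'
# pad_sequences, which pads with 0. The input file (one document per line)
# and vocab size are assumptions.
import sentencepiece as spm
from keras.preprocessing.sequence import pad_sequences

spm.SentencePieceTrainer.Train(
    '--input=sp_input.txt --model_prefix=sp --vocab_size=8000 '
    '--pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3')

sp = spm.SentencePieceProcessor()
sp.Load('sp.model')

subword_ids = [sp.EncodeAsIds(text) for text in train_texts]
X_train_subword = pad_sequences(subword_ids, maxlen=400)  # 0 = padding index
```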
Word-Level Tokenizer
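For reference, a minimal sketch of the word-level counterpart using keras' Tokenizer, which already reserves index 0 for padding (the vocabulary size and sequence length are the same assumed placeholders as before):

```python
# A minimal sketch of word-level tokenization with keras' Tokenizer.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

word_tokenizer = Tokenizer(num_words=20000)
word_tokenizer.fit_on_texts(train_texts)

word_ids = word_tokenizer.texts_to_sequences(train_texts)
X_train_word = pad_sequences(word_ids, maxlen=400)  # index 0 is reserved for padding
```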
Submission
For the submission section, we read in and preprocess the test data provided by the competition, then generate the predicted probability column for both the model that uses word-level tokenization and the one that uses subword tokenization to compare their performance.
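A rough sketch of that step (the test file name, column names, and the trained model variables word_model / subword_model are assumptions):

```python
# A minimal sketch of scoring the competition test data with both models.
test_df = pd.read_csv('testData.tsv', sep='\t')   # assumed file/column names
test_texts = test_df['review'].str.lower().tolist()

X_test_word = pad_sequences(word_tokenizer.texts_to_sequences(test_texts), maxlen=400)
X_test_subword = pad_sequences([sp.EncodeAsIds(t) for t in test_texts], maxlen=400)

# word_model and subword_model stand in for the two trained classifiers
test_df['word_pred'] = word_model.predict(X_test_word).ravel()
test_df['subword_pred'] = subword_model.predict(X_test_subword).ravel()
```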
Summary
We've looked at the performance of leveraging subword tokenization for our text classification task. Note that some other ideas that we did not try out are:
Use other word-level tokenizers. Another popular choice at the time of writing this documentation is spaCy's tokenizer (see the short sketch after this list).
Sentencepiece claims that it can be trained on raw text without the need to perform language-specific segmentation beforehand (e.g. running the spaCy tokenizer on our raw text data before feeding it to sentencepiece to learn the subword vocabulary). We can conduct our own experiment on the task at hand to verify that claim. Sentencepiece also includes an experiments page that documents some of the experiments they've conducted.
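For completeness, a minimal sketch of word-level tokenization with spaCy (assuming the en_core_web_sm model is installed):

```python
# A minimal sketch of spaCy's word-level tokenization, disabling the
# pipeline components we don't need for plain tokenization.
import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'ner'])
tokens = [token.text for token in nlp("Progress isn't made by early risers.")]
print(tokens)  # ['Progress', 'is', "n't", 'made', 'by', 'early', 'risers', '.']
```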