Path: blob/master/keras/text_classification/keras_subword_tokenization.ipynb
Subword Tokenization for Text Classification
In this notebook, we will be experimenting with subword tokenization. Tokenization is often one of the first mandatory steps in an NLP task, where we break a piece of text down into meaningful individual units/tokens.
There're three major ways of performing tokenization.
Character Level
Treats each character (or unicode) as one individual token.
Pros: This approach requires the least amount of preprocessing.
Cons: The downstream model needs to learn the relative positions of characters, long-range dependencies, spelling, etc., making it harder to achieve good performance.
Word Level
Performs word segmentation on top of our text data.
Pros: Words are how we as humans process textual information.
Cons: The correctness of the segmentation is highly dependent on the software we're using. e.g. spaCy's tokenizer applies language-specific rules to segment the original text into words. Also, word-level tokenization can't handle unseen words (a.k.a. out-of-vocabulary words) and performs poorly on rare words.
The blog post Language modeling a billion words also shares some thoughts comparing character-based versus word-based tokenization. Taken directly from the post:
Word-level models have an important advantage over char-level models. Take the following sequence as an example (a quote from Robert A. Heinlein):
Progress isn't made by early risers. It's made by lazy men trying to find easier ways to do something.
After tokenization, the word-level model might view this sequence as containing 22 tokens. On the other hand, the char-level will view this sequence as containing 102 tokens. This longer sequence makes the task of the character model harder than the word model, as it must take into account dependencies between more tokens over more time-steps. Another issue with character language models is that they need to learn spelling in addition to syntax, semantics, etc. In any case, word language models will typically have lower error than character models.
The main advantage of character over word language models is that they have a really small vocabulary. For example, the GBW dataset will contain approximately 800 characters compared to 800,000 words (after pruning low-frequency tokens). In practice this means that character models will require less memory and have faster inference than their word counterparts. Another advantage is that they do not require tokenization as a preprocessing step.
Subword Level
As we can probably imagine, subword level is somewhere between character level and word level, hence it tries to bring in the pros (being able to handle out-of-vocabulary or rare words better) and mitigate the drawbacks (being too fine-grained for downstream tasks) of both approaches. With subword level, what we are aiming for is to represent an open vocabulary through a fixed-sized vocabulary of variable-length character sequences. e.g. the word highest might be segmented into the subwords high and est.
There're many different methods for generating these subwords. e.g.
A naive way is to brute-force generate the subwords by sliding a fixed-size window through the word, e.g. highest -> hig, igh, ghe, etc. (see the sketch after this list).
More clever approaches include Byte Pair Encoding and unigram language models. We won't be covering the internals of these approaches here. There's another document that goes more in-depth into Byte Pair Encoding and sentencepiece, the open-sourced package that we'll be using here to experiment with subword tokenization.
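To make the naive sliding-window idea concrete, here's a minimal sketch (the helper name char_ngrams and the window size of 3 are just for illustration):

```python
# A minimal sketch of the naive fixed-size sliding window approach.
# The helper name char_ngrams is made up for illustration.
def char_ngrams(word, n=3):
    """Generate all contiguous character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams('highest'))  # ['hig', 'igh', 'ghe', 'hes', 'est']
```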
Data Preprocessing
We'll use the movie review sentiment analysis dataset from Kaggle for this example. It's a binary classification problem with AUC as the ultimate evaluation metric. The next few code chunks perform the usual text preprocessing, build up the word vocabulary, and perform a train/test split.
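As a rough sketch of what that flow looks like (the file name, column names 'review' / 'sentiment', and split parameters below are assumptions, not the exact code used):

```python
# A minimal sketch of loading the data and creating the train/test split.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('labeledTrainData.tsv', sep='\t')  # assumed file/column names
texts = df['review'].str.lower().tolist()           # basic text normalization
labels = df['sentiment'].values

train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=1234, stratify=labels)
```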
Model
To train our text classifier, we specify a 1D convolutional network. The comparison we'll be running is whether a subword-level model gives better performance than a word-level model.
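A minimal sketch of what such a 1D convolutional classifier might look like in keras (the vocabulary size, embedding dimension, sequence length, and layer sizes below are assumed placeholders):

```python
# A minimal sketch of a 1D convolutional text classifier.
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

vocab_size = 20000   # assumed vocabulary size
embed_dim = 100      # assumed embedding dimension
max_len = 400        # assumed padded sequence length

model = Sequential([
    Embedding(vocab_size, embed_dim, input_length=max_len),
    Conv1D(filters=128, kernel_size=5, activation='relu'),
    GlobalMaxPooling1D(),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')   # binary sentiment output
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```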
Subword-Level Tokenizer
The next couple of code chunks train the subword vocabulary, encode our original text into these subwords, and pad the sequences to a fixed length.
Note that the pad_sequences function from keras assumes that index 0 is reserved for padding, hence when learning the subword vocabulary with sentencepiece, we make sure to keep the index consistent.
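A minimal sketch of that flow with sentencepiece, where we explicitly set pad_id=0 so the subword indices stay consistent with pad_sequences (the input file name and vocabulary size are assumptions):

```python
# A minimal sketch of learning a subword vocabulary with sentencepiece and
# encoding the text. pad_id=0 keeps the padding index consistent with keras'
# pad_sequences, which pads with 0. The input file (one document per line)
# and vocab size are assumptions.
import sentencepiece as spm
from keras.preprocessing.sequence import pad_sequences

spm.SentencePieceTrainer.Train(
    '--input=sp_input.txt --model_prefix=sp --vocab_size=8000 '
    '--pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3')

sp = spm.SentencePieceProcessor()
sp.Load('sp.model')

subword_ids = [sp.EncodeAsIds(text) for text in train_texts]
X_train_subword = pad_sequences(subword_ids, maxlen=400)  # 0 = padding index
```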
Word-Level Tokenizer
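For reference, a minimal sketch of the word-level counterpart using keras' Tokenizer, which already reserves index 0 for padding (the vocabulary size and sequence length are the same assumed placeholders as before):

```python
# A minimal sketch of word-level tokenization with keras' Tokenizer.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

word_tokenizer = Tokenizer(num_words=20000)
word_tokenizer.fit_on_texts(train_texts)

word_ids = word_tokenizer.texts_to_sequences(train_texts)
X_train_word = pad_sequences(word_ids, maxlen=400)  # index 0 is reserved for padding
```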
Submission
For the submission section, we read in and preprocess the test data provided by the competition, then generate the predicted probability column for both the model that uses word-level tokenization and the one that uses subword tokenization to compare their performance.
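A rough sketch of that step (the test file name, column names, and the trained model variables word_model / subword_model are assumptions):

```python
# A minimal sketch of scoring the competition test data with both models.
test_df = pd.read_csv('testData.tsv', sep='\t')   # assumed file/column names
test_texts = test_df['review'].str.lower().tolist()

X_test_word = pad_sequences(word_tokenizer.texts_to_sequences(test_texts), maxlen=400)
X_test_subword = pad_sequences([sp.EncodeAsIds(t) for t in test_texts], maxlen=400)

# word_model and subword_model stand in for the two trained classifiers
test_df['word_pred'] = word_model.predict(X_test_word).ravel()
test_df['subword_pred'] = subword_model.predict(X_test_subword).ravel()
```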
Summary
We've looked at the performance of leveraging subword tokenization for our text classification task. Note that some other ideas that we did not try out are:
Use other word-level tokenizers. Another popular choice at the time of writing this documentation is spaCy's tokenizer (see the short sketch after this list).
Sentencepiece claims that it can be trained on raw text without the need to perform language-specific segmentation beforehand (e.g. running the spaCy tokenizer on our raw text data before feeding it to sentencepiece to learn the subword vocabulary). We can conduct our own experiment on the task at hand to verify that claim. Sentencepiece also includes an experiments page that documents some of the experiments they've conducted.
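For completeness, a minimal sketch of word-level tokenization with spaCy (assuming the en_core_web_sm model is installed):

```python
# A minimal sketch of spaCy's word-level tokenization, disabling the
# pipeline components we don't need for plain tokenization.
import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'ner'])
tokens = [token.text for token in nlp("Progress isn't made by early risers.")]
print(tokens)  # ['Progress', 'is', "n't", 'made', 'by', 'early', 'risers', '.']
```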