GitHub Repository: codebasics/deep-learning-keras-tf-tutorial
Path: blob/master/47_BERT_text_classification/BERT_email_classification-handle-imbalance.ipynb
Kernel: Python 3

BERT tutorial: Classify spam vs. non-spam emails

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # registers the custom ops needed by the BERT preprocessing layer

Import the dataset (the dataset is taken from Kaggle)

import pandas as pd

df = pd.read_csv("spam.csv")
df.head(5)
df.groupby('Category').describe()
df['Category'].value_counts()
ham     4825
spam     747
Name: Category, dtype: int64
747/4825
0.15481865284974095

Roughly 15% of the emails are spam and 85% are ham, which indicates class imbalance. We handle this below by downsampling the majority (ham) class.

df_spam = df[df['Category']=='spam']
df_spam.shape
(747, 2)
df_ham = df[df['Category']=='ham']
df_ham.shape
(4825, 2)
df_ham_downsampled = df_ham.sample(df_spam.shape[0])
df_ham_downsampled.shape
(747, 2)
df_balanced = pd.concat([df_ham_downsampled, df_spam])
df_balanced.shape
(1494, 2)
df_balanced['Category'].value_counts()
spam    747
ham     747
Name: Category, dtype: int64
df_balanced['spam'] = df_balanced['Category'].apply(lambda x: 1 if x=='spam' else 0)
df_balanced.sample(5)
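
Downsampling throws away most of the ham messages. An alternative way to handle the imbalance, not used in this notebook, is to keep every row and pass class weights to model.fit so that errors on the rarer spam class cost more. A minimal sketch (the class_weight argument is standard Keras; the weights follow the usual "balanced" heuristic):

# Sketch of a class-weight alternative (not used below): keep the full df
n_ham = (df['Category'] == 'ham').sum()    # 4825
n_spam = (df['Category'] == 'spam').sum()  # 747
class_weight = {0: len(df) / (2 * n_ham), 1: len(df) / (2 * n_spam)}
# Later: model.fit(X_train, y_train, epochs=10, class_weight=class_weight)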

Split it into training and test datasets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_balanced['Message'], df_balanced['spam'], stratify=df_balanced['spam'])
X_train.head(4)
3354    I emailed yifeng my part oredi.. Can ü get it ...
466     great princess! I love giving and receiving or...
4154    URGENT!! Your 4* Costa Del Sol Holiday or £500...
3162    Mystery solved! Just opened my email and he's ...
Name: Message, dtype: object

Now let's import the BERT model and get embedding vectors for a few sample statements

bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")
def get_sentence_embedding(sentences):
    preprocessed_text = bert_preprocess(sentences)
    return bert_encoder(preprocessed_text)['pooled_output']

get_sentence_embedding([
    "500$ discount. hurry up",
    "Bhavin, are you up for a volleybal game tomorrow?"
])
<tf.Tensor: shape=(2, 768), dtype=float32, numpy=
array([[-0.8435169 , -0.51327276, -0.8884574 , ..., -0.74748874, -0.75314736,  0.91964483],
       [-0.87208366, -0.50543964, -0.94446677, ..., -0.858475  , -0.7174535 ,  0.8808298 ]], dtype=float32)>
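
The encoder actually returns a dictionary with more than just the pooled output. A small sketch of what else is available from the TF Hub BERT encoder (the preprocessing model pads/truncates to 128 tokens by default):

out = bert_encoder(bert_preprocess(["500$ discount. hurry up"]))
print(out['pooled_output'].shape)    # (1, 768): one summary vector per sentence (what this notebook uses)
print(out['sequence_output'].shape)  # (1, 128, 768): one contextual vector per token position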

Get embedding vectors for a few sample words and names, then compare them using cosine similarity

e = get_sentence_embedding([
    "banana",
    "grapes",
    "mango",
    "jeff bezos",
    "elon musk",
    "bill gates"
])
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity([e[0]], [e[1]])
array([[0.9911089]], dtype=float32)

Values near 1 mean the vectors are similar; values near 0 mean they are very different. Above, comparing "banana" vs "grapes" gives 0.99 similarity, as both are fruits.

cosine_similarity([e[0]],[e[3]])
array([[0.8470385]], dtype=float32)

Comparing "banana" with "jeff bezos" still gives 0.84, but that is not as close as the 0.99 we got with "grapes".

cosine_similarity([e[3]],[e[4]])
array([[0.98720354]], dtype=float32)

Jeff Bezos and Elon Musk are more similar than Jeff Bezos and banana, as indicated above.
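
To inspect all six items at once, you can compute the full pairwise similarity matrix from the same embeddings (a minimal sketch; cosine_similarity accepts 2-D arrays directly):

import numpy as np

sim = cosine_similarity(e.numpy(), e.numpy())  # (6, 6) matrix; entry [i, j] compares item i with item j
np.round(sim, 2)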

Build Model

There are two types of models you can build in TensorFlow:

(1) Sequential (2) Functional

So far we have built Sequential models, but below we will build a Functional model. More information on these two APIs is here: https://becominghuman.ai/sequential-vs-functional-model-in-keras-20684f766057
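
As a quick illustration of the difference (a toy sketch, not the BERT model built below): the same two-layer network written both ways.

# Sequential API: a plain stack of layers, one input, one output
toy_seq_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Functional API: layers are called on tensors, which allows arbitrary graphs
toy_input = tf.keras.layers.Input(shape=(10,))
x = tf.keras.layers.Dense(16, activation='relu')(toy_input)
toy_output = tf.keras.layers.Dense(1, activation='sigmoid')(x)
toy_func_model = tf.keras.Model(inputs=toy_input, outputs=toy_output)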

# Bert layers
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)

# Neural network layers
l = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
l = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l)

# Use inputs and outputs to construct a final model
model = tf.keras.Model(inputs=[text_input], outputs=[l])
model.summary()
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
text (InputLayer)               [(None,)]            0
__________________________________________________________________________________________________
keras_layer (KerasLayer)        {'input_mask': (None 0           text[0][0]
__________________________________________________________________________________________________
keras_layer_1 (KerasLayer)      {'default': (None, 7 109482241   keras_layer[0][0]
                                                                 keras_layer[0][1]
                                                                 keras_layer[0][2]
__________________________________________________________________________________________________
dropout (Dropout)               (None, 768)          0           keras_layer_1[0][13]
__________________________________________________________________________________________________
output (Dense)                  (None, 1)            769         dropout[0][0]
==================================================================================================
Total params: 109,483,010
Trainable params: 769
Non-trainable params: 109,482,241
__________________________________________________________________________________________________
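
Only the 769 parameters of the final Dense layer are trainable; the BERT encoder stays frozen because hub.KerasLayer defaults to trainable=False. If you wanted to fine-tune the encoder itself (much slower, and usually done with a smaller learning rate), you could load it like this (a sketch, not what this notebook does):

bert_encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4",
    trainable=True)  # the ~109M encoder weights would then be updated during training as well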
len(X_train)
1120
METRICS = [
    tf.keras.metrics.BinaryAccuracy(name='accuracy'),
    tf.keras.metrics.Precision(name='precision'),
    tf.keras.metrics.Recall(name='recall')
]

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=METRICS)

Train the model

model.fit(X_train, y_train, epochs=10)
Epoch 1/10
35/35 [==============================] - 7s 189ms/step - loss: 0.3398 - accuracy: 0.8857 - precision: 0.8750 - recall: 0.9000
Epoch 2/10
35/35 [==============================] - 6s 185ms/step - loss: 0.3271 - accuracy: 0.8857 - precision: 0.8649 - recall: 0.9143
Epoch 3/10
35/35 [==============================] - 7s 187ms/step - loss: 0.3093 - accuracy: 0.8920 - precision: 0.8844 - recall: 0.9018
Epoch 4/10
35/35 [==============================] - 7s 187ms/step - loss: 0.2920 - accuracy: 0.9071 - precision: 0.8986 - recall: 0.9179
Epoch 5/10
35/35 [==============================] - 7s 187ms/step - loss: 0.2837 - accuracy: 0.9098 - precision: 0.9076 - recall: 0.9125
Epoch 6/10
35/35 [==============================] - 7s 187ms/step - loss: 0.2741 - accuracy: 0.9062 - precision: 0.9027 - recall: 0.9107
Epoch 7/10
35/35 [==============================] - 7s 189ms/step - loss: 0.2643 - accuracy: 0.9089 - precision: 0.8962 - recall: 0.9250
Epoch 8/10
35/35 [==============================] - 7s 186ms/step - loss: 0.2570 - accuracy: 0.9161 - precision: 0.9161 - recall: 0.9161
Epoch 9/10
35/35 [==============================] - 7s 196ms/step - loss: 0.2512 - accuracy: 0.9134 - precision: 0.9026 - recall: 0.9268
Epoch 10/10
35/35 [==============================] - 7s 193ms/step - loss: 0.2419 - accuracy: 0.9179 - precision: 0.9239 - recall: 0.9107
<tensorflow.python.keras.callbacks.History at 0x1db822fcf70>
model.evaluate(X_test, y_test)
12/12 [==============================] - 4s 194ms/step - loss: 0.2600 - accuracy: 0.9064 - precision: 0.8486 - recall: 0.9893
[0.2599719762802124, 0.9064171314239502, 0.8486238718032837, 0.9893048405647278]
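
evaluate returns the metrics in the order they were compiled (loss, accuracy, precision, recall); zipping the result with model.metrics_names makes the mapping explicit (a small convenience sketch):

results = model.evaluate(X_test, y_test)
print(dict(zip(model.metrics_names, results)))
# e.g. {'loss': 0.26, 'accuracy': 0.91, 'precision': 0.85, 'recall': 0.99}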
y_predicted = model.predict(X_test)
y_predicted = y_predicted.flatten()
import numpy as np

y_predicted = np.where(y_predicted > 0.5, 1, 0)  # threshold the sigmoid outputs at 0.5
y_predicted
array([1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0])
from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(y_test, y_predicted)
cm
array([[154,  33],
       [  2, 185]], dtype=int64)
from matplotlib import pyplot as plt
import seaborn as sn

sn.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')
Text(33.0, 0.5, 'Truth')
[Seaborn heatmap of the confusion matrix: predicted labels on the x-axis, true labels on the y-axis]
print(classification_report(y_test, y_predicted))
              precision    recall  f1-score   support

           0       0.99      0.82      0.90       187
           1       0.85      0.99      0.91       187

    accuracy                           0.91       374
   macro avg       0.92      0.91      0.91       374
weighted avg       0.92      0.91      0.91       374

Inference

reviews = [
    'Enter a chance to win $5000, hurry up, offer valid until march 31, 2021',
    'You are awarded a SiPix Digital Camera! call 09061221061 from landline. Delivery within 28days. T Cs Box177. M221BP. 2yr warranty. 150ppm. 16 . p p£3.99',
    'it to 80488. Your 500 free text messages are valid until 31 December 2005.',
    'Hey Sam, Are you coming for a cricket game tomorrow',
    "Why don't you wait 'til at least wednesday to see if you get your ."
]
model.predict(reviews)
array([[0.8734353 ],
       [0.92858446],
       [0.8960864 ],
       [0.29311982],
       [0.13262196]], dtype=float32)
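
The raw outputs above are sigmoid probabilities. Applying the same 0.5 threshold used earlier turns them into spam/ham labels (a small sketch):

probs = model.predict(reviews).flatten()
labels = np.where(probs > 0.5, 'spam', 'ham')
list(zip(reviews, labels))
# the first three promotional messages come out as spam, the last two as ham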