Path: blob/master/47_BERT_text_classification/BERT_email_classification-handle-imbalance.ipynb
Kernel: Python 3
BERT tutorial: Classify spam vs no spam emails
In [1]:
Import the dataset (the dataset is taken from Kaggle)
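A minimal sketch of what this loading step might look like, assuming the Kaggle SMS spam collection is saved locally as spam.csv with Category and Message columns (the filename and variable names are assumptions, not taken from the original cell):

import pandas as pd

# Load the spam/ham dataset; Category holds the label, Message the email text.
df = pd.read_csv("spam.csv")
df.head()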
In [2]:
Out[2]:
In [3]:
Out[3]:
In [5]:
Out[5]:
ham 4825
spam 747
Name: Category, dtype: int64
In [6]:
Out[6]:
0.15481865284974095
About 15% of the emails are spam and 85% are ham: this indicates class imbalance.
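The shapes printed below (747 spam rows, 4,825 ham rows, then a balanced 1,494-row frame) are consistent with downsampling the majority class to the size of the minority class. A hedged sketch of that step, with assumed variable names:

# Separate the two classes, then sample ham down to the spam count (747) and recombine.
df_spam = df[df["Category"] == "spam"]
df_ham = df[df["Category"] == "ham"]

df_ham_downsampled = df_ham.sample(n=len(df_spam), random_state=42)
df_balanced = pd.concat([df_spam, df_ham_downsampled])
df_balanced["Category"].value_counts()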
In [9]:
Out[9]:
(747, 2)
In [10]:
Out[10]:
(4825, 2)
In [12]:
Out[12]:
(747, 2)
In [13]:
Out[13]:
(1494, 2)
In [14]:
Out[14]:
spam 747
ham 747
Name: Category, dtype: int64
In [16]:
Out[16]:
Split it into training and test datasets
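A sketch of the split, assuming a numeric spam column (1 = spam, 0 = ham) is created first and the split is stratified on it so both sets stay balanced; the variable names and random seed are assumptions:

from sklearn.model_selection import train_test_split

# Encode the label as 0/1 so it can be fed to a single sigmoid output unit.
df_balanced["spam"] = df_balanced["Category"].apply(lambda x: 1 if x == "spam" else 0)

X_train, X_test, y_train, y_test = train_test_split(
    df_balanced["Message"], df_balanced["spam"],
    stratify=df_balanced["spam"], random_state=42)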
In [18]:
In [19]:
Out[19]:
3354 I emailed yifeng my part oredi.. Can ü get it ...
466 great princess! I love giving and receiving or...
4154 URGENT!! Your 4* Costa Del Sol Holiday or £500...
3162 Mystery solved! Just opened my email and he's ...
Name: Message, dtype: object
Now let's import the BERT model and get embedding vectors for a few sample statements
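The (2, 768) tensor below is BERT-base's pooled output for two sample sentences. A sketch of loading the model from TensorFlow Hub and pooling sentences into 768-dimensional vectors; the hub handles are the standard bert_en_uncased preprocess/encoder pair, and the two sample sentences are placeholders:

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the custom ops the preprocessing model needs

bert_preprocess = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

def get_sentence_embedding(sentences):
    # pooled_output is one 768-dimensional vector per input sentence.
    preprocessed = bert_preprocess(sentences)
    return bert_encoder(preprocessed)["pooled_output"]

get_sentence_embedding([
    "500$ discount, hurry up and claim your offer",
    "are we still meeting for lunch tomorrow?",
])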
In [20]:
In [21]:
Out[21]:
<tf.Tensor: shape=(2, 768), dtype=float32, numpy=
array([[-0.8435169 , -0.51327276, -0.8884574 , ..., -0.74748874,
-0.75314736, 0.91964483],
[-0.87208366, -0.50543964, -0.94446677, ..., -0.858475 ,
-0.7174535 , 0.8808298 ]], dtype=float32)>
Get embedding vectors for a few sample words and compare them using cosine similarity
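A sketch of this comparison using scikit-learn's cosine_similarity on the pooled vectors; the word list is inferred from the comparisons discussed below:

from sklearn.metrics.pairwise import cosine_similarity

e = get_sentence_embedding(
    ["banana", "grapes", "mango", "jeff bezos", "elon musk", "bill gates"])

# cosine_similarity expects 2-D input, so wrap each 768-dim vector in a list.
cosine_similarity([e[0]], [e[1]])   # banana vs grapes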
In [22]:
In [23]:
Out[23]:
array([[0.9911089]], dtype=float32)
Values close to 1 mean the vectors are similar; values near 0 mean they are very different. Above, comparing "banana" with "grapes" gives a similarity of 0.99, as both are fruits.
In [24]:
Out[24]:
array([[0.8470385]], dtype=float32)
Comparing "banana" with "jeff bezos" still gives 0.84, but it is not as close as the 0.99 we got with "grapes".
In [25]:
Out[25]:
array([[0.98720354]], dtype=float32)
"jeff bezos" and "elon musk" are more similar than "jeff bezos" and "banana", as indicated above.
Build Model
There are two types of models you can build in TensorFlow:
(1) Sequential (2) Functional
So far we have built sequential models, but below we will build a functional model. More information on the two is here: https://becominghuman.ai/sequential-vs-functional-model-in-keras-20684f766057
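A hedged sketch of the functional model matching the summary below, reusing the hub layers loaded earlier: a string input, the preprocessing and encoder layers, dropout on pooled_output, and a single sigmoid unit. Only the final 769 parameters (768 weights + 1 bias) are trainable because the hub encoder layer is non-trainable by default.

# Functional API: layers are called on tensors and wired together explicitly.
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
preprocessed = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed)

x = tf.keras.layers.Dropout(0.1, name="dropout")(outputs["pooled_output"])
output = tf.keras.layers.Dense(1, activation="sigmoid", name="output")(x)

model = tf.keras.Model(inputs=[text_input], outputs=[output])
model.summary()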
In [26]:
In [27]:
Out[27]:
Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
text (InputLayer) [(None,)] 0
__________________________________________________________________________________________________
keras_layer (KerasLayer) {'input_mask': (None 0 text[0][0]
__________________________________________________________________________________________________
keras_layer_1 (KerasLayer) {'default': (None, 7 109482241 keras_layer[0][0]
keras_layer[0][1]
keras_layer[0][2]
__________________________________________________________________________________________________
dropout (Dropout) (None, 768) 0 keras_layer_1[0][13]
__________________________________________________________________________________________________
output (Dense) (None, 1) 769 dropout[0][0]
==================================================================================================
Total params: 109,483,010
Trainable params: 769
Non-trainable params: 109,482,241
__________________________________________________________________________________________________
In [28]:
Out[28]:
1120
In [29]:
Train the model
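A sketch of the compile/fit calls consistent with the training log below: binary cross-entropy loss, the accuracy/precision/recall metric names that appear in the log, and 10 epochs.

METRICS = [
    tf.keras.metrics.BinaryAccuracy(name="accuracy"),
    tf.keras.metrics.Precision(name="precision"),
    tf.keras.metrics.Recall(name="recall"),
]

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=METRICS)
model.fit(X_train, y_train, epochs=10)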
In [31]:
Out[31]:
Epoch 1/10
35/35 [==============================] - 7s 189ms/step - loss: 0.3398 - accuracy: 0.8857 - precision: 0.8750 - recall: 0.9000
Epoch 2/10
35/35 [==============================] - 6s 185ms/step - loss: 0.3271 - accuracy: 0.8857 - precision: 0.8649 - recall: 0.9143
Epoch 3/10
35/35 [==============================] - 7s 187ms/step - loss: 0.3093 - accuracy: 0.8920 - precision: 0.8844 - recall: 0.9018
Epoch 4/10
35/35 [==============================] - 7s 187ms/step - loss: 0.2920 - accuracy: 0.9071 - precision: 0.8986 - recall: 0.9179
Epoch 5/10
35/35 [==============================] - 7s 187ms/step - loss: 0.2837 - accuracy: 0.9098 - precision: 0.9076 - recall: 0.9125
Epoch 6/10
35/35 [==============================] - 7s 187ms/step - loss: 0.2741 - accuracy: 0.9062 - precision: 0.9027 - recall: 0.9107
Epoch 7/10
35/35 [==============================] - 7s 189ms/step - loss: 0.2643 - accuracy: 0.9089 - precision: 0.8962 - recall: 0.9250
Epoch 8/10
35/35 [==============================] - 7s 186ms/step - loss: 0.2570 - accuracy: 0.9161 - precision: 0.9161 - recall: 0.9161
Epoch 9/10
35/35 [==============================] - 7s 196ms/step - loss: 0.2512 - accuracy: 0.9134 - precision: 0.9026 - recall: 0.9268
Epoch 10/10
35/35 [==============================] - 7s 193ms/step - loss: 0.2419 - accuracy: 0.9179 - precision: 0.9239 - recall: 0.9107
<tensorflow.python.keras.callbacks.History at 0x1db822fcf70>
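The test-set loss, accuracy, precision, and recall below presumably come from evaluating on the held-out split:

model.evaluate(X_test, y_test)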
In [32]:
Out[32]:
12/12 [==============================] - 4s 194ms/step - loss: 0.2600 - accuracy: 0.9064 - precision: 0.8486 - recall: 0.9893
[0.2599719762802124,
0.9064171314239502,
0.8486238718032837,
0.9893048405647278]
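The 0/1 array, confusion matrix, and classification report that follow are consistent with thresholding the sigmoid outputs at 0.5; a hedged sketch using scikit-learn:

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

y_predicted = model.predict(X_test)
y_predicted = np.where(y_predicted.flatten() > 0.5, 1, 0)

print(confusion_matrix(y_test, y_predicted))
print(classification_report(y_test, y_predicted))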
In [33]:
In [34]:
Out[34]:
array([1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1,
1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0,
0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0,
0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,
1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1,
1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1,
1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1,
1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0,
0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1,
1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1,
0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1,
1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0,
1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1,
0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0])
In [35]:
Out[35]:
array([[154, 33],
[ 2, 185]], dtype=int64)
In [43]:
Out[43]:
Text(33.0, 0.5, 'Truth')
In [36]:
Out[36]:
precision recall f1-score support
0 0.99 0.82 0.90 187
1 0.85 0.99 0.91 187
accuracy 0.91 374
macro avg 0.92 0.91 0.91 374
weighted avg 0.92 0.91 0.91 374
Inference
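A sketch of scoring new messages; the five example strings are placeholders (the original ones are not shown). Outputs above 0.5 would be read as spam, which matches the first three probabilities below being high and the last two low.

sample_emails = [
    "You have won a $1000 Walmart gift card. Click here to claim now.",
    "URGENT! Your loan of 50,000 has been pre-approved, reply YES to accept.",
    "Free entry in a weekly prize draw, text WIN to 80086 now.",
    "Hey, are we still on for lunch tomorrow at noon?",
    "Can you send me the slides from yesterday's meeting?",
]
model.predict(sample_emails)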
In [49]:
Out[49]:
array([[0.8734353 ],
[0.92858446],
[0.8960864 ],
[0.29311982],
[0.13262196]], dtype=float32)