codebasics
GitHub Repository: codebasics/deep-learning-keras-tf-tutorial
Path: blob/master/47_BERT_text_classification/BERT_email_classification-Copy2.ipynb

BERT tutorial: Classify spam vs. non-spam emails

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # registers the text ops the TF Hub preprocessing model needs

Import the dataset (taken from Kaggle)

import pandas as pd

df = pd.read_csv("spam.csv")
df.head(5)
df.groupby('Category').describe()
df['spam'] = df['Category'].apply(lambda x: 1 if x == 'spam' else 0)
df.head()
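As a quick sanity check (an addition of mine, not in the original notebook), you can look at the class balance. SMS spam datasets are typically heavily skewed toward ham, which is why the split below passes stratify:

# Count how many messages fall into each class; a large imbalance here
# is what motivates stratify= in train_test_split below.
df['spam'].value_counts()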

Split it into training and test datasets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['Message'], df['spam'], stratify=df['spam'])
X_train.head(4)
1717    Sorry about earlier. Putting out fires.Are you...
707     So when do you wanna gym harri
4667    Not..tel software name..
5188    Okie
Name: Message, dtype: object

Now let's import the BERT model and get embedding vectors for a few sample statements

bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4", trainable=True)
def get_sentence_embedding(sentences):
    preprocessed_text = bert_preprocess(sentences)
    return bert_encoder(preprocessed_text)['pooled_output']

get_sentence_embedding([
    "500$ discount. hurry up",
    "Bhavin, are you up for a volleybal game tomorrow?"
])
ERROR:absl:hub.KerasLayer is trainable but has zero trainable weights.
<tf.Tensor: shape=(2, 768), dtype=float32, numpy=
array([[-0.8435169 , -0.51327276, -0.8884574 , ..., -0.74748874,
        -0.75314736,  0.91964483],
       [-0.87208366, -0.50543964, -0.94446677, ..., -0.858475  ,
        -0.7174535 ,  0.8808298 ]], dtype=float32)>
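If you are curious what the preprocessing layer actually feeds the encoder, here is a small inspection sketch, assuming the standard output signatures of these two TF Hub models:

# The preprocessor returns a dict of int32 tensors padded/truncated to 128
# tokens; the encoder returns a dict where 'pooled_output' is one 768-dim
# vector per sentence and 'sequence_output' is one vector per token.
sample = bert_preprocess(["500$ discount. hurry up"])
print(sample.keys())                                  # input_word_ids, input_mask, input_type_ids
print(bert_encoder(sample)['sequence_output'].shape)  # (1, 128, 768)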

Get embedding vectors for a few sample words and compare them using cosine similarity

e = get_sentence_embedding([
    "banana",
    "grapes",
    "mango",
    "jeff bezos",
    "elon musk",
    "bill gates"
])
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity([e[0]], [e[1]])
array([[0.9911089]], dtype=float32)

Values close to 1 mean the inputs are similar; values close to 0 mean they are very different. Above, comparing "banana" with "grapes" gives a similarity of 0.99, as both are fruits.
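To make the metric concrete, here is a minimal sketch of what cosine_similarity computes, using two made-up toy vectors instead of real embeddings:

import numpy as np

# Hypothetical toy vectors, just to illustrate the formula.
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.5])

# Cosine similarity = dot product divided by the product of the norms.
np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))  # ~0.999: v points almost the same way as u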

cosine_similarity([e[0]],[e[3]])
array([[0.8470385]], dtype=float32)

Comparing "banana" with "jeff bezos" still gives 0.84, but that is not as close as the 0.99 we got with "grapes".

cosine_similarity([e[3]],[e[4]])
array([[0.98720354]], dtype=float32)

As the scores above indicate, "jeff bezos" and "elon musk" are more similar to each other than "jeff bezos" is to "banana".
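Rather than comparing pairs one at a time, you can also compute the full 6x6 similarity matrix in a single call (a small extension, not in the original notebook):

# With one argument, cosine_similarity compares every row against every
# other row, so sim[i][j] is the similarity of embedding i and embedding j.
sim = cosine_similarity(e)
sim.shape  # (6, 6)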

Build Model

There are two types of models you can build in TensorFlow:

(1) Sequential (2) Functional

So far we have built sequential models, but below we will build a functional model; see the sketch after this paragraph. More information on the two APIs is here: https://becominghuman.ai/sequential-vs-functional-model-in-keras-20684f766057
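To see the difference side by side, here is a sketch of the same tiny two-layer network written both ways (toy layer sizes, unrelated to the BERT model below):

# Sequential API: a plain stack of layers, one input, one output.
seq_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(8,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Functional API: tensors are wired explicitly, which allows multiple
# inputs/outputs and non-linear topologies -- what the BERT model needs.
inp = tf.keras.layers.Input(shape=(8,))
x = tf.keras.layers.Dense(16, activation='relu')(inp)
out = tf.keras.layers.Dense(1, activation='sigmoid')(x)
func_model = tf.keras.Model(inputs=inp, outputs=out)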

# Bert layers
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)

# Neural network layers
l = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
l = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l)

# Use inputs and outputs to construct a final model
model = tf.keras.Model(inputs=[text_input], outputs=[l])
model.summary()
Model: "model" __________________________________________________________________________________________________ Layer (type) Output Shape Param # Connected to ================================================================================================== text (InputLayer) [(None,)] 0 __________________________________________________________________________________________________ keras_layer (KerasLayer) {'input_mask': (None 0 text[0][0] __________________________________________________________________________________________________ keras_layer_1 (KerasLayer) {'default': (None, 7 109482241 keras_layer[0][0] keras_layer[0][1] keras_layer[0][2] __________________________________________________________________________________________________ dropout (Dropout) (None, 768) 0 keras_layer_1[0][13] __________________________________________________________________________________________________ output (Dense) (None, 1) 769 dropout[0][0] ================================================================================================== Total params: 109,483,010 Trainable params: 109,483,009 Non-trainable params: 1 __________________________________________________________________________________________________
len(X_train)
4179
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
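Since the dataset is imbalanced, accuracy alone can look better than the model really is. An optional variant (my addition, not part of the original notebook) also tracks precision and recall:

# Optional: monitor precision and recall in addition to accuracy.
metrics = [
    tf.keras.metrics.BinaryAccuracy(name='accuracy'),
    tf.keras.metrics.Precision(name='precision'),
    tf.keras.metrics.Recall(name='recall'),
]
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=metrics)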

Train the model

model.fit(X_train, y_train, epochs=5)
Epoch 1/5
model.evaluate(X_test, y_test)
44/44 [==============================] - 9s 182ms/step - loss: 0.1475 - accuracy: 0.9548
[0.14750021696090698, 0.9547738432884216]
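evaluate only reports loss and accuracy. As an optional follow-up (not in the original notebook), you can threshold the predicted probabilities at 0.5 and print a per-class report:

from sklearn.metrics import classification_report

# Turn sigmoid probabilities into hard 0/1 labels and compare with y_test.
y_pred = (model.predict(X_test).flatten() > 0.5).astype(int)
print(classification_report(y_test, y_pred))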

Inference

reviews = [
    'Reply to win £100 weekly! Where will the 2006 FIFA World Cup be held? Send STOP to 87239 to end service',
    'You are awarded a SiPix Digital Camera! call 09061221061 from landline. Delivery within 28days. T Cs Box177. M221BP. 2yr warranty. 150ppm. 16 . p p£3.99',
    'it to 80488. Your 500 free text messages are valid until 31 December 2005.',
    'Hey Sam, Are you coming for a cricket game tomorrow',
    "Why don't you wait 'til at least wednesday to see if you get your ."
]
model.predict(reviews)
array([[0.6472808 ],
       [0.7122627 ],
       [0.5710311 ],
       [0.06721176],
       [0.02479185]], dtype=float32)
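The model outputs sigmoid probabilities, not labels, so anything above 0.5 is treated as spam. A small follow-up sketch (variable names here are my own) maps the scores to labels:

import numpy as np

# Apply a 0.5 cutoff: the first three reviews come out as spam, the last two as ham.
scores = model.predict(reviews)
np.where(scores.flatten() > 0.5, 'spam', 'ham')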