Path: blob/master/final_customization.ipynb
Kernel: Python 3
Hate Speech Detection
In [103]:
In [105]:
Out[105]:
In [106]:
Out[106]:
Unnamed: 0 count hate_speech offensive_language neither class \
0 0 3 0 0 3 2
1 1 3 0 3 0 1
2 2 3 0 3 0 1
3 3 3 0 2 1 1
4 4 6 0 6 0 1
tweet text length
0 !!! RT @mayasolovely: As a woman you shouldn't... 140
1 !!!!! RT @mleew17: boy dats cold...tyga dwn ba... 85
2 !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby... 120
3 !!!!!!!!! RT @C_G_Anderson: @viva_based she lo... 62
4 !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you... 137
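The preview above is consistent with loading the labelled tweets into pandas and adding a text-length column. A minimal sketch of that step, assuming the CSV filename (labeled_data.csv) and relying on the column names shown in the preview; the notebook's actual loading code is not displayed.

import pandas as pd

# Load the labelled tweets and add the character count per tweet
# (filename is an assumption).
df = pd.read_csv('labeled_data.csv')
df['text length'] = df['tweet'].str.len()
df.head()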
In [107]:
Out[107]:
<seaborn.axisgrid.FacetGrid at 0x1c0b3cc71d0>
a. The distribution of text length looks broadly similar across all three classes.
b. The number of tweets is heavily skewed towards class 1 (offensive language).
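A sketch of a plot that would produce the FacetGrid object above: one histogram of text length per class. Column names follow the preview; the bin count is arbitrary.

import seaborn as sns
import matplotlib.pyplot as plt

# Per-class histograms of tweet length (one facet per class label).
g = sns.FacetGrid(df, col='class')
g.map(plt.hist, 'text length', bins=30)
plt.show()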
In [108]:
Out[108]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c0b39eeb00>
From the box plot, class-1 tweets seem to have much longer text, but there are also many outliers, so text length will not be a useful feature to rely on.
In [109]:
Out[109]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c0b3b2d6a0>
The histogram above shows that most of the tweets were labelled as offensive language by the CF coders.
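A sketch of that class-distribution plot using a seaborn count plot over the class column; the label mapping follows the dataframe preview (0 = hate speech, 1 = offensive language, 2 = neither).

import seaborn as sns
import matplotlib.pyplot as plt

# Number of tweets per class label.
sns.countplot(x='class', data=df)
plt.xlabel('Class (0 = hate speech, 1 = offensive language, 2 = neither)')
plt.ylabel('Count')
plt.show()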
In [110]:
Preprocessing of the tweets
In [112]:
Out[112]:
tweet \
0 !!! RT @mayasolovely: As a woman you shouldn't...
1 !!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2 !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3 !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4 !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...
5 !!!!!!!!!!!!!!!!!!"@T_Madison_x: The shit just...
6 !!!!!!"@__BrighterDays: I can not just sit up ...
7 !!!!“@selfiequeenbri: cause I'm tired of...
8 " & you might not get ya bitch back & ...
9 " @rhythmixx_ :hobbies include: fighting Maria...
processed_tweets
0 woman complain clean hous amp man alway take t...
1 boy dat cold tyga dwn bad cuffin dat hoe st place
2 dawg ever fuck bitch start cri confus shit
3 look like tranni
4 shit hear might true might faker bitch told ya
5 shit blow claim faith somebodi still fuck hoe
6 sit hate anoth bitch got much shit go
7 caus tire big bitch come us skinni girl
8 amp might get ya bitch back amp that
9 hobbi includ fight mariam bitch
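The processed tweets above are lower-cased, stripped of retweet markers, handles, URLs and punctuation, stopword-filtered, and Porter-stemmed. A minimal sketch of that kind of pipeline; the notebook's exact cleaning rules may differ.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(tweet):
    tweet = tweet.lower()
    tweet = re.sub(r'rt\s+@\w+:?', ' ', tweet)   # retweet markers and handles
    tweet = re.sub(r'@\w+', ' ', tweet)          # remaining mentions
    tweet = re.sub(r'https?://\S+', ' ', tweet)  # URLs
    tweet = re.sub(r'[^a-z\s]', ' ', tweet)      # punctuation and digits
    tokens = [stemmer.stem(t) for t in tweet.split() if t not in stop_words]
    return ' '.join(tokens)

df['processed_tweets'] = df['tweet'].apply(preprocess)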
Visualizations
In [113]:
Out[113]:
In [114]:
Out[114]:
In [115]:
Out[115]:
Feature Engineering
In [116]:
Out[116]:
<24783x6441 sparse matrix of type '<class 'numpy.float64'>'
with 189618 stored elements in Compressed Sparse Row format>
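A hedged sketch of the TF-IDF step behind the sparse matrix shown above; the reported shape (24783 x 6441) implies some vocabulary pruning, but the exact vectorizer parameters are assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer

# Build the TF-IDF matrix over the preprocessed tweets
# (ngram_range / min_df / max_df values are assumptions).
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5, max_df=0.75)
tfidf = tfidf_vectorizer.fit_transform(df['processed_tweets'])
print(tfidf.shape)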
Running various models using TF-IDF without additional features
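A sketch of the train/evaluate loop behind the reports below, assuming an 80/20 split (4957 of 24783 tweets held out) and near-default model settings; the notebook's actual hyperparameters and random seed are not shown.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    tfidf, df['class'], test_size=0.2, random_state=42)

models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'Naive Bayes': MultinomialNB(),
    'SVM': LinearSVC(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(classification_report(y_test, preds))
    print(f'{name}, Accuracy Score: {accuracy_score(y_test, preds)}')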
In [117]:
Out[117]:
precision recall f1-score support
0 0.54 0.15 0.23 290
1 0.91 0.97 0.93 3832
2 0.85 0.80 0.83 835
micro avg 0.89 0.89 0.89 4957
macro avg 0.77 0.64 0.66 4957
weighted avg 0.87 0.89 0.88 4957
Logistic Regression, Accuracy Score: 0.8910631430300585
In [118]:
Out[118]:
precision recall f1-score support
0 0.51 0.29 0.37 290
1 0.93 0.95 0.94 3832
2 0.83 0.89 0.86 835
micro avg 0.90 0.90 0.90 4957
macro avg 0.76 0.71 0.72 4957
weighted avg 0.89 0.90 0.89 4957
Random Forest, Accuracy Score: 0.9009481541254791
In [119]:
Out[119]:
precision recall f1-score support
0 0.10 0.39 0.16 290
1 0.89 0.68 0.77 3832
2 0.54 0.58 0.56 835
micro avg 0.65 0.65 0.65 4957
macro avg 0.51 0.55 0.50 4957
weighted avg 0.79 0.65 0.70 4957
Naive Bayes, Accuracy Score: 0.6491829735727255
In [120]:
Out[120]:
precision recall f1-score support
0 0.46 0.26 0.33 290
1 0.92 0.95 0.94 3832
2 0.83 0.85 0.84 835
micro avg 0.89 0.89 0.89 4957
macro avg 0.74 0.69 0.70 4957
weighted avg 0.88 0.89 0.89 4957
SVM, Accuracy Score: 0.8932822271535202
In [121]:
Out[121]:
Sentiment Analysis, using polarity scores as features
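A minimal sketch of computing per-tweet polarity scores with NLTK's VADER analyser. Whether the notebook uses VADER or another sentiment library is an assumption, and only four of the seven columns eventually appended to the feature matrix would come from these scores.

import nltk
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon', quiet=True)
sid = SentimentIntensityAnalyzer()

# Four polarity scores (neg / neu / pos / compound) per raw tweet.
sentiment_df = pd.DataFrame([sid.polarity_scores(t) for t in df['tweet']])
sentiment_df.head()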
In [122]:
Out[122]:
In [123]:
Out[123]:
(24783, 6448)
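The (24783, 6448) shape above indicates seven extra columns appended to the TF-IDF matrix. A sketch of the stacking mechanism with scipy.sparse.hstack, shown here with the four VADER scores only; the remaining appended columns are not shown in the output.

from scipy.sparse import hstack, csr_matrix

# Append dense sentiment columns to the sparse TF-IDF matrix.
tfidf_sentiment = hstack([tfidf, csr_matrix(sentiment_df.values)]).tocsr()
print(tfidf_sentiment.shape)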
Running various models using TF-IDF and additional features
In [181]:
Out[181]:
precision recall f1-score support
0 0.56 0.15 0.24 290
1 0.91 0.96 0.94 3832
2 0.84 0.83 0.84 835
micro avg 0.89 0.89 0.89 4957
macro avg 0.77 0.65 0.67 4957
weighted avg 0.88 0.89 0.88 4957
Logistic Regression, Accuracy Score: 0.8946943715957232
In [125]:
Out[125]:
precision recall f1-score support
0 0.44 0.15 0.23 290
1 0.90 0.97 0.93 3832
2 0.86 0.76 0.81 835
micro avg 0.88 0.88 0.88 4957
macro avg 0.73 0.63 0.66 4957
weighted avg 0.87 0.88 0.87 4957
Random Forest, Accuracy Score: 0.8840024208190438
In [126]:
Out[126]:
precision recall f1-score support
0 0.10 0.39 0.16 290
1 0.89 0.68 0.77 3832
2 0.54 0.59 0.56 835
micro avg 0.65 0.65 0.65 4957
macro avg 0.51 0.55 0.50 4957
weighted avg 0.79 0.65 0.70 4957
Naive Bayes, Accuracy Score: 0.650191648174299
In [127]:
Out[127]:
precision recall f1-score support
0 0.46 0.26 0.33 290
1 0.92 0.95 0.94 3832
2 0.83 0.85 0.84 835
micro avg 0.89 0.89 0.89 4957
macro avg 0.74 0.69 0.70 4957
weighted avg 0.88 0.89 0.88 4957
SVM, Accuracy Score: 0.891466612870688
C:\Users\NAKUL LAKHOTIA\Anaconda3\lib\site-packages\sklearn\svm\base.py:931: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
"the number of iterations.", ConvergenceWarning)
In [128]:
Out[128]:
In [129]:
Out[129]:
C:\Users\NAKUL LAKHOTIA\Anaconda3\lib\site-packages\gensim\models\base_any2vec.py:743: UserWarning: C extension not loaded, training will be slow. Install a C compiler and reinstall gensim for fast training.
"C extension not loaded, training will be slow. "
In [130]:
Out[130]:
(24783, 6453)
Running the models using TF-IDF with additional features from sentiment analysis and doc2vec
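The gensim warning a few cells above comes from training a Doc2Vec model. The feature count grows from 6448 to 6453, which would be consistent with five-dimensional document vectors, but that dimensionality (and the gensim 4.x API used here) is an assumption.

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from scipy.sparse import hstack, csr_matrix

# Train a small Doc2Vec model on the processed tweets and append its vectors.
tagged = [TaggedDocument(words=t.split(), tags=[i])
          for i, t in enumerate(df['processed_tweets'])]
d2v = Doc2Vec(tagged, vector_size=5, min_count=2, epochs=20)  # 5 dims is assumed

doc_vectors = np.vstack([d2v.dv[i] for i in range(len(tagged))])
features = hstack([tfidf_sentiment, csr_matrix(doc_vectors)]).tocsr()
print(features.shape)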
In [179]:
Out[179]:
precision recall f1-score support
0 0.56 0.15 0.24 290
1 0.91 0.96 0.94 3832
2 0.84 0.83 0.84 835
micro avg 0.89 0.89 0.89 4957
macro avg 0.77 0.65 0.67 4957
weighted avg 0.88 0.89 0.88 4957
Logistic Regression, Accuracy Score: 0.8946943715957232
In [133]:
Out[133]:
precision recall f1-score support
0 0.46 0.17 0.24 290
1 0.89 0.96 0.93 3832
2 0.84 0.70 0.77 835
micro avg 0.87 0.87 0.87 4957
macro avg 0.73 0.61 0.65 4957
weighted avg 0.86 0.87 0.86 4957
Random Forest, Accuracy Score: 0.8739156748033085
In [172]:
Out[172]:
precision recall f1-score support
0 0.10 0.39 0.16 290
1 0.89 0.68 0.77 3832
2 0.54 0.59 0.56 835
micro avg 0.65 0.65 0.65 4957
macro avg 0.51 0.55 0.50 4957
weighted avg 0.79 0.65 0.70 4957
Naive Bayes, Accuracy Score: 0.650191648174299
In [135]:
Out[135]:
precision recall f1-score support
0 0.44 0.26 0.33 290
1 0.92 0.95 0.94 3832
2 0.82 0.85 0.84 835
micro avg 0.89 0.89 0.89 4957
macro avg 0.73 0.68 0.70 4957
weighted avg 0.88 0.89 0.88 4957
SVM, Accuracy Score: 0.890054468428485
C:\Users\NAKUL LAKHOTIA\Anaconda3\lib\site-packages\sklearn\svm\base.py:931: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
"the number of iterations.", ConvergenceWarning)
In [136]:
Out[136]:
In [137]:
In [138]:
Out[138]:
(24783, 6461)
Running the models using TF-IDF with sentiment scores, doc2vec, and enhanced features
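The "enhanced features" themselves are not shown in the output; the count grows by eight columns (6453 to 6461). A purely illustrative sketch of tweet-level surface statistics that could fill that role; the notebook's actual feature set may be different.

import numpy as np
from scipy.sparse import hstack, csr_matrix

def surface_features(tweet):
    words = tweet.split()
    return [
        len(tweet),                                         # character count
        len(words),                                         # word count
        np.mean([len(w) for w in words]) if words else 0,   # average word length
        tweet.count('#'),                                   # hashtags
        tweet.count('@'),                                   # mentions
        tweet.count('!'),                                   # exclamation marks
        tweet.count('?'),                                   # question marks
        sum(c.isupper() for c in tweet),                    # upper-case characters
    ]

enhanced = np.array([surface_features(t) for t in df['tweet']])
features_all = hstack([features, csr_matrix(enhanced)]).tocsr()
print(features_all.shape)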
In [183]:
Out[183]:
precision recall f1-score support
0 0.56 0.14 0.23 279
1 0.91 0.97 0.94 3852
2 0.85 0.82 0.84 826
micro avg 0.90 0.90 0.90 4957
macro avg 0.78 0.64 0.67 4957
weighted avg 0.88 0.90 0.88 4957
Logistic Regression, Accuracy Score: 0.8961065160379261
In [184]:
Out[184]:
In [141]:
In [142]:
Out[142]:
Predicted Class: [2 1 1 1 1 1 1 1 2 1]
Actual Class: [2, 1, 1, 0, 2, 1, 1, 1, 2, 2]
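A sketch of the spot-check above, assuming clf is one of the fitted classifiers and (X_test, y_test) is the held-out split from the earlier sketches (both names are placeholders).

# Compare predictions with ground truth for the first ten test tweets.
print('Predicted Class:', clf.predict(X_test[:10]))
print('Actual Class:', list(y_test[:10]))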
In [143]:
Out[143]:
Text(0, 0.5, 'Count')
In [144]:
Out[144]:
Text(0, 0.5, 'Count')
In [153]:
Out[153]:
precision recall f1-score support
0 0.41 0.09 0.15 279
1 0.88 0.97 0.92 3852
2 0.84 0.63 0.72 826
micro avg 0.86 0.86 0.86 4957
macro avg 0.71 0.56 0.60 4957
weighted avg 0.84 0.86 0.84 4957
Random Forest, Accuracy Score: 0.8642323986282026
In [155]:
Out[155]:
precision recall f1-score support
0 0.09 0.36 0.15 279
1 0.90 0.69 0.78 3852
2 0.59 0.65 0.62 826
micro avg 0.66 0.66 0.66 4957
macro avg 0.53 0.57 0.51 4957
weighted avg 0.80 0.66 0.72 4957
Naive Bayes, Accuracy Score: 0.662497478313496
In [156]:
Out[156]:
In [186]:
Out[186]:
precision recall f1-score support
0 0.40 0.01 0.03 279
1 0.83 0.99 0.90 3852
2 0.92 0.36 0.51 826
micro avg 0.83 0.83 0.83 4957
macro avg 0.72 0.45 0.48 4957
weighted avg 0.82 0.83 0.79 4957
SVM, Accuracy Score: 0.8317530764575348
C:\Users\NAKUL LAKHOTIA\Anaconda3\lib\site-packages\sklearn\svm\base.py:931: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
"the number of iterations.", ConvergenceWarning)
In [159]:
Out[159]:
Combining different features
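A hedged sketch of the experiments in this section: the same linear SVM evaluated on different stacked feature combinations. Which columns make up each combination (6454, 6456, and 15 features) is not shown, so the example combination below is illustrative only; tfidf, sentiment_df, and enhanced refer to the earlier sketches.

from scipy.sparse import hstack, csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score

def evaluate_svm(feature_matrix, labels):
    # Fit and score a linear SVM on a given feature combination.
    X_train, X_test, y_train, y_test = train_test_split(
        feature_matrix, labels, test_size=0.2, random_state=42)
    svm = LinearSVC()
    svm.fit(X_train, y_train)
    preds = svm.predict(X_test)
    print(classification_report(y_test, preds))
    print('SVM, Accuracy Score:', accuracy_score(y_test, preds))

# e.g. TF-IDF + sentiment scores + surface statistics (illustrative combination).
combo = hstack([tfidf, csr_matrix(sentiment_df.values), csr_matrix(enhanced)]).tocsr()
print(combo.shape)
evaluate_svm(combo, df['class'])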
In [160]:
Out[160]:
(24783, 6454)
In [161]:
Out[161]:
C:\Users\NAKUL LAKHOTIA\Anaconda3\lib\site-packages\sklearn\svm\base.py:931: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
"the number of iterations.", ConvergenceWarning)
precision recall f1-score support
0 0.57 0.15 0.24 279
1 0.89 0.97 0.93 3852
2 0.86 0.73 0.79 826
micro avg 0.88 0.88 0.88 4957
macro avg 0.78 0.62 0.65 4957
weighted avg 0.87 0.88 0.87 4957
SVM, Accuracy Score: 0.8846076255799878
In [72]:
Out[72]:
(24783, 6456)
In [73]:
Out[73]:
C:\Users\NAKUL LAKHOTIA\Anaconda3\lib\site-packages\sklearn\svm\base.py:931: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
"the number of iterations.", ConvergenceWarning)
precision recall f1-score support
0 0.53 0.19 0.28 279
1 0.93 0.94 0.93 3852
2 0.76 0.88 0.81 826
micro avg 0.89 0.89 0.89 4957
macro avg 0.74 0.67 0.67 4957
weighted avg 0.88 0.89 0.88 4957
SVM, Accuracy Score: 0.8876336493847085
In [163]:
Out[163]:
(24783, 15)
In [164]:
Out[164]:
precision recall f1-score support
0 0.14 0.01 0.02 279
1 0.79 0.99 0.88 3852
2 0.66 0.06 0.11 826
micro avg 0.78 0.78 0.78 4957
macro avg 0.53 0.35 0.34 4957
weighted avg 0.73 0.78 0.70 4957
SVM, Accuracy Score: 0.7805124066975994
C:\Users\NAKUL LAKHOTIA\Anaconda3\lib\site-packages\sklearn\svm\base.py:931: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
"the number of iterations.", ConvergenceWarning)
In [166]:
In [178]:
Out[178]:
precision recall f1-score support
0 0.09 0.36 0.15 279
1 0.90 0.69 0.78 3852
2 0.59 0.65 0.62 826
micro avg 0.66 0.66 0.66 4957
macro avg 0.53 0.57 0.51 4957
weighted avg 0.80 0.66 0.72 4957
Naive Bayes, Accuracy Score: 0.662497478313496
In [168]:
In [176]:
Out[176]:
precision recall f1-score support
0 0.56 0.14 0.23 279
1 0.91 0.97 0.94 3852
2 0.84 0.82 0.83 826
micro avg 0.89 0.89 0.89 4957
macro avg 0.77 0.64 0.67 4957
weighted avg 0.88 0.89 0.88 4957
Logistic Regression, Accuracy Score: 0.8946943715957232
In [174]:
Out[174]:
precision recall f1-score support
0 0.13 0.04 0.06 279
1 0.82 0.92 0.86 3852
2 0.47 0.32 0.38 826
micro avg 0.77 0.77 0.77 4957
macro avg 0.48 0.42 0.44 4957
weighted avg 0.72 0.77 0.74 4957
Random Forest, Accuracy Score: 0.7682065765584023
In [171]:
Out[171]: