Text Classification with NLTK
Text Classification is the process of assigning a label or category to a given piece of text. For example, we can classify emails as spam or not spam, tweets as positive or negative, and articles as relevant or not relevant to a given topic.
Let us take an example:
I am happy with your response
We have a robbery reported on 20th July near Delhi
I am not satisfied with your feedback
I am ok with this situation
Example: Consider news classification, where we train a model to predict the category of a news article from its content. Here’s how to do this using scikit-learn:
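A minimal sketch of such a workflow is given here, assuming TF-IDF features and a logistic regression classifier. The toy corpus, the `true`/`fake` labels, and the model choice are illustrative assumptions, not the notebook's own data or method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; a real run would load the news dataset instead.
texts = [
    "Government publishes annual budget report",
    "Scientists confirm results of vaccine trial",
    "Central bank keeps interest rates unchanged",
    "City council approves new public transport plan",
    "Aliens secretly control the stock market",
    "Miracle pill cures every disease overnight",
    "Celebrity clone spotted on the moon",
    "Secret potion makes people invisible",
]
labels = ["true", "true", "true", "true", "fake", "fake", "fake", "fake"]

# Hold out a quarter of the data, keeping both classes in each split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline turns raw text into TF-IDF vectors, then fits the classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate on the held-out articles.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```

With only eight toy sentences the accuracy printed here is not meaningful; the point is the order of the steps: split, vectorize, fit, predict, score.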
In [3]:
Out[3]:
(2445, 3)
In [18]:
Out[18]:
Accuracy: 0.901840490797546
In [20]:
Out[20]:
array([[229, 20],
[ 28, 212]], dtype=int64)
from sklearn.metrics import multilabel_confusion_matrix
multilabel_confusion_matrix(y_test, y_pred,
In [21]:
Out[21]:
[Figure: confusion matrix plot; x-axis: Predicted label]
In [22]:
Out[22]:
Misclassified samples: 48
Accuracy: 0.901840490797546
On the test set, the model correctly classifies a news article as true or fake about 90% of the time.
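The misclassification count and the accuracy can both be checked against the confusion matrix reported earlier. The sketch below uses only the four numbers from that output and assumes scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
# Confusion matrix values copied from the output above.
# Rows: actual class, columns: predicted class (scikit-learn's convention).
cm = [[229, 20],
      [28, 212]]

total = sum(sum(row) for row in cm)              # all test samples
correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal entries
misclassified = total - correct                  # off-diagonal entries
accuracy = correct / total

print("Test samples:", total)
print("Misclassified samples:", misclassified)
print("Accuracy:", accuracy)
```

The off-diagonal entries (20 and 28) sum to the 48 misclassified samples, and 441 correct predictions out of 489 give the accuracy of roughly 0.9018.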
Checking the Model Performance on new data
In [24]:
Out[24]:
Text: This is a legitimate news article about recent scientific discoveries.
Prediction: true
Text: Breaking: UFO spotted over New York City!
Prediction: fake
Text: Study finds that eating chocolate can improve memory.
Prediction: true
Text: Breaking News: Famous celebrity announces presidential run.
Prediction: fake
Text: New research reveals the dangers of excessive screen time for children.
Prediction: true
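Output in this form could be produced by passing raw strings to a fitted pipeline and printing each prediction. The following is a sketch under stated assumptions: the training sentences, the bag-of-words plus naive Bayes model, and the variable names are hypothetical stand-ins for the notebook's trained model, so the predicted labels may differ from those shown above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data standing in for the real news dataset.
train_texts = [
    "Researchers publish study on memory and diet",
    "New research examines screen time in children",
    "Breaking: monster spotted in city lake",
    "Breaking news: star claims to be a time traveller",
]
train_labels = ["true", "true", "fake", "fake"]

# Vectorizer and classifier live in one pipeline, so unseen raw text
# goes through the same feature extraction as the training data.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Two of the unseen examples listed above.
new_texts = [
    "Breaking: UFO spotted over New York City!",
    "Study finds that eating chocolate can improve memory.",
]
predictions = model.predict(new_texts)

for text, label in zip(new_texts, predictions):
    print("Text:", text)
    print("Prediction:", label)
```

Keeping the vectorizer inside the pipeline avoids a common mistake: fitting a new vectorizer on the unseen text, which would produce features that do not match the ones the classifier was trained on.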