suyashi29
GitHub Repository: suyashi29/python-su
Path: blob/master/Natural Language Processing using Python/Text Classification.ipynb
Kernel: Python 3 (ipykernel)
import nltk
nltk.download()

Text Classification with NLTK

Text Classification is the process of assigning a label or category to a given piece of text. For example, we can classify emails as spam or not spam, tweets as positive or negative, and articles as relevant or not relevant to a given topic.

Let us take an example:

  • I am happy with your response

  • We have a robbery reported on 20th July near Delhi

  • I am not satisfied with your feedback

  • I am ok with this situation

Example: Let us take news classification, where we train a model to predict the category of a news article from its content. Here is how to do this using scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Load the dataset
news = pd.read_csv('newss.csv')
news.head()
news.shape
(2445, 3)
news = news.dropna(how='any',axis=0)
news.describe(include="object")
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(news['text'], news['category'], test_size=0.2)

# Create a TfidfVectorizer to convert text to numerical features
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Train a logistic regression model
news_pred = LogisticRegression()
news_pred.fit(X_train, y_train)

# Evaluate the model on the test set
score = news_pred.score(X_test, y_test)
print('Accuracy:', score)
Accuracy: 0.901840490797546
y_pred=news_pred.predict(X_test)


# import the metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
array([[229,  20],
       [ 28, 212]], dtype=int64)
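The four cells of this matrix can be turned into the usual metrics by hand. The sketch below hard-codes the counts from the output above and relies on scikit-learn's convention that rows are actual labels and columns are predicted labels, in the order [0, 1]. Treating class 1 as the positive class is an assumption, since the notebook does not say what 0 and 1 stand for.

```python
# Counts copied from the confusion matrix above: rows = actual, columns = predicted
cnf = [[229, 20],
       [28, 212]]

# With labels ordered [0, 1] and class 1 treated as "positive" (an assumption)
(tn, fp), (fn, tp) = cnf

total = tn + fp + fn + tp
misclassified = fp + fn
accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # of the articles predicted as 1, how many really are 1
recall = tp / (tp + fn)        # of the articles that really are 1, how many were found

print("Total test samples:", total)
print("Misclassified:", misclassified)
print("Accuracy:  {:.4f}".format(accuracy))
print("Precision: {:.4f}".format(precision))
print("Recall:    {:.4f}".format(recall))
```

The accuracy computed this way agrees with the value printed by `news_pred.score`.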

from sklearn.metrics import multilabel_confusion_matrix
multilabel_confusion_matrix(y_test, y_pred, labels=[0, 1])
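The output of this cell is not shown in the notebook. As a sketch of what `multilabel_confusion_matrix` returns, the block below builds synthetic label arrays that reproduce the counts of the confusion matrix above; the real `y_test` and `y_pred` are not available here, so `y_true_demo` and `y_pred_demo` are stand-ins invented for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix

# Synthetic labels that reproduce the counts 229, 20, 28, 212 from the matrix above
y_true_demo = np.array([0] * 249 + [1] * 240)
y_pred_demo = np.array([0] * 229 + [1] * 20 + [0] * 28 + [1] * 212)

overall = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
print(overall)

# One 2x2 matrix per label, each laid out as [[TN, FP], [FN, TP]]
# with that label treated as the positive class
per_label = multilabel_confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
for label, matrix in zip([0, 1], per_label):
    tn, fp, fn, tp = matrix.ravel()
    print(f"label {label}: TN={tn} FP={fp} FN={fn} TP={tp}")
```

For a two-class problem the two per-label matrices carry the same information as the single confusion matrix, viewed once from each class's side.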
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import pie, axis, show
%matplotlib inline
import numpy as np

class_names = [0, 1]  # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
# Accuracy = (TP + TN) / (TP + TN + FP + FN)
Text(0.5, 257.44, 'Predicted label')
[Figure: confusion matrix heatmap; y-axis: actual label, x-axis: predicted label]
## Model parameters study :
from sklearn import metrics
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
Misclassified samples: 48
Accuracy: 0.901840490797546

On the held-out test set, the model correctly classifies a news item as true or false about 90% of the time (441 of 489 articles, with 48 misclassified).

Checking the Model Performance on new data

# Sample data for testing
sample_data = [
    "This is a legitimate news article about recent scientific discoveries.",
    "Breaking: UFO spotted over New York City!",
    "Study finds that eating chocolate can improve memory.",
    "Breaking News: Famous celebrity announces presidential run.",
    "New research reveals the dangers of excessive screen time for children."
]

def news_pred_predict(sample_texts):
    # This function should take a list of text samples and return the predictions made by your model
    # Replace this with the code that uses your trained model to make predictions
    predictions = ['fake' if "Breaking" in text else 'true' for text in sample_texts]  # Placeholder prediction logic
    return predictions

# Test the model
predictions = news_pred_predict(sample_data)

# Print the predictions
for text, prediction in zip(sample_data, predictions):
    print("Text:", text)
    print("Prediction:", prediction)
    print()
Text: This is a legitimate news article about recent scientific discoveries.
Prediction: true

Text: Breaking: UFO spotted over New York City!
Prediction: fake

Text: Study finds that eating chocolate can improve memory.
Prediction: true

Text: Breaking News: Famous celebrity announces presidential run.
Prediction: fake

Text: New research reveals the dangers of excessive screen time for children.
Prediction: true
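The function above is only a placeholder: it flags any text containing "Breaking" and never calls the trained model. In this notebook the real call would most likely be `news_pred.predict(vectorizer.transform(sample_texts))`, which returns the labels stored in the `category` column. The sketch below shows the same idea in a self-contained form. Its training texts and the 'true'/'fake' labels are a hypothetical toy set invented for illustration, not the contents of `newss.csv`, and `make_pipeline` is swapped in for the separate vectorizer and classifier so that raw text can be passed straight to `predict`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set, for illustration only (not the notebook's data)
train_texts = [
    "Scientists publish peer-reviewed study on vaccine safety",
    "Government releases quarterly economic report",
    "Aliens secretly control the world banks, insider claims",
    "Miracle pill cures all diseases overnight, doctors stunned",
]
train_labels = ["true", "true", "fake", "fake"]

# The pipeline vectorizes the text and then classifies it in one step
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

def news_pred_predict(sample_texts):
    """Return the fitted model's predicted label for each text."""
    return list(model.predict(sample_texts))

sample_data = [
    "Breaking: UFO spotted over New York City!",
    "Study finds that eating chocolate can improve memory.",
]
predictions = news_pred_predict(sample_data)

for text, prediction in zip(sample_data, predictions):
    print("Text:", text)
    print("Prediction:", prediction)
    print()
```

With only four training sentences the predictions themselves carry no weight; the point is that new text must go through the same fitted vectorizer as the training data before it reaches the classifier.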