GitHub Repository: suyashi29/python-su
Path: blob/master/Natural Language Processing using Python/Text classification using Natural Language Toolkit (NLTK) .ipynb
Kernel: Python 3 (ipykernel)

Text classification using Natural Language Toolkit (NLTK) typically involves the following steps:

1. Importing necessary libraries: You'll need NLTK for text processing and classification, as well as other libraries like scikit-learn for machine learning tasks.

2. Preparing the data: This involves loading your dataset, preprocessing the text (removing punctuation, stopwords, etc.), and converting the text data into a format suitable for machine learning.

3. Feature extraction: Converting the text data into numerical format, often using techniques like TF-IDF (Term Frequency-Inverse Document Frequency).

4. Splitting the data: Divide your dataset into training and testing sets.

5. Training the classifier: Use the training data to train your classification model.

6. Evaluating the model: Use the testing data to evaluate the performance of your model.
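The preprocessing mentioned under "Preparing the data" can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook's own method: the function name `preprocess` and the tiny `TOY_STOPWORDS` set are hypothetical stand-ins. In practice you could pass in NLTK's English stopword list instead (`nltk.corpus.stopwords.words('english')`, available after `nltk.download('stopwords')`).

```python
import string

# Hypothetical, deliberately tiny stopword set used only for illustration.
TOY_STOPWORDS = {"a", "an", "the", "is", "was", "and", "of", "to", "it", "this"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text, stopwords=TOY_STOPWORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    cleaned = text.lower().translate(_PUNCT_TABLE)
    return [tok for tok in cleaned.split() if tok not in stopwords]


sample = "This movie was great, and the acting is superb!"
tokens = preprocess(sample)
print(tokens)
print(" ".join(tokens))  # joined back into one string for a vectorizer
```

Note that `TfidfVectorizer(stop_words='english')` lowercases text, ignores punctuation, and drops English stopwords on its own, so a separate cleaning pass like this is optional when that vectorizer is used.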

Here is example code for text classification using NLTK and scikit-learn:

Step 1: Import the libraries

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

Step 2: Prepare the data

Assume you have a list of text samples, documents, and a parallel list of their corresponding labels, labels.

Step 3: Feature extraction

tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf_vectorizer.fit_transform(documents)

Step 4: Split the data

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

Step 5: Train the classifier

classifier = MultinomialNB()
classifier.fit(X_train, y_train)

Step 6: Evaluate the model

y_pred = classifier.predict(X_test)

Print accuracy and classification report

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Please make sure to replace documents and labels with your actual data. This example uses a simple Multinomial Naive Bayes classifier and TF-IDF vectorization for feature extraction. Depending on your specific task, you might want to explore other classifiers and feature extraction techniques.
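To try these steps without a real dataset, the sketch below fills in `documents` and `labels` with a handful of made-up sentences (purely illustrative, not real data) and chains the vectorizer and the classifier with scikit-learn's `make_pipeline`. The pipeline is a design choice of this sketch rather than part of the example above. It is convenient because new raw text can be passed straight to `predict` without calling the vectorizer separately.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up placeholder data; replace with your own documents and labels.
documents = [
    "great film with a wonderful cast",
    "wonderful story and great acting",
    "a great and wonderful experience",
    "terrible plot and awful acting",
    "awful film with a terrible script",
    "a terrible and awful experience",
]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# The pipeline applies TF-IDF and then Naive Bayes in one fit/predict call.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(documents, labels)

# Unseen raw text is vectorized with the fitted vocabulary, then classified.
new_texts = ["what a wonderful film", "what an awful script"]
predictions = model.predict(new_texts)
print(list(predictions))
```

With only six training sentences this says nothing about real-world accuracy; it only shows how the pieces fit together.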

import random

import nltk
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

nltk.download('movie_reviews')

# Load the movie_reviews dataset as (word list, category) pairs
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the documents so the two categories are mixed
random.shuffle(documents)

# Split the documents into features (X) and labels (y)
X = [" ".join(words) for words, category in documents]
y = [category for words, category in documents]
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Suyashi144893\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
# Feature extraction (TF-IDF)
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_tfidf = tfidf_vectorizer.fit_transform(X)
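To make the TF-IDF weighting concrete, here is a small pure-Python sketch on a three-document toy corpus. It follows the smoothed formula documented for scikit-learn's default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each document vector. The corpus and the helper name `tfidf_rows` are illustrative assumptions, and the sketch is meant to show the idea rather than reproduce `TfidfVectorizer` in full.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration), already tokenized.
corpus = [
    ["good", "movie"],
    ["bad", "movie"],
    ["good", "good", "plot"],
]


def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)  # raw term frequency within this document
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows


rows = tfidf_rows(corpus)
for doc, row in zip(corpus, rows):
    print(doc, {t: round(w, 3) for t, w in row.items()})
```

In the second document, the rarer term "bad" receives a higher weight than "movie", which appears in two of the three documents.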
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)
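What `test_size=0.2` and `random_state=42` do can be seen on a tiny made-up dataset. The sketch below also passes `stratify`, which the cell above does not use; it is shown here only as an optional extra that keeps the class proportions the same in both subsets. The `toy_` names are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
from sklearn.model_selection import train_test_split

# Ten made-up samples with a balanced 5/5 label split.
toy_texts = [f"sample {i}" for i in range(10)]
toy_labels = ["neg"] * 5 + ["pos"] * 5

toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.2,        # 20% of the samples are held out for testing
    random_state=42,      # fixed seed so the split is reproducible
    stratify=toy_labels,  # keep the neg/pos ratio the same in both subsets
)

print(len(toy_train), len(toy_test))
print(sorted(toy_y_test))
```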
# Train a Multinomial Naive Bayes classifier on the training set
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Print accuracy and classification report
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Accuracy: 0.8375

Classification Report:
              precision    recall  f1-score   support

         neg       0.83      0.86      0.85       208
         pos       0.84      0.81      0.83       192

    accuracy                           0.84       400
   macro avg       0.84      0.84      0.84       400
weighted avg       0.84      0.84      0.84       400
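The figures in the report can be reproduced by hand from the counts of correct and incorrect predictions. The counts below are not printed by the notebook; they are inferred here from the rounded accuracy, recall, and support values above (179 of 208 negative and 156 of 192 positive reviews classified correctly), so treat them as an assumption. The helper name `prf` is likewise hypothetical.

```python
# Counts inferred from the report above (an assumption, not notebook output).
neg_correct, neg_total = 179, 208   # true 'neg' reviews predicted 'neg'
pos_correct, pos_total = 156, 192   # true 'pos' reviews predicted 'pos'

neg_missed = neg_total - neg_correct   # 'neg' reviews predicted 'pos'
pos_missed = pos_total - pos_correct   # 'pos' reviews predicted 'neg'


def prf(correct, missed, false_alarms):
    """Precision, recall and F1 for one class."""
    precision = correct / (correct + false_alarms)
    recall = correct / (correct + missed)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = (neg_correct + pos_correct) / (neg_total + pos_total)
# False alarms for one class are the other class's missed samples.
neg_scores = prf(neg_correct, neg_missed, pos_missed)
pos_scores = prf(pos_correct, pos_missed, neg_missed)

print("accuracy:", accuracy)
print("neg:", [round(s, 2) for s in neg_scores])
print("pos:", [round(s, 2) for s in pos_scores])
```

Precision answers "of the reviews predicted as this class, how many were right?", while recall answers "of the reviews that truly belong to this class, how many were found?". F1 is the harmonic mean of the two.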