Text classification with the Natural Language Toolkit (NLTK) typically involves the following steps:

1. Importing the necessary libraries: You'll need NLTK for text processing and corpora, plus a machine learning library such as scikit-learn for vectorization, training, and evaluation.

2. Preparing the data: Load your dataset and preprocess the text (for example, lowercasing and removing punctuation and stopwords).

3. Feature extraction: Convert the text into numerical features, often with TF-IDF (Term Frequency-Inverse Document Frequency).

4. Splitting the data: Divide your dataset into training and testing sets.

5. Training the classifier: Use the training data to fit your classification model.

6. Evaluating the model: Use the testing data to measure the performance of your model.
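
The preprocessing mentioned in the data preparation step is never shown on its own, so here is a minimal sketch of it using only the standard library. The `preprocess` function and the tiny `STOPWORDS` set are hypothetical names invented for illustration; a real project would more likely load NLTK's stopwords corpus instead of a hand-written set.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# NLTK's stopwords corpus would be the more usual choice.
STOPWORDS = {"a", "an", "and", "is", "it", "of", "the", "this", "to", "was"}


def preprocess(text):
    """Lowercase the text, strip ASCII punctuation, and drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOPWORDS]


print(preprocess("This movie was great, and the ending is a surprise!"))
```

Note that stripping punctuation this way also collapses contractions ("don't" becomes "dont"), which may or may not be acceptable for a given task.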

Here is example code for text classification using NLTK and scikit-learn:

# Step 1: Import the libraries
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

# Step 2: Prepare the data
# Assume you have a list of text samples 'documents' and their corresponding labels 'labels'

# Step 3: Feature extraction
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf_vectorizer.fit_transform(documents)

# Step 4: Split the data
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# Step 5: Train the classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Step 6: Evaluate the model
y_pred = classifier.predict(X_test)

# Print accuracy and classification report
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Replace 'documents' and 'labels' with your actual data. This example uses a simple Multinomial Naive Bayes classifier and TF-IDF vectorization for feature extraction. Depending on your specific task, you might want to explore other classifiers and feature extraction techniques.

In [1]:

import nltk
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

In [2]:

nltk.download('movie_reviews')

# Load the movie_reviews dataset
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the documents so the two categories are interleaved.
# No seed is set, so the order (and the final scores) can differ between runs.
import random
random.shuffle(documents)

# Split the documents into features (X) and labels (y)
X = [" ".join(words) for words, category in documents]
y = [category for words, category in documents]

Out[2]:

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Suyashi144893\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!

In [3]:

# Feature extraction (TF-IDF): keep the 5,000 most frequent terms and drop English stopwords
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_tfidf = tfidf_vectorizer.fit_transform(X)
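
To make the weighting concrete, here is a small from-scratch sketch of TF-IDF on pre-tokenized documents. It assumes the smoothed IDF formula and L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults; it ignores `max_features` and stopword removal, and the `tfidf` function and toy `docs` are hypothetical names used only for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize each document vector; guard against empty documents.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Toy corpus, invented for this sketch.
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
for vec in tfidf(docs):
    print({t: round(w, 3) for t, w in vec.items()})
```

In the second document, "bad" appears in fewer documents than "movie", so it receives the larger weight; that is the effect the IDF term is meant to produce.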

In [4]:

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)
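
One thing worth noting about the cells above: the vectorizer is fitted on the whole corpus before the split, so its vocabulary and IDF weights are influenced by the test documents. A common alternative, sketched below, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so that fitting only ever sees the training split. The toy `toy_texts` and `toy_labels` are hypothetical stand-ins for the notebook's raw-text `X` and `y`; this is an illustration of the idea, not a replacement for the results reported in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the raw review text and labels.
toy_texts = [
    "a great and moving film",
    "wonderful acting and a great script",
    "a fun, charming story",
    "brilliant and touching",
    "a boring, terrible mess",
    "dull plot and awful acting",
    "terrible pacing and a weak script",
    "a tedious, forgettable film",
]
toy_labels = ["pos"] * 4 + ["neg"] * 4

# Split the raw text first; stratify keeps both classes in each split.
texts_train, texts_test, labels_train, labels_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits TF-IDF on the training text only, then trains the classifier.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(texts_train, labels_train)

predictions = model.predict(texts_test)
print(list(zip(texts_test, predictions)))
```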

In [5]:

# Train a Multinomial Naive Bayes classifier on the TF-IDF features
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
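
For intuition about what the classifier learns, here is a from-scratch sketch of multinomial Naive Bayes with Laplace smoothing. It works on raw token counts, whereas the notebook feeds TF-IDF weights to scikit-learn's `MultinomialNB`, so this is an approximation of the idea rather than the library's implementation. The functions `train_nb` and `predict_nb` and the toy data are hypothetical; `alpha=1.0` mirrors the library's documented default smoothing.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from token lists."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        # Laplace smoothing: add alpha to every count in the vocabulary.
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood


def predict_nb(model, doc):
    """Return the class with the highest posterior log score."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Tokens never seen in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Toy training data, invented for this sketch.
train_docs = [["great", "fun"], ["great", "acting"], ["boring", "plot"], ["boring", "mess"]]
train_labels = ["pos", "pos", "neg", "neg"]
nb_model = train_nb(train_docs, train_labels)
print(predict_nb(nb_model, ["great", "plot", "fun"]))
```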

In [6]:

# Predict labels for the held-out test set
y_pred = classifier.predict(X_test)

# Print accuracy and classification report
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Out[6]:

Accuracy: 0.8375

Classification Report:
              precision    recall  f1-score   support

         neg       0.83      0.86      0.85       208
         pos       0.84      0.81      0.83       192

    accuracy                           0.84       400
   macro avg       0.84      0.84      0.84       400
weighted avg       0.84      0.84      0.84       400
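
The per-class numbers in the report follow from four counts. The sketch below recomputes them by hand. The counts are not printed by the notebook: they are inferred here as the integer counts that appear consistent with the reported accuracy, recall, and support, so treat them as an assumption. The helper `prf` is a hypothetical name.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Inferred from the rounded report (assumption, not notebook output).
neg_correct, neg_total = 179, 208
pos_correct, pos_total = 156, 192

neg_as_pos = neg_total - neg_correct  # true 'neg' reviews predicted as 'pos'
pos_as_neg = pos_total - pos_correct  # true 'pos' reviews predicted as 'neg'

accuracy = (neg_correct + pos_correct) / (neg_total + pos_total)
neg_scores = [round(v, 2) for v in prf(neg_correct, fp=pos_as_neg, fn=neg_as_pos)]
pos_scores = [round(v, 2) for v in prf(pos_correct, fp=neg_as_pos, fn=pos_as_neg)]

print("accuracy:", accuracy)
print("neg precision/recall/f1:", neg_scores)
print("pos precision/recall/f1:", pos_scores)
```

A false positive for one class is a false negative for the other, which is why the two rows of the report mirror each other.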
