Text classification using Natural Language Toolkit (NLTK) typically involves the following steps:
1. Importing the necessary libraries: You'll need NLTK for text processing, plus a machine learning library such as scikit-learn for feature extraction, classification, and evaluation.
2. Preparing the data: Load your dataset, preprocess the text (lowercasing, removing punctuation and stopwords, etc.), and pair each text sample with its label.
3. Feature extraction: Convert the text into numerical features, often with TF-IDF (Term Frequency-Inverse Document Frequency), which weights a term by how often it occurs in a document and how rare it is across the collection.
4. Splitting the data: Divide your dataset into training and testing sets.
5. Training the classifier: Fit your classification model on the training set only.
6. Evaluating the model: Measure the model's performance on the held-out testing set.
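To make the preprocessing and feature extraction steps concrete, here is a minimal sketch that uses only the Python standard library. The names `STOPWORDS`, `preprocess`, and `tfidf` and the three sample sentences are hypothetical, invented for illustration. The weighting is the plain textbook formula tf * ln(N / df), not necessarily what a given library computes.

```python
import math
import string
from collections import Counter

# Hypothetical stopword list, kept tiny for illustration. NLTK's
# stopwords.words("english") could replace it after nltk.download("stopwords").
STOPWORDS = {"the", "a", "is", "and", "on", "of"}

def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]

def tfidf(tokenized_docs):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Made-up sample sentences, for illustration only.
docs = ["The cat sat on the mat.", "The dog sat on the log!", "Cats and dogs."]
tokens = [preprocess(d) for d in docs]
scores = tfidf(tokens)
print(tokens[0])
print(scores[0])
```

In this sketch, "sat" appears in two of the three documents and so receives a lower weight than "cat", which appears in only one. scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalization by default, so its values will differ from these.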
Here is example code for text classification using NLTK and scikit-learn:
```python
# Step 1: Import the libraries
import nltk  # available for optional preprocessing; not used directly below
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

# Step 2: Prepare the data
# Assume you have a list of text samples 'documents' and their corresponding labels 'labels'

# Step 3: Feature extraction
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf_vectorizer.fit_transform(documents)

# Step 4: Split the data
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# Step 5: Train the classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Step 6: Evaluate the model
y_pred = classifier.predict(X_test)

# Print accuracy and classification report
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
```

Make sure to replace documents and labels with your actual data. This example uses a simple Multinomial Naive Bayes classifier with TF-IDF vectorization for feature extraction, holding out 20% of the samples for testing. Depending on your specific task, you might want to explore other classifiers and feature extraction techniques.
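As one alternative, the classifier can come from NLTK itself instead of scikit-learn. The sketch below trains NLTK's NaiveBayesClassifier on word-presence features. The toy sentences, the labels "pos" and "neg", and the helper `bag_of_words` are hypothetical, invented for illustration. A real dataset would be much larger and would normally go through the preprocessing described earlier.

```python
import nltk

# Hypothetical toy data, invented purely for illustration.
train_docs = [
    ("great movie with a wonderful cast", "pos"),
    ("wonderful story and great acting", "pos"),
    ("terrible plot and awful acting", "neg"),
    ("awful movie with a terrible ending", "neg"),
]
test_docs = [
    ("great acting", "pos"),
    ("awful plot", "neg"),
]

def bag_of_words(text):
    """Map each lowercase token to True (a word-presence feature set)."""
    return {word: True for word in text.lower().split()}

# NLTK classifiers expect a list of (feature dict, label) pairs.
train_set = [(bag_of_words(text), label) for text, label in train_docs]
test_set = [(bag_of_words(text), label) for text, label in test_docs]

classifier = nltk.NaiveBayesClassifier.train(train_set)

prediction = classifier.classify(bag_of_words("a great and wonderful film"))
accuracy = nltk.classify.accuracy(classifier, test_set)
print("Prediction:", prediction)
print("Accuracy:", accuracy)
```

This version needs no downloaded NLTK corpora because it tokenizes with str.split. Words never seen during training, such as "film" here, are ignored at prediction time.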