Path: blob/master/Week 7/Programming Assignment - 6/ex6/ex6_spam.m
%% Machine Learning Online Class
%  Exercise 6 | Spam Classification with SVMs
%
%  Instructions
%  ------------
%
%  This file contains code that helps you get started on the
%  exercise. You will need to complete the following functions:
%
%     gaussianKernel.m
%     dataset3Params.m
%     processEmail.m
%     emailFeatures.m
%
%  For this exercise, you will not need to change any code in this file,
%  or in any files other than those mentioned above.
%

%% Initialization
clear ; close all; clc

%% ==================== Part 1: Email Preprocessing ====================
%  To use an SVM to classify emails into Spam vs. Non-Spam, you first need
%  to convert each email into a vector of features. In this part, you will
%  implement the preprocessing steps for each email. You should
%  complete the code in processEmail.m to produce a word indices vector
%  for a given email.

fprintf('\nPreprocessing sample email (emailSample1.txt)\n');

% Extract Features
file_contents = readFile('emailSample1.txt');
word_indices  = processEmail(file_contents);

% Print Stats
fprintf('Word Indices: \n');
fprintf(' %d', word_indices);
fprintf('\n\n');

fprintf('Program paused. Press enter to continue.\n');
pause;

%% ==================== Part 2: Feature Extraction ====================
%  Now, you will convert each email into a vector of features in R^n.
%  You should complete the code in emailFeatures.m to produce a feature
%  vector for a given email.

fprintf('\nExtracting features from sample email (emailSample1.txt)\n');

% Extract Features
file_contents = readFile('emailSample1.txt');
word_indices  = processEmail(file_contents);
features      = emailFeatures(word_indices);

% Print Stats
fprintf('Length of feature vector: %d\n', length(features));
fprintf('Number of non-zero entries: %d\n', sum(features > 0));

fprintf('Program paused. Press enter to continue.\n');
pause;

%% =========== Part 3: Train Linear SVM for Spam Classification ========
%  In this section, you will train a linear classifier to determine if an
%  email is Spam or Not-Spam.

% Load the Spam Email dataset
% You will have X, y in your environment
load('spamTrain.mat');

fprintf('\nTraining Linear SVM (Spam Classification)\n');
fprintf('(this may take 1 to 2 minutes) ...\n');

C = 0.1;
model = svmTrain(X, y, C, @linearKernel);

p = svmPredict(model, X);

% Accuracy is the percentage of predictions that match the labels
fprintf('Training Accuracy: %f\n', mean(double(p == y)) * 100);

%% =================== Part 4: Test Spam Classification ================
%  After training the classifier, we can evaluate it on a test set. We have
%  included a test set in spamTest.mat

% Load the test dataset
% You will have Xtest, ytest in your environment
load('spamTest.mat');

fprintf('\nEvaluating the trained Linear SVM on a test set ...\n');

p = svmPredict(model, Xtest);

fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100);
pause;


%% ================= Part 5: Top Predictors of Spam ====================
%  Since the model we are training is a linear SVM, we can inspect the
%  weights learned by the model to better understand how it determines
%  whether an email is spam or not. The following code finds the words with
%  the highest weights in the classifier. Informally, the classifier
%  'thinks' that these words are the most likely indicators of spam.
%

% Sort the weights and obtain the vocabulary list
[weight, idx] = sort(model.w, 'descend');
vocabList = getVocabList();

fprintf('\nTop predictors of spam: \n');
for i = 1:15
    fprintf(' %-15s (%f) \n', vocabList{idx(i)}, weight(i));
end

fprintf('\n\n');
fprintf('\nProgram paused. Press enter to continue.\n');
pause;

%% =================== Part 6: Try Your Own Emails =====================
%  Now that you've trained the spam classifier, you can use it on your own
%  emails! In the starter code, we have included spamSample1.txt,
%  spamSample2.txt, emailSample1.txt and emailSample2.txt as examples.
%  The following code reads in one of these emails and then uses your
%  learned SVM classifier to determine whether the email is Spam or
%  Not Spam.

% Set the file to be read in (change this to spamSample2.txt,
% emailSample1.txt or emailSample2.txt to see different predictions on
% different email types). Try your own emails as well!
filename = 'spamSample1.txt';

% Read and predict
file_contents = readFile(filename);
word_indices  = processEmail(file_contents);
x             = emailFeatures(word_indices);
p = svmPredict(model, x);

fprintf('\nProcessed %s\n\nSpam Classification: %d\n', filename, p);
fprintf('(1 indicates spam, 0 indicates not spam)\n\n');
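
Part 2 asks for a feature vector in R^n built from the word indices. The following is a minimal sketch of that step, assuming a binary encoding in which entry i is 1 when vocabulary word i appears in the email. That encoding is consistent with the "non-zero entries" statistic printed in Part 2, but it is an assumption here, not the contents of emailFeatures.m. The index values and the vocabulary size `n` are made up for illustration and are not taken from the exercise data.

```matlab
% Hypothetical word indices, standing in for the output of processEmail.
% The repeated 7 models a word that occurs twice in the email.
word_indices = [3; 7; 7; 12];

% Assumed vocabulary size, for illustration only.
n = 20;

% Start from an all-zero column vector and mark each word that occurs.
% Indexed assignment sets every listed position to 1, so a repeated
% index still yields 1 rather than a count.
x = zeros(n, 1);
x(word_indices) = 1;

fprintf('Length of feature vector: %d\n', length(x));
fprintf('Number of non-zero entries: %d\n', sum(x > 0));
```

With this encoding, a word that occurs several times contributes the same as a word that occurs once. A count-based vector would be a different design choice.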