function word_indices = processEmail(email_contents)
%PROCESSEMAIL preprocesses the body of an email and
%returns a list of word_indices
%   word_indices = PROCESSEMAIL(email_contents) preprocesses
%   the body of an email and returns a list of indices of the
%   words contained in the email.
%

% Load Vocabulary
vocabList = getVocabList();

% Init return value
word_indices = [];

% ========================== Preprocess Email ===========================

% Find the Headers ( \n\n and remove )
% Uncomment the following lines if you are working with raw emails with the
% full headers

% hdrstart = strfind(email_contents, ([char(10) char(10)]));
% email_contents = email_contents(hdrstart(1):end);

% Lower case
email_contents = lower(email_contents);

% Strip all HTML
% Looks for any expression that starts with < and ends with > and does
% not have any < or > inside the tag, and replaces it with a space
email_contents = regexprep(email_contents, '<[^<>]+>', ' ');

% Handle Numbers
% Look for one or more characters between 0-9
email_contents = regexprep(email_contents, '[0-9]+', 'number');

% Handle URLS
% Look for strings starting with http:// or https://
email_contents = regexprep(email_contents, ...
                           '(http|https)://[^\s]*', 'httpaddr');

% Handle Email Addresses
% Look for strings with @ in the middle
email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');

% Handle $ sign
email_contents = regexprep(email_contents, '[$]+', 'dollar');


% ========================== Tokenize Email ===========================

% Output the email to screen as well
fprintf('\n==== Processed Email ====\n\n');

% Process file
l = 0;

while ~isempty(email_contents)

    % Tokenize and also get rid of any punctuation
    [str, email_contents] = ...
       strtok(email_contents, ...
              [' @$/#.-:&*+=[]?!(){},''">_<;%' char(10) char(13)]);

    % Remove any non alphanumeric characters
    str = regexprep(str, '[^a-zA-Z0-9]', '');

    % Stem the word
    % (the porterStemmer sometimes has issues, so we use a try catch block)
    try
        str = porterStemmer(strtrim(str));
    catch
        str = '';
        continue;
    end

    % Skip the word if it is too short
    if length(str) < 1
        continue;
    end

    % Look up the word in the dictionary and add to word_indices if
    % found
    % ====================== YOUR CODE HERE ======================
    % Instructions: Fill in this function to add the index of str to
    %               word_indices if it is in the vocabulary. At this point
    %               of the code, you have a stemmed word from the email in
    %               the variable str. You should look up str in the
    %               vocabulary list (vocabList). If a match exists, you
    %               should add the index of the word to the word_indices
    %               vector. Concretely, if str = 'action', then you should
    %               look up the vocabulary list to find where in vocabList
    %               'action' appears. For example, if vocabList{18} =
    %               'action', then you should add 18 to the word_indices
    %               vector (e.g., word_indices = [word_indices ; 18]; ).
    %
    % Note: vocabList{idx} returns the word with index idx in the
    %       vocabulary list.
    %
    % Note: You can use strcmp(str1, str2) to compare two strings (str1 and
    %       str2). It will return 1 only if the two strings are equivalent.
    %





    % =============================================================


    % Print to screen, ensuring that the output lines are not too long
    if (l + length(str) + 1) > 78
        fprintf('\n');
        l = 0;
    end
    fprintf('%s ', str);
    l = l + length(str) + 1;

end

% Print footer
fprintf('\n\n=========================\n');

end
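
% -------------------------------------------------------------------------
% Sketch only, not part of the original assignment file: one possible way
% to do the vocabulary lookup described in the "YOUR CODE HERE" block.
% The helper name lookupVocabIndex is hypothetical. It assumes vocabList is
% a column cell array of strings, which is what the vocabList{idx} note in
% the instructions suggests.
%
% A possible call from inside the while loop, under those assumptions:
%   word_indices = [word_indices ; lookupVocabIndex(vocabList, str)];
% -------------------------------------------------------------------------
function idx = lookupVocabIndex(vocabList, str)
    % strcmp against a cell array compares str with every entry and
    % returns a logical array of the same size as vocabList
    matches = strcmp(vocabList, str);

    % Keep the position of the first match only; idx is empty when str is
    % not in the vocabulary, so appending it leaves word_indices unchanged
    idx = find(matches, 1);
end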