GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_14/Enron Dataset.ipynb
Kernel: Python 3

Spam classification with the Enron Email dataset

import pandas as pd
import numpy as np
import os
from sklearn import utils

Read in Data

PATH = 'assets/dataset/enron1-training-data-raw/'
folders = os.listdir(PATH)
folders
['ham', 'spam']
from collections import defaultdict

df = defaultdict(lambda: defaultdict(list))
for category in folders:
    files = os.listdir(os.path.join(PATH, category))
    ## only read in the text files
    files = [i for i in files if '.txt' in i]
    num_docs = 0
    for file in files:
        file_path = os.path.join(PATH, category, file)
        with open(file_path, encoding='latin-1') as fp:
            line = fp.readlines()
            df[category][num_docs] = ' '.join(line)
            num_docs += 1

## Throw everything into a pandas dataframe for easy processing
df = pd.DataFrame.from_dict(df)
## Turn column names (labels) into a variable
df = pd.melt(df, var_name="Label", value_name="Features")
df = utils.shuffle(df)
## The shorter class is padded with NaN by from_dict; drop those rows
df.dropna(inplace=True)
df.head()
df['Features'].loc[3]
'Subject: re : issue\n fyi - see note below - already done .\n stella\n - - - - - - - - - - - - - - - - - - - - - - forwarded by stella l morris / hou / ect on 12 / 14 / 99 10 : 18\n am - - - - - - - - - - - - - - - - - - - - - - - - - - -\n from : sherlyn schumack on 12 / 14 / 99 10 : 06 am\n to : stella l morris / hou / ect @ ect\n cc : howard b camp / hou / ect @ ect\n subject : re : issue\n stella ,\n this has already been taken care of . you did this for me yesterday .\n thanks .\n howard b camp\n 12 / 14 / 99 09 : 10 am\n to : stella l morris / hou / ect @ ect\n cc : sherlyn schumack / hou / ect @ ect , howard b camp / hou / ect @ ect , stacey\n neuweiler / hou / ect @ ect , daren j farmer / hou / ect @ ect\n subject : issue\n stella ,\n can you work with stacey or daren to resolve\n hc\n - - - - - - - - - - - - - - - - - - - - - - forwarded by howard b camp / hou / ect on 12 / 14 / 99 09 : 08\n am - - - - - - - - - - - - - - - - - - - - - - - - - - -\n from : sherlyn schumack 12 / 13 / 99 01 : 14 pm\n to : howard b camp / hou / ect @ ect\n cc :\n subject : issue\n i have to create accounting arrangement for purchase from unocal energy at\n meter 986782 . deal not tracked for 5 / 99 . volume on deal 114427 expired 4 / 99 .'
df.shape
(5172, 2)
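The nested `defaultdict` followed by `pd.melt` and `dropna` works, but it first pads the smaller class with missing values and then removes them again. As a minimal alternative sketch, assuming the same one-folder-per-label layout, the rows can be collected directly. `load_emails` is a hypothetical helper name that does not appear in the notebook, and unlike the loop above it keeps each file's text verbatim rather than joining lines with an extra space.

```python
from pathlib import Path


def load_emails(root):
    """Return a list of (label, text) pairs, one per .txt file in root/<label>/."""
    rows = []
    for category_dir in sorted(Path(root).iterdir()):
        if not category_dir.is_dir():
            continue  # skip stray files next to the label folders
        for file_path in sorted(category_dir.glob("*.txt")):
            # latin-1 maps every byte to a character, so decoding cannot fail
            text = file_path.read_text(encoding="latin-1")
            rows.append((category_dir.name, text))
    return rows
```

The result could then be passed to `pd.DataFrame(load_emails(PATH), columns=["Label", "Features"])`, which yields one row per email without a `dropna` step.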

A bit of cleaning with textacy

#!pip install textacy
Successfully installed cachetools-2.1.0 dill-0.2.7.1 ftfy-4.4.3 ijson-2.3 msgpack-numpy-0.4.1 murmurhash-0.28.0 pathlib-1.0.1 preshed-1.0.0 pyemd-0.5.1 pyphen-0.9.4 python-levenshtein-0.12.0 regex-2017.4.5 spacy-2.0.11 termcolor-1.1.0 textacy-0.6.1 thinc-6.10.2 tqdm-4.23.4 unidecode-1.0.22
import textacy.preprocess as preprocess

def clean_enron(text):
    return preprocess.preprocess_text(text,
                                      fix_unicode=True,
                                      lowercase=True,
                                      no_urls=False,
                                      no_emails=True,
                                      no_phone_numbers=True,
                                      no_numbers=True,
                                      no_currency_symbols=False,
                                      no_punct=True,
                                      no_contractions=True)
clean_enron(df['Features'].loc[3])
'subject re issue\n fyi see note below already done stella\n forwarded by stella l morris hou ect on number number number number number am from sherlyn schumack on number number number number number am\n to stella l morris hou ect ect\n cc howard b camp hou ect ect\n subject re issue\n stella this has already been taken care of you did this for me yesterday thanks howard b camp\n number number number number number am\n to stella l morris hou ect ect\n cc sherlyn schumack hou ect ect howard b camp hou ect ect stacey\n neuweiler hou ect ect daren j farmer hou ect ect\n subject issue\n stella can you work with stacey or daren to resolve\n hc\n forwarded by howard b camp hou ect on number number number number number am from sherlyn schumack number number number number number pm\n to howard b camp hou ect ect\n cc subject issue\n i have to create accounting arrangement for purchase from unocal energy at\n meter number deal not tracked for number number volume on deal number expired number number'
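The `textacy.preprocess` import matches the textacy 0.6.1 release this notebook was run with. To my knowledge, later textacy releases reorganized these helpers, so the import may fail on a newer install. The sketch below is a rough standard-library approximation of the same idea (lowercase, replace emails and numbers with placeholder words, strip punctuation). The function name `clean_enron_simple` and the regular expressions are assumptions made for illustration; they are much cruder than textacy's rules and will not reproduce its output exactly.

```python
import re

# Rough, illustrative patterns; far simpler than textacy's own rules.
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
PUNCT_RE = re.compile(r"[^\w\s]|_")
SPACES_RE = re.compile(r"[ \t]+")


def clean_enron_simple(text):
    """Lowercase, mask emails and numbers, drop punctuation, squeeze spaces."""
    text = text.lower()
    text = EMAIL_RE.sub(" email ", text)
    text = NUMBER_RE.sub(" number ", text)
    text = PUNCT_RE.sub(" ", text)
    text = SPACES_RE.sub(" ", text)  # newlines are kept, as in the output above
    return text.strip()
```

For example, `clean_enron_simple("meter 986782 . deal not tracked for 5 / 99 .")` gives `'meter number deal not tracked for number number'`, which has the same shape as the tail of the cleaned email above.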

Perform your basic EDA - don't spend more than 5-10 minutes

df.shape
(5172, 2)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5172 entries, 77 to 460
Data columns (total 2 columns):
Label       5172 non-null object
Features    5172 non-null object
dtypes: object(2)
memory usage: 281.2+ KB
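Beyond `shape` and `info()`, two quick checks are usually worth the time for a text classifier: how balanced the labels are and how long the documents are per class. The sketch below shows one way to get both in a single table. `quick_eda` is a hypothetical helper, and the toy frame only mimics the notebook's `Label` and `Features` columns with made-up rows.

```python
import pandas as pd


def quick_eda(frame):
    """Return per-label document counts and character-length statistics."""
    lengths = frame["Features"].str.len()
    return lengths.groupby(frame["Label"]).agg(["count", "mean", "max"])


# Toy stand-in for the notebook's df (same column names, made-up rows).
toy = pd.DataFrame({
    "Label": ["ham", "ham", "spam"],
    "Features": ["see note below", "thanks", "win cash now !!!"],
})
summary = quick_eda(toy)
print(summary)
```

On the real data, `quick_eda(df)` would show the ham/spam split in the `count` column, which helps when deciding whether plain accuracy is a fair metric.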

Use sklearn's CountVectorizer and TfidfVectorizer, then a model of your choice, to classify the emails
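One possible starting point is sketched here, with multinomial naive Bayes standing in for "the model of your choice" (that choice is an assumption, not something the notebook prescribes). The training rows are made up so that the block runs on its own; in the notebook they would be `df["Features"]` and `df["Label"]`, ideally split into train and test sets first.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up rows standing in for df["Features"] and df["Label"].
texts = [
    "meeting at noon about the gas deal",
    "please see attached volume report",
    "can you review the meter report",
    "win a free prize now",
    "free cash prize click now",
    "claim free money now",
]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

# Each pipeline vectorizes the raw strings and then fits the classifier.
count_model = make_pipeline(CountVectorizer(), MultinomialNB())
tfidf_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
count_model.fit(texts, labels)
tfidf_model.fit(texts, labels)

new_emails = ["free prize now", "see the meter report"]
print(count_model.predict(new_emails))
print(tfidf_model.predict(new_emails))
```

Wrapping the vectorizer and the model in one pipeline keeps the vocabulary tied to the training data, so the same object can be reused for scoring held-out emails.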

Experiment with preprocessing steps. Do you get better results with cleaned or uncleaned data?

To clean the data, use:
- df['Features'] = df['Features'].apply(lambda x: clean_enron(x))

df['Features'] = df['Features'].apply(lambda x: clean_enron(x))
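Note that the assignment above overwrites the raw text, so a cleaned-versus-uncleaned comparison needs a copy of the original column (or a separate column for the cleaned text). A minimal sketch of such a comparison follows, assuming cross-validated accuracy as the yardstick and TF-IDF with naive Bayes as the model. `compare_preprocessing` and `toy_cleaner` are hypothetical names, and the emails are made up; in the notebook, `clean_enron` would be passed as the `cleaner`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_preprocessing(raw_texts, labels, cleaner, cv=5):
    """Return mean cross-validated accuracy for raw and for cleaned text."""
    cleaned_texts = [cleaner(t) for t in raw_texts]
    scores = {}
    for name, texts in [("raw", raw_texts), ("cleaned", cleaned_texts)]:
        # The vectorizer sits inside the pipeline, so it is refit on each
        # training fold and never sees the held-out fold.
        model = make_pipeline(TfidfVectorizer(), MultinomialNB())
        scores[name] = cross_val_score(model, texts, labels, cv=cv).mean()
    return scores


def toy_cleaner(text):
    """Stand-in for clean_enron: lowercase and keep only letters and spaces."""
    return "".join(ch for ch in text.lower() if ch.isalpha() or ch.isspace())


# Made-up emails; the real call would use the notebook's raw text and labels.
raw_texts = [
    "Subject: meeting at 10 : 30 about the gas deal",
    "Subject: volume report for meter 986782 attached",
    "Subject: re : issue , already taken care of",
    "Subject: please review the deal by 12 / 14",
    "Subject: WIN a FREE prize now !!!",
    "Subject: free cash , click now",
    "Subject: claim your free money today",
    "Subject: cheap meds , free shipping now",
]
labels = ["ham"] * 4 + ["spam"] * 4

scores = compare_preprocessing(raw_texts, labels, toy_cleaner, cv=2)
print(scores)
```

With only eight toy rows the numbers mean little; the point is the structure, which gives both variants the same model and the same folds so that any difference comes from the preprocessing.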