Path: blob/master/lessons/lesson_14/Enron Dataset.ipynb
Kernel: Python 3
Spam classification with the Enron Email dataset
In [1]:
Read in Data
In [2]:
Out[2]:
['ham', 'spam']
In [3]:
Out[3]:
In [4]:
Out[4]:
'Subject: re : issue\n fyi - see note below - already done .\n stella\n - - - - - - - - - - - - - - - - - - - - - - forwarded by stella l morris / hou / ect on 12 / 14 / 99 10 : 18\n am - - - - - - - - - - - - - - - - - - - - - - - - - - -\n from : sherlyn schumack on 12 / 14 / 99 10 : 06 am\n to : stella l morris / hou / ect @ ect\n cc : howard b camp / hou / ect @ ect\n subject : re : issue\n stella ,\n this has already been taken care of . you did this for me yesterday .\n thanks .\n howard b camp\n 12 / 14 / 99 09 : 10 am\n to : stella l morris / hou / ect @ ect\n cc : sherlyn schumack / hou / ect @ ect , howard b camp / hou / ect @ ect , stacey\n neuweiler / hou / ect @ ect , daren j farmer / hou / ect @ ect\n subject : issue\n stella ,\n can you work with stacey or daren to resolve\n hc\n - - - - - - - - - - - - - - - - - - - - - - forwarded by howard b camp / hou / ect on 12 / 14 / 99 09 : 08\n am - - - - - - - - - - - - - - - - - - - - - - - - - - -\n from : sherlyn schumack 12 / 13 / 99 01 : 14 pm\n to : howard b camp / hou / ect @ ect\n cc :\n subject : issue\n i have to create accounting arrangement for purchase from unocal energy at\n meter 986782 . deal not tracked for 5 / 99 . volume on deal 114427 expired 4 / 99 .'
In [5]:
Out[5]:
(5172, 2)
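The loading cells are not shown in this export. As a rough sketch only, assuming the dataset directory holds one sub-folder per class (the `['ham', 'spam']` listing in Out[2]) with one `.txt` file per email, the data could be read like this. The function name `load_enron`, the `*.txt` pattern, and the `latin-1` encoding are assumptions, not the notebook's actual code.

```python
import tempfile
from pathlib import Path


def load_enron(root, labels=("ham", "spam")):
    """Read every *.txt file under root/<label>/ into (label, text) rows."""
    rows = []
    for label in labels:
        for path in sorted((Path(root) / label).glob("*.txt")):
            # latin-1 is an assumption; it decodes any byte sequence without error.
            rows.append((label, path.read_text(encoding="latin-1")))
    return rows


# Tiny stand-in corpus (hypothetical files) so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as root:
    for label, body in [("ham", "Subject: re : issue"), ("spam", "Subject: win now")]:
        folder = Path(root) / label
        folder.mkdir()
        (folder / "0001.txt").write_text(body, encoding="latin-1")
    rows = load_enron(root)

print(rows)
```

Passing the rows to `pandas.DataFrame(rows, columns=["Label", "Features"])` would give a two-column frame of the kind whose shape is reported in Out[5].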
A bit of cleaning with textacy
In [8]:
Out[8]:
Collecting textacy
Downloading https://files.pythonhosted.org/packages/41/9f/22b9dec63bff5e6ef7fb47b2cd37025087c3995b6ca5467d78160f5b0eb3/textacy-0.6.1-py2.py3-none-any.whl (137kB)
Collecting tqdm>=4.11.1 (from textacy)
Downloading https://files.pythonhosted.org/packages/93/24/6ab1df969db228aed36a648a8959d1027099ce45fad67532b9673d533318/tqdm-4.23.4-py2.py3-none-any.whl (42kB)
Collecting unidecode>=0.04.19 (from textacy)
Downloading https://files.pythonhosted.org/packages/59/ef/67085e30e8bbcdd76e2f0a4ad8151c13a2c5bce77c85f8cad6e1f16fb141/Unidecode-1.0.22-py2.py3-none-any.whl (235kB)
Collecting ijson>=2.3 (from textacy)
Downloading https://files.pythonhosted.org/packages/7f/e9/8508c5f4987ba238a2b169e582c1f70a47272b22a2f1fb06b9318201bb9e/ijson-2.3-py2.py3-none-any.whl
Requirement already satisfied: scipy>=0.17.0 in /anaconda3/lib/python3.6/site-packages (from textacy)
Requirement already satisfied: scikit-learn>=0.17.0 in /anaconda3/lib/python3.6/site-packages (from textacy)
Requirement already satisfied: cytoolz>=0.8.0 in /anaconda3/lib/python3.6/site-packages (from textacy)
Collecting cachetools>=2.0.0 (from textacy)
Downloading https://files.pythonhosted.org/packages/0a/58/cbee863250b31d80f47401d04f34038db6766f95dea1cc909ea099c7e571/cachetools-2.1.0-py2.py3-none-any.whl
Collecting pyemd>=0.3.0 (from textacy)
Downloading https://files.pythonhosted.org/packages/b8/b1/713de7261a0062ce41c4e2caaa16fe033890fd961b70d637c20951a1c7cf/pyemd-0.5.1-cp36-cp36m-macosx_10_13_x86_64.whl (81kB)
Requirement already satisfied: numpy<2.0.0,>=1.9.0 in /anaconda3/lib/python3.6/site-packages (from textacy)
Requirement already satisfied: requests>=2.10.0 in /anaconda3/lib/python3.6/site-packages (from textacy)
Collecting pyphen>=0.9.4 (from textacy)
Downloading https://files.pythonhosted.org/packages/dd/c4/74859f895e2361d92cfbb6208ea7afd06c2f1f05c491da71cbd7ce3887be/Pyphen-0.9.4-py2.py3-none-any.whl (1.9MB)
Collecting ftfy<5.0.0,>=4.2.0 (from textacy)
Downloading https://files.pythonhosted.org/packages/21/5d/9385540977b00df1f3a0c0f07b7e6c15b5e7a3109d7f6ae78a0a764dab22/ftfy-4.4.3.tar.gz (50kB)
Collecting spacy>=2.0.0 (from textacy)
Downloading https://files.pythonhosted.org/packages/3c/31/e60f88751e48851b002f78a35221d12300783d5a43d4ef12fbf10cca96c3/spacy-2.0.11.tar.gz (17.6MB)
Requirement already satisfied: networkx>=1.11 in /anaconda3/lib/python3.6/site-packages (from textacy)
Collecting python-levenshtein>=0.12.0 (from textacy)
Downloading https://files.pythonhosted.org/packages/42/a9/d1785c85ebf9b7dfacd08938dd028209c34a0ea3b1bcdb895208bd40a67d/python-Levenshtein-0.12.0.tar.gz (48kB)
Requirement already satisfied: toolz>=0.8.0 in /anaconda3/lib/python3.6/site-packages (from cytoolz>=0.8.0->textacy)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /anaconda3/lib/python3.6/site-packages (from requests>=2.10.0->textacy)
Requirement already satisfied: idna<2.7,>=2.5 in /anaconda3/lib/python3.6/site-packages (from requests>=2.10.0->textacy)
Requirement already satisfied: urllib3<1.23,>=1.21.1 in /anaconda3/lib/python3.6/site-packages (from requests>=2.10.0->textacy)
Requirement already satisfied: certifi>=2017.4.17 in /anaconda3/lib/python3.6/site-packages (from requests>=2.10.0->textacy)
Requirement already satisfied: html5lib in /anaconda3/lib/python3.6/site-packages (from ftfy<5.0.0,>=4.2.0->textacy)
Requirement already satisfied: wcwidth in /anaconda3/lib/python3.6/site-packages (from ftfy<5.0.0,>=4.2.0->textacy)
Collecting murmurhash<0.29,>=0.28 (from spacy>=2.0.0->textacy)
Downloading https://files.pythonhosted.org/packages/5e/31/c8c1ecafa44db30579c8c457ac7a0f819e8b1dbc3e58308394fff5ff9ba7/murmurhash-0.28.0.tar.gz
Requirement already satisfied: cymem<1.32,>=1.30 in /anaconda3/lib/python3.6/site-packages (from spacy>=2.0.0->textacy)
Collecting preshed<2.0.0,>=1.0.0 (from spacy>=2.0.0->textacy)
Downloading https://files.pythonhosted.org/packages/1b/ac/7c17b1fd54b60972785b646d37da2826311cca70842c011c4ff84fbe95e0/preshed-1.0.0.tar.gz (89kB)
Collecting thinc<6.11.0,>=6.10.1 (from spacy>=2.0.0->textacy)
Downloading https://files.pythonhosted.org/packages/55/fd/e9f36081e6f53699943381858848f3b4d759e0dd03c43b98807dde34c252/thinc-6.10.2.tar.gz (1.2MB)
Requirement already satisfied: plac<1.0.0,>=0.9.6 in /anaconda3/lib/python3.6/site-packages (from spacy>=2.0.0->textacy)
Collecting pathlib (from spacy>=2.0.0->textacy)
Downloading https://files.pythonhosted.org/packages/ac/aa/9b065a76b9af472437a0059f77e8f962fe350438b927cb80184c32f075eb/pathlib-1.0.1.tar.gz (49kB)
Requirement already satisfied: ujson>=1.35 in /anaconda3/lib/python3.6/site-packages (from spacy>=2.0.0->textacy)
Collecting dill<0.3,>=0.2 (from spacy>=2.0.0->textacy)
Downloading https://files.pythonhosted.org/packages/91/a0/19d4d31dee064fc553ae01263b5c55e7fb93daff03a69debbedee647c5a0/dill-0.2.7.1.tar.gz (64kB)
Collecting regex==2017.4.5 (from spacy>=2.0.0->textacy)
Downloading https://files.pythonhosted.org/packages/36/62/c0c0d762ffd4ffaf39f372eb8561b8d491a11ace5a7884610424a8b40f95/regex-2017.04.05.tar.gz (601kB)
Requirement already satisfied: decorator>=4.1.0 in /anaconda3/lib/python3.6/site-packages (from networkx>=1.11->textacy)
Requirement already satisfied: setuptools in /anaconda3/lib/python3.6/site-packages (from python-levenshtein>=0.12.0->textacy)
Requirement already satisfied: six>=1.9 in /anaconda3/lib/python3.6/site-packages (from html5lib->ftfy<5.0.0,>=4.2.0->textacy)
Requirement already satisfied: webencodings in /anaconda3/lib/python3.6/site-packages (from html5lib->ftfy<5.0.0,>=4.2.0->textacy)
Requirement already satisfied: wrapt in /anaconda3/lib/python3.6/site-packages (from thinc<6.11.0,>=6.10.1->spacy>=2.0.0->textacy)
Collecting termcolor (from thinc<6.11.0,>=6.10.1->spacy>=2.0.0->textacy)
Downloading https://files.pythonhosted.org/packages/8a/48/a76be51647d0eb9f10e2a4511bf3ffb8cc1e6b14e9e4fab46173aa79f981/termcolor-1.1.0.tar.gz
Requirement already satisfied: msgpack-python in /anaconda3/lib/python3.6/site-packages (from thinc<6.11.0,>=6.10.1->spacy>=2.0.0->textacy)
Collecting msgpack-numpy==0.4.1 (from thinc<6.11.0,>=6.10.1->spacy>=2.0.0->textacy)
Downloading https://files.pythonhosted.org/packages/2e/43/393e30e2768b0357541ac95891f96b80ccc4d517e0dd2fa3042fc8926538/msgpack_numpy-0.4.1-py2.py3-none-any.whl
Building wheels for collected packages: ftfy, spacy, python-levenshtein, murmurhash, preshed, thinc, pathlib, dill, regex, termcolor
Running setup.py bdist_wheel for ftfy ... done
Stored in directory: /Users/ystrano/Library/Caches/pip/wheels/37/54/00/d320239bfc8aad1455314f302dd82a75253fc585e17b81704e
Running setup.py bdist_wheel for spacy ... done
Stored in directory: /Users/ystrano/Library/Caches/pip/wheels/fb/00/28/75c85d5135e7d9a100639137d1847d41e914ed16c962d467e4
Running setup.py bdist_wheel for python-levenshtein ... done
Stored in directory: /Users/ystrano/Library/Caches/pip/wheels/de/c2/93/660fd5f7559049268ad2dc6d81c4e39e9e36518766eaf7e342
Running setup.py bdist_wheel for murmurhash ... done
Stored in directory: /Users/ystrano/Library/Caches/pip/wheels/b8/94/a4/f69f8664cdc1098603df44771b7fec5fd1b3d8364cdd83f512
Running setup.py bdist_wheel for preshed ... done
Stored in directory: /Users/ystrano/Library/Caches/pip/wheels/8f/85/06/2d132fb649a6bbcab22487e4147880a55b0dd0f4b18fdfd6b5
Running setup.py bdist_wheel for thinc ... done
Stored in directory: /Users/ystrano/Library/Caches/pip/wheels/d8/5c/3e/9acf5d9974fb1c9e7b467563ea5429c9325f67306e93147961
Running setup.py bdist_wheel for pathlib ... done
Stored in directory: /Users/ystrano/Library/Caches/pip/wheels/f9/b2/4a/68efdfe5093638a9918bd1bb734af625526e849487200aa171
Running setup.py bdist_wheel for dill ... done
Stored in directory: /Users/ystrano/Library/Caches/pip/wheels/99/c4/ed/1b64d2d5809e60d5a3685530432f6159d6a9959739facb61f2
Running setup.py bdist_wheel for regex ... done
Stored in directory: /Users/ystrano/Library/Caches/pip/wheels/75/07/38/3c16b529d50cb4e0cd3dbc7b75cece8a09c132692c74450b01
Running setup.py bdist_wheel for termcolor ... done
Stored in directory: /Users/ystrano/Library/Caches/pip/wheels/7c/06/54/bc84598ba1daf8f970247f550b175aaaee85f68b4b0c5ab2c6
Successfully built ftfy spacy python-levenshtein murmurhash preshed thinc pathlib dill regex termcolor
Installing collected packages: tqdm, unidecode, ijson, cachetools, pyemd, pyphen, ftfy, murmurhash, preshed, dill, termcolor, pathlib, msgpack-numpy, thinc, regex, spacy, python-levenshtein, textacy
Found existing installation: murmurhash 0.26.4
DEPRECATION: Uninstalling a distutils installed project (murmurhash) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.
Uninstalling murmurhash-0.26.4:
Successfully uninstalled murmurhash-0.26.4
Found existing installation: preshed 0.46.4
DEPRECATION: Uninstalling a distutils installed project (preshed) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.
Uninstalling preshed-0.46.4:
Successfully uninstalled preshed-0.46.4
Found existing installation: thinc 5.0.8
DEPRECATION: Uninstalling a distutils installed project (thinc) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.
Uninstalling thinc-5.0.8:
Successfully uninstalled thinc-5.0.8
Found existing installation: spacy 0.101.0
DEPRECATION: Uninstalling a distutils installed project (spacy) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.
Uninstalling spacy-0.101.0:
Successfully uninstalled spacy-0.101.0
Successfully installed cachetools-2.1.0 dill-0.2.7.1 ftfy-4.4.3 ijson-2.3 msgpack-numpy-0.4.1 murmurhash-0.28.0 pathlib-1.0.1 preshed-1.0.0 pyemd-0.5.1 pyphen-0.9.4 python-levenshtein-0.12.0 regex-2017.4.5 spacy-2.0.11 termcolor-1.1.0 textacy-0.6.1 thinc-6.10.2 tqdm-4.23.4 unidecode-1.0.22
In [10]:
In [11]:
Out[11]:
'subject re issue\n fyi see note below already done stella\n forwarded by stella l morris hou ect on number number number number number am from sherlyn schumack on number number number number number am\n to stella l morris hou ect ect\n cc howard b camp hou ect ect\n subject re issue\n stella this has already been taken care of you did this for me yesterday thanks howard b camp\n number number number number number am\n to stella l morris hou ect ect\n cc sherlyn schumack hou ect ect howard b camp hou ect ect stacey\n neuweiler hou ect ect daren j farmer hou ect ect\n subject issue\n stella can you work with stacey or daren to resolve\n hc\n forwarded by howard b camp hou ect on number number number number number am from sherlyn schumack number number number number number pm\n to howard b camp hou ect ect\n cc subject issue\n i have to create accounting arrangement for purchase from unocal energy at\n meter number deal not tracked for number number volume on deal number expired number number'
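The cell that defines the cleaning step is not shown in this export. Judging only from Out[11], the cleaned text is lowercased, has its punctuation removed, and has every run of digits replaced by the word `number`. The sketch below approximates that effect with plain regular expressions. It is a stand-in, not the textacy call the notebook used, and it will not reproduce the exact whitespace and line-break handling seen in the output above.

```python
import re


def clean_enron(text):
    """Regex approximation of the cleaning visible in Out[11] (not textacy itself)."""
    text = text.lower()
    # Replace each run of digits with the placeholder word "number".
    text = re.sub(r"\d+", "number", text)
    # Turn punctuation into spaces; keep letters, digits, underscores, whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse repeated spaces and tabs but leave newlines alone.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


cleaned = clean_enron("Subject: re : issue 12 / 14 / 99")
print(cleaned)
```

Because the function takes a string and returns a string, it can be passed directly to `Series.apply` on the `Features` column.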
Perform your basic EDA - don't spend more than 5-10 minutes
In [14]:
Out[14]:
(5172, 2)
In [16]:
Out[16]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5172 entries, 77 to 460
Data columns (total 2 columns):
Label 5172 non-null object
Features 5172 non-null object
dtypes: object(2)
memory usage: 281.2+ KB
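Beyond `shape` and `info()`, a quick look at class balance and text length is usually worth the few minutes allowed. The sketch below runs on a toy frame with the same `Label` and `Features` columns. The rows are hypothetical, and the helper name `basic_eda` is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for the notebook's DataFrame (hypothetical rows, not Enron data).
df = pd.DataFrame({
    "Label": ["ham", "ham", "spam"],
    "Features": ["meeting at noon", "see note below", "win money now now"],
})


def basic_eda(frame):
    """Return class counts, missing-value counts, and mean text length per class."""
    label_counts = frame["Label"].value_counts()
    missing = frame.isnull().sum()
    lengths = frame["Features"].str.len()
    mean_length = lengths.groupby(frame["Label"]).mean()
    return label_counts, missing, mean_length


label_counts, missing, mean_length = basic_eda(df)
print(label_counts)
print(missing)
print(mean_length)
```

On the real data, a strongly uneven `label_counts` would be a reason to look at precision and recall rather than accuracy alone.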
Use sklearn's CountVectorizer and TfidfVectorizer, then a model of your choice, to classify the emails
Experiment with preprocessing steps. Do you get better results with cleaned or uncleaned data?
To clean the data, use: df['Features'] = df['Features'].apply(clean_enron)
In [19]:
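One way to set up the comparison is sketched below: each vectorizer is paired with a classifier in a pipeline and scored by cross-validation. `MultinomialNB` is just one possible "model of your choice", the helper name `compare_vectorizers` is hypothetical, and the toy texts are invented so the block runs without the dataset. On the real frame, `texts` would be `df['Features']` (cleaned or uncleaned) and `labels` would be `df['Label']`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_vectorizers(texts, labels, cv=5):
    """Return mean cross-validated accuracy for count and tf-idf features."""
    results = {}
    for name, vectorizer in [("count", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
        # Fitting the vectorizer inside the pipeline keeps each test fold unseen.
        pipeline = make_pipeline(vectorizer, MultinomialNB())
        scores = cross_val_score(pipeline, texts, labels, cv=cv)
        results[name] = scores.mean()
    return results


# Invented toy corpus: four ham and four spam messages.
texts = [
    "meeting moved to monday morning",
    "please see the attached gas report",
    "can you review the deal volume",
    "thanks for the meter update",
    "win free money now",
    "cheap pills free shipping",
    "free offer click now",
    "claim your free prize money",
]
labels = ["ham"] * 4 + ["spam"] * 4

results = compare_vectorizers(texts, labels, cv=2)
print(results)
```

Running the same function once on the raw text and once on the cleaned text gives a direct answer to the cleaned-versus-uncleaned question, since everything else in the pipeline stays fixed.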