Path: blob/master/text-classification/notebooks/Text Classification with Logistic Regression.ipynb
Kernel: Python 3
Read Data
In [2]:
In [3]:
Out[3]:
authors object
category object
date datetime64[ns]
headline object
link object
short_description object
dtype: object
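The loading code is not preserved in this export. As a minimal sketch, a frame with these columns could be read as follows, assuming the data is stored as JSON Lines (one article per line). The two records are made up, and an in-memory buffer stands in for the real file:

```python
import io

import pandas as pd

# Two made-up records in JSON Lines form; a real run would pass a file path
# to pd.read_json instead of this in-memory buffer.
sample = io.StringIO(
    '{"authors": "A. Writer", "category": "POLITICS", "date": "2018-05-26", '
    '"headline": "Example headline one", '
    '"link": "https://example.com/entry/example-headline-one", '
    '"short_description": "First example description."}\n'
    '{"authors": "B. Writer", "category": "EDUCATION", "date": "2014-07-02", '
    '"headline": "Example headline two", '
    '"link": "https://example.com/entry/example-headline-two", '
    '"short_description": "Second example description."}\n'
)

# lines=True parses one JSON object per line; by default pandas also tries to
# convert a column named "date" to a datetime dtype.
df = pd.read_json(sample, lines=True)

print(df.dtypes)
print(len(df))
```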
In [4]:
Out[4]:
124989
In [5]:
Out[5]:
Date range
The articles were published between July 2014 and July 2018.
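A date range like this can be checked directly from the `date` column. The sketch below uses a tiny made-up frame rather than the real data, and also counts articles per month, which is one possible basis for a volume-over-time plot:

```python
import pandas as pd

# Made-up dates standing in for the real "date" column.
df = pd.DataFrame(
    {"date": pd.to_datetime(["2014-07-02", "2016-01-15", "2016-01-20", "2018-07-01"])}
)

# Earliest and latest publication dates.
first, last = df["date"].min(), df["date"].max()

# Number of articles per calendar month.
per_month = df.groupby(df["date"].dt.to_period("M")).size()

print(first, last)
print(per_month)
```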
In [50]:
Out[50]:
<matplotlib.axes._subplots.AxesSubplot at 0x138e995f8>
Category Distribution
Number of categories
In [7]:
Out[7]:
31
Category by count
Politics is the largest category by article count. Education-related articles have the lowest volume.
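The number of categories and the per-category counts can both be derived from the `category` column. A minimal sketch on made-up labels (the real frame has 31 categories):

```python
import pandas as pd

# Made-up labels standing in for the real "category" column.
df = pd.DataFrame(
    {
        "category": [
            "POLITICS", "POLITICS", "POLITICS",
            "ENTERTAINMENT", "ENTERTAINMENT",
            "EDUCATION",
        ]
    }
)

# Number of distinct categories.
n_categories = df["category"].nunique()

# Articles per category, sorted from most to least frequent.
counts = df["category"].value_counts()

print(n_categories)
print(counts)
```

Calling `counts.plot(kind="bar")` on the result would produce a bar chart of the distribution.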
In [8]:
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x1190fe668>
Texts for Classification
Several fields can be used as input for the classification task. We create three different versions of the input text.
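The cells that build these versions are not preserved. One way to do it is sketched below. The model titles later in the notebook confirm a description-only version and a description, headline, and URL version; the middle version (description plus headline), the column names, and the `tokenize_url` helper are assumptions made for illustration:

```python
import re

import pandas as pd


def tokenize_url(url: str) -> str:
    """Turn a URL into space-separated word tokens (hypothetical helper)."""
    url = re.sub(r"^https?://(www\.)?", "", url)
    return " ".join(token for token in re.split(r"[^A-Za-z0-9]+", url) if token)


# One made-up article with the three raw fields.
df = pd.DataFrame(
    {
        "short_description": ["Lawmakers debate the new plan."],
        "headline": ["Senate Weighs Tax Plan"],
        "link": ["https://www.example.com/entry/senate-tax-plan_us_123"],
    }
)

df["tokenized_url"] = df["link"].apply(tokenize_url)

# Three progressively richer text versions (column names are illustrative).
df["text_desc"] = df["short_description"]
df["text_desc_headline"] = df["short_description"] + " " + df["headline"]
df["text_desc_headline_url"] = df["text_desc_headline"] + " " + df["tokenized_url"]

print(df["text_desc_headline_url"].iloc[0])
```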
In [9]:
In [10]:
In [11]:
Train a Single Model
Model - 1 (binary features with description only)
In [12]:
Out[12]:
2019-11-25 12:41:33,612 : INFO : Starting model training...
2019-11-25 12:41:33,739 : INFO : Extracting features and creating vocabulary...
2019-11-25 12:41:36,742 : INFO : Training a Logistic Regression Model...
[LibLinear]
2019-11-25 12:44:30,134 : INFO : Starting evaluation...
2019-11-25 12:44:30,202 : INFO : Done training and evaluation.
Accuracy=0.5980542754736303; MRR=0.48048941798943345
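The training cell for this model is not preserved. A minimal sketch of binary bag-of-words features feeding a logistic regression is given here, assuming scikit-learn and the `liblinear` solver (suggested by the `[LibLinear]` line in the log). The training texts are made up and use two classes only to keep the example small:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data; the notebook trains on article descriptions.
train_texts = [
    "senate passes new budget bill",
    "president signs election law",
    "team wins the championship game",
    "striker scores in the final match",
]
train_labels = ["POLITICS", "POLITICS", "SPORTS", "SPORTS"]

# binary=True records only whether a word occurs, not how often.
vectorizer = CountVectorizer(binary=True)
model = make_pipeline(vectorizer, LogisticRegression(solver="liblinear"))
model.fit(train_texts, train_labels)

# A repeated word still yields a feature value of 1.
repeated = vectorizer.transform(["budget budget budget"])
print(repeated.max())

prediction = model.predict(["senate budget vote"])
print(prediction)
```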
Model - 2 (TF-IDF features with description only)
In [13]:
Out[13]:
2019-11-25 12:44:30,242 : INFO : Starting model training...
2019-11-25 12:44:30,308 : INFO : Extracting features and creating vocabulary...
2019-11-25 12:44:33,389 : INFO : Training a Logistic Regression Model...
[LibLinear]
2019-11-25 12:45:21,446 : INFO : Starting evaluation...
2019-11-25 12:45:21,515 : INFO : Done training and evaluation.
Accuracy=0.6306323604710702; MRR=0.5108380269670774
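To illustrate how TF-IDF features differ from binary ones, the sketch below vectorizes two made-up sentences both ways, assuming scikit-learn's `CountVectorizer` and `TfidfVectorizer` with default settings. A word that appears in every document ("the") receives a lower TF-IDF weight than a word specific to one document ("senate"), while the binary representation treats them identically:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two made-up documents; "the" occurs in both, "senate" in only one.
docs = ["the senate passes budget", "the team wins game"]

binary_vectorizer = CountVectorizer(binary=True).fit(docs)
tfidf_vectorizer = TfidfVectorizer().fit(docs)

# Feature vectors for the first document.
binary_row = binary_vectorizer.transform(docs[:1]).toarray()[0]
tfidf_row = tfidf_vectorizer.transform(docs[:1]).toarray()[0]

b_vocab = binary_vectorizer.vocabulary_
t_vocab = tfidf_vectorizer.vocabulary_

# Binary: both words get the same value.
print(binary_row[b_vocab["the"]], binary_row[b_vocab["senate"]])

# TF-IDF: the document-specific word gets the larger weight.
print(tfidf_row[t_vocab["the"]], tfidf_row[t_vocab["senate"]])
```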
Model - 3 (TF-IDF features with description, headline, URL)
In [14]:
Out[14]:
2019-11-25 12:45:21,554 : INFO : Starting model training...
2019-11-25 12:45:21,620 : INFO : Extracting features and creating vocabulary...
2019-11-25 12:45:27,755 : INFO : Training a Logistic Regression Model...
[LibLinear]
2019-11-25 12:46:27,562 : INFO : Starting evaluation...
2019-11-25 12:46:27,634 : INFO : Done training and evaluation.
Accuracy=0.8672555043522785; MRR=0.7511520737327071
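The evaluation code is not preserved, so the exact definitions behind `Accuracy` and `MRR` are unknown. One plausible reading, given that the predictions later in the notebook are ranked lists of two labels, is sketched here as an assumption: accuracy counts a hit when the true label appears anywhere in the ranked list, and MRR (mean reciprocal rank) averages 1/rank of the true label, scoring 0 when it is absent. The function names and example data are made up:

```python
def accuracy_at_k(ranked_predictions, true_labels):
    """Fraction of items whose true label appears in the ranked predictions."""
    hits = sum(
        1 for preds, truth in zip(ranked_predictions, true_labels) if truth in preds
    )
    return hits / len(true_labels)


def mean_reciprocal_rank(ranked_predictions, true_labels):
    """Average of 1/rank of the true label; an absent label contributes 0."""
    total = 0.0
    for preds, truth in zip(ranked_predictions, true_labels):
        if truth in preds:
            total += 1.0 / (preds.index(truth) + 1)
    return total / len(true_labels)


# Made-up ranked predictions (best guess first) and true labels.
ranked = [
    ["POLITICS", "CRIME"],          # true label at rank 1
    ["STYLE", "ENTERTAINMENT"],     # true label at rank 2
    ["BUSINESS", "POLITICS"],       # true label missing
]
truth = ["POLITICS", "ENTERTAINMENT", "SCIENCE"]

acc = accuracy_at_k(ranked, truth)
mrr = mean_reciprocal_rank(ranked, truth)
print(acc, mrr)
```

Under these definitions MRR can never exceed accuracy, because a hit at rank 2 counts fully toward accuracy but only half toward MRR.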
Check Predictions on Unseen Articles from CNN (not HuffPost, the source of our training data)
In [15]:
Out[15]:
[['POLITICS', 'CRIME']]
In [16]:
Out[16]:
[['ENTERTAINMENT', 'STYLE']]
In [17]:
Out[17]:
[['ENTERTAINMENT', 'STYLE']]
In [18]:
Out[18]:
[['BUSINESS', 'POLITICS']]
In [19]:
Out[19]:
[['SCIENCE', 'HEALTHY LIVING']]
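Each output above is a ranked list of two labels per article. A sketch of how such top-k predictions could be produced from class probabilities is given here, assuming a scikit-learn pipeline. The `top_k_predictions` helper and the training texts are made up, and the default solver is used rather than `liblinear` because the toy data has three classes:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def top_k_predictions(model, texts, k=2):
    """Return the k most probable class labels per text, best first."""
    probabilities = model.predict_proba(texts)
    # Sort class indices by descending probability and keep the first k.
    top_indices = np.argsort(probabilities, axis=1)[:, ::-1][:, :k]
    return [[str(label) for label in model.classes_[row]] for row in top_indices]


# Made-up training data with three classes.
train_texts = [
    "senate passes new budget bill",
    "president signs election law",
    "team wins the championship game",
    "striker scores in the final match",
    "telescope spots a distant galaxy",
    "researchers sequence a rare genome",
]
train_labels = ["POLITICS", "POLITICS", "SPORTS", "SPORTS", "SCIENCE", "SCIENCE"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

predictions = top_k_predictions(model, ["senate budget vote"], k=2)
print(predictions)
```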
Train Different Types of Models
In [20]:
Out[20]:
2019-11-25 12:46:27,728 : INFO : Starting model training...
2019-11-25 12:46:27,788 : INFO : Extracting features and creating vocabulary...
2019-11-25 12:46:30,778 : INFO : Training a Logistic Regression Model...
[LibLinear]
2019-11-25 12:49:25,346 : INFO : Starting evaluation...
2019-11-25 12:49:25,419 : INFO : Done training and evaluation.
2019-11-25 12:49:25,462 : INFO : Starting model training...
2019-11-25 12:49:25,523 : INFO : Extracting features and creating vocabulary...
2019-11-25 12:49:28,496 : INFO : Training a Logistic Regression Model...
[LibLinear]
2019-11-25 12:53:27,625 : INFO : Starting evaluation...
2019-11-25 12:53:27,701 : INFO : Done training and evaluation.
2019-11-25 12:53:27,735 : INFO : Starting model training...
2019-11-25 12:53:27,797 : INFO : Extracting features and creating vocabulary...
2019-11-25 12:53:31,055 : INFO : Training a Logistic Regression Model...
[LibLinear]
2019-11-25 12:54:17,419 : INFO : Starting evaluation...
2019-11-25 12:54:17,493 : INFO : Done training and evaluation.
2019-11-25 12:54:17,527 : INFO : Starting model training...
2019-11-25 12:54:17,606 : INFO : Extracting features and creating vocabulary...
2019-11-25 12:54:22,294 : INFO : Training a Logistic Regression Model...
[LibLinear]
2019-11-25 12:57:33,965 : INFO : Starting evaluation...
2019-11-25 12:57:34,034 : INFO : Done training and evaluation.
2019-11-25 12:57:34,072 : INFO : Starting model training...
2019-11-25 12:57:34,132 : INFO : Extracting features and creating vocabulary...
2019-11-25 12:57:38,488 : INFO : Training a Logistic Regression Model...
[LibLinear]
2019-11-25 13:02:22,456 : INFO : Starting evaluation...
2019-11-25 13:02:22,513 : INFO : Done training and evaluation.
2019-11-25 13:02:22,546 : INFO : Starting model training...
2019-11-25 13:02:22,594 : INFO : Extracting features and creating vocabulary...
2019-11-25 13:02:27,275 : INFO : Training a Logistic Regression Model...
[LibLinear]
2019-11-25 13:03:19,438 : INFO : Starting evaluation...
2019-11-25 13:03:19,507 : INFO : Done training and evaluation.
2019-11-25 13:03:19,543 : INFO : Starting model training...
2019-11-25 13:03:19,601 : INFO : Extracting features and creating vocabulary...
2019-11-25 13:03:25,400 : INFO : Training a Logistic Regression Model...
[LibLinear]
2019-11-25 13:06:27,931 : INFO : Starting evaluation...
2019-11-25 13:06:28,002 : INFO : Done training and evaluation.
2019-11-25 13:06:28,057 : INFO : Starting model training...
2019-11-25 13:06:28,127 : INFO : Extracting features and creating vocabulary...
2019-11-25 13:06:34,953 : INFO : Training a Logistic Regression Model...
[LibLinear]
2019-11-25 13:11:21,625 : INFO : Starting evaluation...
2019-11-25 13:11:21,697 : INFO : Done training and evaluation.
2019-11-25 13:11:21,746 : INFO : Starting model training...
2019-11-25 13:11:21,805 : INFO : Extracting features and creating vocabulary...
2019-11-25 13:11:28,276 : INFO : Training a Logistic Regression Model...
[LibLinear]
2019-11-25 13:12:26,150 : INFO : Starting evaluation...
2019-11-25 13:12:26,222 : INFO : Done training and evaluation.
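The log above records nine training runs. One arrangement that would produce nine runs is three text versions crossed with three feature types; the sketch below loops over such a grid and collects the scores in a table. The grid itself, the `counts` feature type, the column names, and the toy data are all assumptions, since the original cell is not preserved:

```python
from itertools import product

import pandas as pd
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def make_frame(rows):
    """rows: (category, description, headline, URL tokens) tuples."""
    frame = pd.DataFrame(rows, columns=["category", "desc", "headline", "url"])
    frame["text_desc"] = frame["desc"]
    frame["text_desc_headline"] = frame["desc"] + " " + frame["headline"]
    frame["text_desc_headline_url"] = frame["text_desc_headline"] + " " + frame["url"]
    return frame


# Made-up training and test articles.
train = make_frame([
    ("POLITICS", "lawmakers debate the bill", "senate passes budget", "senate budget vote"),
    ("POLITICS", "voters head to the polls", "election day arrives", "election day polls"),
    ("SPORTS", "fans celebrate downtown", "team wins championship", "team wins title"),
    ("SPORTS", "a late goal decides it", "striker scores in final", "striker final goal"),
])
test = make_frame([
    ("POLITICS", "the bill moves forward", "senate schedules vote", "senate vote bill"),
    ("SPORTS", "fans pack the stadium", "team reaches final", "team final stadium"),
])

fields = ["text_desc", "text_desc_headline", "text_desc_headline_url"]
vectorizers = {
    "binary": CountVectorizer(binary=True),
    "counts": CountVectorizer(),
    "tfidf": TfidfVectorizer(),
}

rows = []
for field, (feature_type, vectorizer) in product(fields, vectorizers.items()):
    # clone() gives each run a fresh, unfitted vectorizer.
    model = make_pipeline(clone(vectorizer), LogisticRegression(solver="liblinear"))
    model.fit(train[field], train["category"])
    rows.append({
        "text_fields": field,
        "feature_type": feature_type,
        "accuracy": model.score(test[field], test["category"]),
    })

# Best-performing configurations first.
results = pd.DataFrame(rows).sort_values("accuracy", ascending=False)
print(results)
```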
Results of Various Models
In [21]:
Out[21]:
Save Model for Future Use
In [28]:
Use Loaded Model
In [44]:
Out[44]:
[['POLITICS', 'THE WORLDPOST']]
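The save and load cells are not preserved. A minimal sketch using Python's `pickle` module is given here, assuming the vectorizer and classifier are bundled in a single scikit-learn pipeline so that both are restored together. The file name and training data are made up:

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data for a small stand-in model.
train_texts = [
    "senate passes new budget bill",
    "president signs election law",
    "team wins the championship game",
    "striker scores in the final match",
]
train_labels = ["POLITICS", "POLITICS", "SPORTS", "SPORTS"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(solver="liblinear"))
model.fit(train_texts, train_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name

    # Save the fitted pipeline to disk.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back, as a later session would.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

query = ["senate votes on the budget"]
same = list(loaded_model.predict(query)) == list(model.predict(query))
print(same)
```

Pickle files should only be loaded from trusted sources, and a pickled model is best reloaded with the same scikit-learn version that created it.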