GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_08/code/LogisticRegression-BankMarketing-Lab.ipynb

Kernel: Python 2

Logistic Regression Lab

Exercise with bank marketing data

Authors: Sam Stack (DC)

Introduction

Data from the UCI Machine Learning Repository: data, data dictionary
Goal: Predict whether a customer will purchase a bank product marketed over the phone
bank-additional.csv is already in our repo, so there is no need to download the data from the UCI website

Step 1: Read the data into Pandas

In [1]:

import pandas as pd
bank = pd.read_csv('data/bank.csv')
bank.head()

Out[1]:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-1-842c02c7d0ac> in <module>()
      1 import pandas as pd
----> 2 bank = pd.read_csv('data/bank.csv')
      3 bank.head()
/Users/ricky.hennessy/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    653                     skip_blank_lines=skip_blank_lines)
    654 
--> 655         return _read(filepath_or_buffer, kwds)
    656 
    657     parser_f.__name__ = name
/Users/ricky.hennessy/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    403 
    404     # Create the parser.
--> 405     parser = TextFileReader(filepath_or_buffer, **kwds)
    406 
    407     if chunksize or iterator:
/Users/ricky.hennessy/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    762             self.options['has_index_names'] = kwds['has_index_names']
    763 
--> 764         self._make_engine(self.engine)
    765 
    766     def close(self):
/Users/ricky.hennessy/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    983     def _make_engine(self, engine='c'):
    984         if engine == 'c':
--> 985             self._engine = CParserWrapper(self.f, **self.options)
    986         else:
    987             if engine == 'python':
/Users/ricky.hennessy/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1603         kwds['allow_leading_cols'] = self.index_col is not False
   1604 
-> 1605         self._reader = parsers.TextReader(src, **kwds)
   1606 
   1607         # XXX
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__ (pandas/_libs/parsers.c:4209)()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source (pandas/_libs/parsers.c:8873)()
FileNotFoundError: File b'data/bank.csv' does not exist

Target 'y' is represented as follows:
- No : 0
- Yes : 1

In [ ]:

# Perform whatever steps you need to familiarize yourself with the data:
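As one possible first look, the sketch below checks the class balance of the target. It runs on a tiny made-up frame rather than the bank data, and it assumes (hypothetically) that `y` arrives as "yes"/"no" strings; skip the mapping if the column is already 0/1.

```python
import pandas as pd

# Toy stand-in for the bank data; values are invented for illustration.
toy = pd.DataFrame({
    "age": [30, 45, 52, 38],
    "y": ["no", "no", "yes", "no"],
})

# Encode the target as described above: No -> 0, Yes -> 1.
toy["y"] = toy["y"].map({"no": 0, "yes": 1})

# Share of each class; a heavily skewed split affects which metrics are useful.
class_share = toy["y"].value_counts(normalize=True)
print(class_share)
```

On the real data, `bank.info()` and `bank.describe()` are common next steps for checking dtypes and value ranges.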

Step 2: Prepare at least three features

Include both numeric and categorical features
Choose features that you think might be related to the response (based on intuition or exploration)
Think about how to handle missing values (encoded as "unknown")
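A minimal sketch of two common ways to handle the "unknown" placeholder, shown on a toy frame. The column names and values here are illustrative assumptions, not taken from the bank file.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and two categorical features (hypothetical values).
toy = pd.DataFrame({
    "age": [34, 51, 29, 46],
    "job": ["admin.", "unknown", "technician", "admin."],
    "default": ["no", "unknown", "no", "yes"],
})

# Option 1: keep "unknown" as its own category and one-hot encode it.
# drop_first=True drops one level per feature to avoid redundant columns.
encoded = pd.get_dummies(toy, columns=["job", "default"], drop_first=True)

# Option 2: treat "unknown" as missing and drop the affected rows.
dropped = toy.replace("unknown", np.nan).dropna()

print(encoded.columns.tolist())
print(len(dropped))
```

Option 1 keeps every row and lets the model learn whether "unknown" itself carries signal; option 2 is simpler but discards data.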

In [ ]:

# A:

Step 3: Model building

Use cross-validation to evaluate the logistic regression model with your chosen features. You can evaluate with any metric or combination of metrics.
Try to improve the metrics by selecting different sets of features.
- Bonus: Experiment with hyperparameters such as regularization.
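The sketch below shows one way to cross-validate a logistic regression with several metrics. It uses synthetic, imbalanced data from `make_classification` as a stand-in for the prepared bank features; the metric choice and `cv=5` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared features (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=0
)

model = LogisticRegression(max_iter=1000)

# Mean 5-fold score for each metric.
cv_results = {}
for metric in ["accuracy", "recall", "roc_auc"]:
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring=metric)
    cv_results[metric] = scores.mean()

print(cv_results)
```

With an imbalanced target, accuracy alone can look high while recall stays low, so comparing several metrics is usually more informative.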

In [ ]:

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

from sklearn import metrics

Build a Model

In [ ]:

# convert selected features to dummies

# set the model

# set x and y

# train/test split

# fit model

Get the coefficient for each feature.

Be sure to make note of interesting findings.
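One way to line coefficients up with feature names is sketched below. The data is synthetic and the names `feat_a` and `feat_b` are hypothetical; on the lab data you would use your own feature frame.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data: the target depends on feat_a, while feat_b is pure noise.
rng = np.random.RandomState(0)
X_toy = pd.DataFrame({
    "feat_a": rng.normal(size=200),
    "feat_b": rng.normal(size=200),
})
y_toy = (X_toy["feat_a"] + 0.1 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# coef_ has shape (1, n_features) for a binary problem; pair it with column names.
coefs = pd.Series(clf.coef_[0], index=X_toy.columns).sort_values()

# Exponentiating a coefficient gives the odds ratio for a one-unit increase.
odds_ratios = np.exp(coefs)
print(coefs)
print(odds_ratios)
```

Coefficients are only comparable across features when the features are on similar scales.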

Use the model to predict on x_test and evaluate the model using the metric(s) of your choice.
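A sketch of a hold-out evaluation, again on synthetic data rather than the bank file. The split size, the stratified split, and the chosen metrics are assumptions, not requirements of the lab.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the prepared features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, weights=[0.85, 0.15], random_state=42
)

# stratify keeps the class ratio the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
auc = metrics.roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

print(cm)
print(recall, auc)
```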

In [ ]:

# A:

Model 2: Use a different combination of features.

Evaluate the model and interpret your chosen metrics.

In [ ]:

# A:

Is your model not performing very well?

Is it not predicting any True Positives?

Let's try one more thing before we resort to grabbing more features: adjusting the probability threshold.

Use the LogisticRegression.predict_proba() method to get the predicted probabilities.

Recall from the lesson that the first probability is for class 0 and the second is for class 1.
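The column layout of `predict_proba` can be checked on a tiny example. This is a sketch on made-up one-feature data, not the lab data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Six toy observations: larger x means class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# One row per sample, one column per class, ordered as in clf.classes_.
proba = clf.predict_proba(X)
print(clf.classes_)
print(proba.shape)

# Probability of the positive class (class 1) is the second column.
p_class1 = proba[:, 1]
```

Each row of `proba` sums to 1, so for a binary problem the second column alone carries all the information.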

In [ ]:

# A:

Visualize the distribution

In [ ]:

# A:

Calculate a new threshold and use it to convert predicted probabilities to output classes
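Converting probabilities to classes with a custom cutoff can be sketched as follows. The probabilities and the threshold of 0.3 are made-up values for illustration; a real threshold would come from inspecting the distribution of predicted probabilities.

```python
import numpy as np

# Hypothetical class-1 probabilities for five customers.
p_class1 = np.array([0.05, 0.20, 0.35, 0.60, 0.90])

# A 0.5 cutoff on class-1 probability, which is what predict() amounts to
# for a binary model.
y_pred_default = (p_class1 >= 0.5).astype(int)   # [0, 0, 0, 1, 1]

# A lower threshold labels more customers as positive.
threshold = 0.3  # illustrative value, not a recommendation
y_pred_custom = (p_class1 >= threshold).astype(int)  # [0, 0, 1, 1, 1]

print(y_pred_default, y_pred_custom)
```

Lowering the threshold typically raises recall at the cost of precision, so re-check both after changing it.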

In [ ]:

# A:

Evaluate the model metrics now

In [ ]:

# A:

Step 4: Build a model using all of the features.

Evaluate it using your preferred metrics.

In [ ]:

# A:

Bonus: Use Regularization to optimize your model.

In [ ]:

# try using a for loop to test various regularization strengths 'C'
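A sketch of such a loop, on synthetic data. The grid of `C` values, the `roc_auc` scoring, and `cv=5` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the full feature matrix.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=1)

# In scikit-learn, C is the inverse of regularization strength:
# smaller C means stronger regularization.
results = {}
for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=C, max_iter=1000)
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc")
    results[C] = scores.mean()

# Pick the C with the highest mean cross-validated score.
best_C = max(results, key=results.get)
print(results)
print(best_C)
```

`LogisticRegressionCV` or `GridSearchCV` do the same search with less code, but the explicit loop makes it easier to see how the score changes with `C`.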