Logistic Regression Lab
Exercise with bank marketing data
Authors: Sam Stack (DC)
Introduction
Data from the UCI Machine Learning Repository: data, data dictionary
Goal: Predict whether a customer will purchase a bank product marketed over the phone
bank-additional.csv is already in our repo, so there is no need to download the data from the UCI website.
Step 1: Read the data into Pandas
    import pandas as pd
    # File name taken from the introduction; the data/ directory is an assumption about the repo layout.
    bank = pd.read_csv('data/bank-additional.csv')
    bank.head()
**Target 'y' is represented as follows:**
- No: 0
- Yes: 1
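The encoding above can be applied with a simple mapping. This is a minimal sketch on a made-up frame; the column name `y` and the lowercase `'no'`/`'yes'` labels are assumptions about how the raw file stores the target.

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration).
toy = pd.DataFrame({"y": ["no", "yes", "no", "no", "yes"]})

# Map the string labels to integers: no -> 0, yes -> 1.
# Any label missing from the dict would become NaN, which makes typos easy to spot.
toy["y"] = toy["y"].map({"no": 0, "yes": 1})

print(toy["y"].tolist())
```

If the target in your copy of the data is already numeric, this step is unnecessary.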
Step 2: Prepare at least three features
Include both numeric and categorical features
Choose features that you think might be related to the response (based on intuition or exploration)
Think about how to handle missing values (encoded as "unknown")
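As a rough illustration of the points above, the sketch below prepares two numeric features and one categorical feature, and shows two possible ways of handling "unknown". The column names (`age`, `campaign`, `job`) and all values are assumptions chosen for the example, not taken from the lab's data.

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "age": [34, 51, 27, 45, 39, 60],
    "campaign": [1, 3, 2, 1, 5, 2],
    "job": ["admin.", "unknown", "technician", "admin.", "services", "unknown"],
})

# Option A: treat "unknown" as its own category and one-hot encode it.
# drop_first=True removes one dummy column per feature to avoid redundancy.
X_keep = pd.get_dummies(toy, columns=["job"], drop_first=True, dtype=int)

# Option B: drop the rows where the categorical value is "unknown".
known = toy[toy["job"] != "unknown"]
X_drop = pd.get_dummies(known, columns=["job"], drop_first=True, dtype=int)

print(X_keep.columns.tolist())
print(X_drop.shape)
```

Option A keeps every row at the cost of an extra column; Option B shrinks the dataset, which may matter if "unknown" is common.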
Step 3: Model building
Use cross-validation to evaluate the logistic regression model with your chosen features. You can use any (combination) of the following metrics to evaluate.
Try to increase the metrics by selecting different sets of features
Bonus: Experiment with hyperparameters such as regularization.
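One way to approach this step is sketched below, using scikit-learn's `cross_val_score` with AUC as the metric and a few values of `C` (the inverse regularization strength). The data is synthetic, and the choice of metric, fold count, and `C` values are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared feature matrix and target.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Smaller C means stronger regularization.
results = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=1000)
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    results[C] = scores.mean()

for C, mean_auc in results.items():
    print(C, round(mean_auc, 3))
```

Swapping `scoring` for another metric name such as `"accuracy"` or `"neg_log_loss"` lets you compare feature sets on whichever metric you chose.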
Build a Model
Get the coefficient for each feature.
Be sure to make note of interesting findings.
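A minimal sketch of reading the coefficients off a fitted model follows. The feature names (`feat_a`, `feat_b`, `feat_c`) and the data are hypothetical; only `coef_` and its pairing with the column names carry over to the lab.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data: feat_a pushes toward class 1, feat_b toward class 0, feat_c is noise.
rng = np.random.RandomState(1)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["feat_a", "feat_b", "feat_c"])
y = (2.0 * X["feat_a"] - 1.0 * X["feat_b"] + rng.normal(scale=0.5, size=300) > 0).astype(int)

logreg = LogisticRegression(max_iter=1000).fit(X, y)

# For a binary problem coef_ has shape (1, n_features); pair each value with its column.
coefs = pd.Series(logreg.coef_[0], index=X.columns)
print(coefs)

# Exponentiating a coefficient gives the multiplicative change in the odds
# of class 1 for a one-unit increase in that feature.
print(np.exp(coefs))
```

A positive coefficient raises the predicted log-odds of class 1 and a negative one lowers it; magnitudes are only comparable across features that are on similar scales.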
Use the model to predict on x_test and evaluate it using your metric(s) of choice.
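The sketch below shows one possible shape for this step: hold out a test set, predict on `x_test`, and compute accuracy and a confusion matrix. The data is synthetic, and the split size and metrics are assumptions rather than requirements of the lab.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target.
rng = np.random.RandomState(2)
X = rng.normal(size=(400, 3))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# stratify=y keeps the class proportions the same in both splits.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

logreg = LogisticRegression(max_iter=1000).fit(x_train, y_train)
y_pred = logreg.predict(x_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted class

print(acc)
print(cm)
```

With scikit-learn's convention, `cm[1, 1]` counts true positives and `cm[1, 0]` counts false negatives, which is useful when checking whether the model predicts any positives at all.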
Model 2: Use a different combination of features.
Evaluate the model and interpret your chosen metrics.
Is your model not performing very well?
Is it not predicting any True Positives?
Let's try one more thing before we resort to grabbing more features: adjusting the probability threshold.
Use the LogisticRegression.predict_proba() method to get the probabilities.
Recall from the lesson that the first probability is for class 0 and the second is for class 1.
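To make the column order concrete, here is a small sketch on synthetic data. The order of the probability columns follows `classes_`, so checking that attribute is a safe way to confirm which column is class 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the lab data.
rng = np.random.RandomState(5)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

logreg = LogisticRegression(max_iter=1000).fit(X, y)

# One row per sample, one column per class, in the order given by classes_.
proba = logreg.predict_proba(X)
print(logreg.classes_)
print(proba[:3])

# Probability of class 1 only.
p_class1 = proba[:, 1]
```

Each row sums to 1, so the class 1 column alone carries all the information needed for thresholding.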
Visualize the distribution
**Calculate a new threshold and use it to convert predicted probabilities to output classes**
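A sketch of this step on synthetic, imbalanced data follows. Using the observed positive rate as the new threshold is only one simple heuristic chosen for illustration; the lab does not prescribe how the threshold should be calculated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced data: roughly one sample in ten is positive.
rng = np.random.RandomState(3)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + rng.normal(scale=1.0, size=1000) > 1.8).astype(int)

logreg = LogisticRegression(max_iter=1000).fit(X, y)
p_class1 = logreg.predict_proba(X)[:, 1]

# Default behaviour: predict class 1 when its probability reaches 0.5.
pred_default = (p_class1 >= 0.5).astype(int)

# Hypothetical alternative: use the share of positives as the cut-off.
threshold = y.mean()
pred_new = (p_class1 >= threshold).astype(int)

recall_default = recall_score(y, pred_default)
recall_new = recall_score(y, pred_new)
print(threshold, recall_default, recall_new)
```

Lowering the threshold can only add predicted positives, so recall cannot fall, but precision usually does; check both before settling on a value.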
Evaluate the model metrics now
Step 4: Build a model using all of the features.
Evaluate it using your preferred metrics.
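One possible approach to an all-features model is to one-hot encode every categorical column and pass the full matrix to the classifier. The sketch below does this on a made-up frame; the column names and values are assumptions, and cross-validated AUC stands in for whichever metric you prefer.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-in for the full dataset (values are made up for illustration).
rng = np.random.RandomState(4)
n = 300
toy = pd.DataFrame({
    "age": rng.randint(18, 80, size=n),
    "campaign": rng.randint(1, 10, size=n),
    "job": rng.choice(["admin.", "services", "unknown"], size=n),
    "contact": rng.choice(["cellular", "telephone"], size=n),
})
# Synthetic target loosely tied to the contact type.
toy["y"] = ((toy["contact"] == "cellular") & (rng.rand(n) < 0.6)).astype(int)

# Encode every categorical column; numeric columns pass through unchanged.
X_all = pd.get_dummies(
    toy.drop(columns="y"), columns=["job", "contact"], drop_first=True, dtype=int
)
y = toy["y"]

model = LogisticRegression(max_iter=2000)
auc = cross_val_score(model, X_all, y, cv=5, scoring="roc_auc").mean()
print(X_all.shape, round(auc, 3))
```

On the real data, scaling the numeric columns first may help the solver converge and makes the coefficients easier to compare.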