GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_08/code/LogisticRegression-BankMarketing-Lab.ipynb
Kernel: Python 2

Logistic Regression Lab

Exercise with bank marketing data

Authors: Sam Stack (DC)

Introduction

  • Data from the UCI Machine Learning Repository: data, data dictionary

  • Goal: Predict whether a customer will purchase a bank product marketed over the phone

  • bank-additional.csv is already in our repo, so there is no need to download the data from the UCI website

Step 1: Read the data into Pandas

import pandas as pd
bank = pd.read_csv('data/bank.csv')
bank.head()
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-1-842c02c7d0ac> in <module>()
      1 import pandas as pd
----> 2 bank = pd.read_csv('data/bank.csv')
      3 bank.head()
...
FileNotFoundError: File b'data/bank.csv' does not exist

Target 'y' is encoded as follows:

  • No: 0

  • Yes: 1

# Perform whatever steps you need to familiarize yourself with the data:

Step 2: Prepare at least three features

  • Include both numeric and categorical features

  • Choose features that you think might be related to the response (based on intuition or exploration)

  • Think about how to handle missing values (encoded as "unknown")
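One way to approach the "unknown" values is sketched below on a small made-up DataFrame. The column names (`age`, `job`, `housing`) and values are illustrative assumptions, not taken from the real file. The sketch compares two options: keeping "unknown" as its own dummy category, or treating it as missing and dropping those rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bank data; column names and values are illustrative only.
toy = pd.DataFrame({
    "age": [30, 45, 52, 23, 38],
    "job": ["admin.", "unknown", "technician", "student", "admin."],
    "housing": ["yes", "no", "unknown", "yes", "no"],
})

# How many "unknown" entries does each categorical column have?
unknown_counts = (toy[["job", "housing"]] == "unknown").sum()
print(unknown_counts)

# Option 1: keep "unknown" as its own category when creating dummies.
dummies_keep = pd.get_dummies(toy, columns=["job", "housing"], drop_first=True)

# Option 2: treat "unknown" as missing and drop those rows before encoding.
cleaned = toy.replace("unknown", np.nan).dropna()
dummies_drop = pd.get_dummies(cleaned, columns=["job", "housing"], drop_first=True)

print(dummies_keep.columns.tolist())
print(len(cleaned), "rows remain after dropping unknowns")
```

Which option is better depends on how many rows are affected: dropping is simple but discards data, while a separate "unknown" dummy keeps every row and lets the model decide whether the category matters.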

# A:

Step 3: Model building

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

Build a Model

# convert selected features to dummies
# set the model
# set x and y
# train test split
# fit model
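The steps listed above could look roughly like the following sketch. It runs on synthetic data generated in the block itself, since the real file is not loaded here; the column names (`age`, `contact`, `housing`) and the made-up target are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the real bank data); column names are illustrative.
rng = np.random.RandomState(0)
n = 200
toy = pd.DataFrame({
    "age": rng.randint(18, 70, size=n),
    "contact": rng.choice(["cellular", "telephone"], size=n),
    "housing": rng.choice(["yes", "no", "unknown"], size=n),
})
# Made-up target loosely tied to "contact" so the model has something to learn.
toy["y"] = ((toy["contact"] == "cellular") & (rng.rand(n) < 0.7)).astype(int)

# Convert selected features to dummies and set X and y.
X = pd.get_dummies(toy[["age", "contact", "housing"]], drop_first=True, dtype=int)
y = toy["y"]

# Train/test split; stratify keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Set and fit the model.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

print("test accuracy:", logreg.score(X_test, y_test))
```

With the real data, `toy` would be replaced by the loaded DataFrame and the feature list by the columns chosen in Step 2.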

Get the Coefficient for each feature.

  • Be sure to make note of interesting findings.
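A minimal sketch of reading coefficients, using a tiny synthetic dataset (the features `x1` and `x2` are hypothetical). Each coefficient is the change in the log-odds of class 1 per unit increase in that feature, so exponentiating it gives an odds ratio.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny synthetic example (not the bank data): y depends positively on x1 only.
rng = np.random.RandomState(1)
X = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
y = (X["x1"] + 0.5 * rng.normal(size=300) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# coef_ has shape (1, n_features) for a binary problem; pair it with column names.
coefs = pd.DataFrame({
    "feature": X.columns,
    "coefficient": model.coef_[0],
    "odds_ratio": np.exp(model.coef_[0]),
})
print(coefs.sort_values("coefficient", ascending=False))
```

Coefficients are only comparable across features when the features are on similar scales, which is worth keeping in mind when mixing numeric columns with 0/1 dummies.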

Use the model to predict on x_test and evaluate it using the metric(s) of your choice.
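As a rough illustration of why the choice of metric matters, the sketch below computes several metrics from `sklearn.metrics` on hand-written labels (made up for this example, not model output). Accuracy looks acceptable while recall shows that most positives are missed.

```python
from sklearn import metrics

# Hand-written labels for illustration only (1 = "yes", 0 = "no").
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

accuracy = metrics.accuracy_score(y_true, y_pred)
precision = metrics.precision_score(y_true, y_pred)
recall = metrics.recall_score(y_true, y_pred)
cm = metrics.confusion_matrix(y_true, y_pred)  # rows: actual, columns: predicted

print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)
print(cm)
```

In the lab, `y_true` would be `y_test` and `y_pred` the output of the model's `predict` on `x_test`.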

# A:

Model 2: Use a different combination of features.

  • Evaluate the model and interpret your chosen metrics.

# A:

Is your model not performing very well?

Is it not predicting any True Positives?

Let's try one more thing before we resort to grabbing more features: adjusting the probability threshold.

Use the LogisticRegression.predict_proba() method to get the probabilities.

Recall from the lesson that the first probability is for class 0 and the second is for class 1.
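A small sketch of the column layout of `predict_proba`, fitted on a tiny made-up dataset. The `classes_` attribute gives the column order, so column 1 holds the probability of class 1 when the labels are 0 and 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature, labels 0/1.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X)  # shape (n_samples, 2); each row sums to 1
print(model.classes_)           # column order of proba
prob_yes = proba[:, 1]          # probability of class 1 for each row
print(prob_yes.round(3))
```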

# A:

Visualize the distribution
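One possible way to look at the distribution is a histogram of the class-1 probabilities. The sketch below uses made-up probabilities drawn from a beta distribution to mimic an imbalanced target; in the lab these would come from `predict_proba`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up probabilities skewed toward 0, mimicking an imbalanced target.
rng = np.random.RandomState(0)
prob_yes = rng.beta(1, 6, size=500)

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(prob_yes, bins=20, range=(0, 1))
ax.axvline(0.5, color="red", linestyle="--", label="default threshold 0.5")
ax.set_xlabel("Predicted probability of class 1")
ax.set_ylabel("Number of observations")
ax.legend()
```

If nearly all of the mass sits to the left of the dashed line, the default threshold will rarely predict class 1, which matches the "no true positives" symptom described above.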

# A:

Calculate a new threshold and use it to convert predicted probabilities to output classes
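Converting probabilities to classes with a custom threshold can be sketched with plain NumPy. The probabilities and the threshold of 0.3 below are made-up values for illustration; the right threshold for the lab depends on the distribution you observed.

```python
import numpy as np

# Made-up class-1 probabilities, for illustration only.
prob_yes = np.array([0.05, 0.12, 0.28, 0.31, 0.47, 0.66])

# Default behaviour: class 1 only when the probability reaches 0.5.
default_pred = (prob_yes >= 0.5).astype(int)

# Hypothetical lower threshold; choose yours from the probability distribution.
threshold = 0.3
new_pred = (prob_yes >= threshold).astype(int)

print(default_pred)
print(new_pred)
```

Lowering the threshold typically raises recall at the cost of precision, so both should be re-checked after the change.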

# A:

Evaluate the model metrics now

# A:

Step 4: Build a model using all of the features.

  • Evaluate it using your preferred metrics.

# A:

Bonus: Use Regularization to optimize your model.

# try using a for loop to test various values of 'C' (the inverse of regularization strength)
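A minimal sketch of such a loop, run on synthetic data from `make_classification` rather than the bank data. In scikit-learn, `C` is the inverse of regularization strength, so smaller values shrink the coefficients more. The list of `C` values is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = {}     # C -> accuracy on the held-out split
coef_norms = {}  # C -> sum of absolute coefficient values
for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    results[C] = model.score(X_test, y_test)
    coef_norms[C] = abs(model.coef_).sum()
    print(C, round(results[C], 3), round(coef_norms[C], 3))

best_C = max(results, key=results.get)
print("best C on this split:", best_C)
```

Picking `C` on the same split used for the final evaluation gives an optimistic score; a separate validation set or cross-validation (for example `LogisticRegressionCV`) is a safer way to choose it.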