GitHub Repository: suyashi29/python-su
Path: blob/master/ML/Notebook/Credit card fraud detection.ipynb
Kernel: Python 3

Problem Statement:

The Credit Card Fraud Detection Problem includes modeling past credit card transactions with the knowledge of the ones that turned out to be fraud. This model is then used to identify whether a new transaction is fraudulent or not. Our aim here is to detect 100% of the fraudulent transactions while minimizing the incorrect fraud classifications.

Data Description

  • The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions.

  • The dataset is highly unbalanced: the positive class (frauds) accounts for about 0.173% of all transactions.

  • The dataset consists of numerical values from the 28 ‘Principal Component Analysis (PCA)’ transformed features, namely V1 to V28. Furthermore, there is no metadata about the original features provided, so pre-analysis or feature study could not be done.

  • The ‘Time’ and ‘Amount’ features are not transformed. The feature 'Amount' is the transaction amount; it can be used, for instance, for example-dependent cost-sensitive learning. The feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.

  • There are no missing values in the dataset.
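
The imbalance described above is the reason accuracy alone is a poor yardstick for this problem. As a quick arithmetic check, using only the counts quoted in the data description, here is how a trivial model that labels every transaction as non-fraud would score:

```python
# Counts taken from the data description above.
n_total = 284_807
n_fraud = 492
n_normal = n_total - n_fraud

# Share of the positive (fraud) class, in percent.
fraud_share = 100 * n_fraud / n_total

# A trivial "always predict non-fraud" baseline gets every normal
# transaction right and every fraud wrong.
baseline_accuracy = 100 * n_normal / n_total
baseline_recall = 0 / n_fraud

print(f"Fraud share:       {fraud_share:.3f}%")
print(f"Baseline accuracy: {baseline_accuracy:.3f}%")
print(f"Baseline recall:   {baseline_recall:.1f}")
```

The baseline reaches very high accuracy while catching no fraud at all, which is why sensitivity and precision matter more here.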


Important terms

  • True Positive: The fraud cases that the model predicted as ‘fraud.’

  • False Positive: The non-fraud cases that the model predicted as ‘fraud.’

  • True Negative: The non-fraud cases that the model predicted as ‘non-fraud.’

  • False Negative: The fraud cases that the model predicted as ‘non-fraud.’

  • Threshold Cutoff Probability: The predicted probability above which a transaction is labeled as fraud, chosen where the true positive rate and the true negative rate are jointly highest. This cutoff tends to be small, which is reasonable because fraud itself is rare.

  • Accuracy: The measure of correct predictions made by the model – that is, the ratio of fraud transactions classified as fraud and non-fraud classified as non-fraud to the total transactions in the test data.

  • Sensitivity: Sensitivity, or True Positive Rate, or Recall, is the ratio of correctly identified fraud cases to total fraud cases.

  • Specificity: Specificity, or True Negative Rate, is the ratio of correctly identified non-fraud cases to total non-fraud cases.

  • Precision: Precision is the ratio of correctly predicted fraud cases to total predicted fraud cases.
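
The ratio definitions above can be written directly in terms of the four confusion-matrix counts. The sketch below is only an illustration: the function name `classification_metrics` and the counts passed to it are hypothetical, not taken from the notebook.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics defined above from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # recall / true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),
    }


# Hypothetical counts, chosen only to illustrate the formulas.
metrics = classification_metrics(tp=40, fp=10, tn=940, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy is 0.98 even though one fraud in five is missed, which shows how the metrics can diverge.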

#importing packages
%matplotlib inline
import scipy.stats as stats
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
c=pd.read_csv('F:\\ML & Data Visualization\\credit.csv')
#shape
print('This data frame has {} rows and {} columns.'.format(c.shape[0], c.shape[1]))
This data frame has 284807 rows and 31 columns.
c.head(10)
c.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1        284807 non-null float64
V2        284807 non-null float64
V3        284807 non-null float64
V4        284807 non-null float64
V5        284807 non-null float64
V6        284807 non-null float64
V7        284807 non-null float64
V8        284807 non-null float64
V9        284807 non-null float64
V10       284807 non-null float64
V11       284807 non-null float64
V12       284807 non-null float64
V13       284807 non-null float64
V14       284807 non-null float64
V15       284807 non-null float64
V16       284807 non-null float64
V17       284807 non-null float64
V18       284807 non-null float64
V19       284807 non-null float64
V20       284807 non-null float64
V21       284807 non-null float64
V22       284807 non-null float64
V23       284807 non-null float64
V24       284807 non-null float64
V25       284807 non-null float64
V26       284807 non-null float64
V27       284807 non-null float64
V28       284807 non-null float64
Amount    284807 non-null float64
Class     284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
#numerical summary -> only non-anonymized columns of interest
#pd.set_option('precision', 3)
c.loc[:, ['Time', 'Amount']].describe()

#visualizations of time and amount
plt.figure(figsize=(10,8))
plt.title('Distribution of Time Feature')
sns.distplot(c.Time)
C:\Users\HP\Anaconda3\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result. return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
<matplotlib.axes._subplots.AxesSubplot at 0x232a17f86d8>
Image in a Jupyter notebook
plt.figure(figsize=(10,8))
plt.title('Distribution of Monetary Value Feature')
sns.distplot(c.Amount)
<matplotlib.axes._subplots.AxesSubplot at 0x232a17f84a8>
Image in a Jupyter notebook

Insight

Around 88 dollars is the mean of all credit card transactions in this data set. The biggest transaction had a monetary value of around 25,691 dollars.

#fraud vs. normal transactions
counts = c.Class.value_counts()
normal = counts[0]
print(normal)
fraudulent = counts[1]
print(fraudulent)
perc_normal = (normal/(normal+fraudulent))*100
perc_fraudulent = (fraudulent/(normal+fraudulent))*100
print('There were {} non-fraudulent transactions ({:.3f}%) and {} fraudulent transactions ({:.3f}%).'.format(normal, perc_normal, fraudulent, perc_fraudulent))

284315
492
0    284315
1       492
Name: Class, dtype: int64
There were 284315 non-fraudulent transactions (99.827%) and 492 fraudulent transactions (0.173%).
plt.figure(figsize=(8,6))
sns.barplot(x=counts.index, y=counts)
plt.title('Count of Fraudulent vs. Non-Fraudulent Transactions')
plt.ylabel('Count')
plt.xlabel('Class (0:Non-Fraudulent, 1:Fraudulent)')
print(counts.index)
Image in a Jupyter notebook
corr = c.corr()
corr

#heatmap
corr = c.corr()
plt.figure(figsize=(30,30))
heat = sns.heatmap(data=corr, annot=True)
plt.title('Heatmap of Correlation')
Text(0.5, 1.0, 'Heatmap of Correlation')
Image in a Jupyter notebook
#skewness
skew_ = c.skew()
skew_

Time      -0.036
V1        -3.281
V2        -4.625
V3        -2.240
V4         0.676
V5        -2.426
V6         1.827
V7         2.554
V8        -8.522
V9         0.555
V10        1.187
V11        0.357
V12       -2.278
V13        0.065
V14       -1.995
V15       -0.308
V16       -1.101
V17       -3.845
V18       -0.260
V19        0.109
V20       -2.037
V21        3.593
V22       -0.213
V23       -5.875
V24       -0.552
V25       -0.416
V26        0.577
V27       -1.170
V28       11.192
Amount    16.978
Class     23.998
dtype: float64
## Scaling Amount and Time
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler2 = StandardScaler()
#scaling time (ravel() flattens the (n, 1) array returned by fit_transform)
scaled_time = pd.Series(scaler.fit_transform(c[['Time']]).ravel())

#scaling the amount column
scaled_amount = pd.Series(scaler2.fit_transform(c[['Amount']]).ravel())
print(scaled_amount)
0 0.245 1 -0.342 2 1.161 3 0.141 4 -0.073 5 -0.339 6 -0.333 7 -0.190 8 0.019 9 -0.339 10 -0.322 11 -0.313 12 0.133 13 -0.243 14 -0.118 15 -0.289 16 -0.301 17 -0.350 18 -0.166 19 -0.333 20 0.573 21 -0.217 22 -0.344 23 -0.262 24 -0.350 25 -0.248 26 -0.186 27 -0.289 28 -0.221 29 -0.301 ... 284777 -0.349 284778 -0.033 284779 -0.253 284780 -0.233 284781 -0.301 284782 -0.302 284783 -0.307 284784 -0.193 284785 -0.346 284786 -0.317 284787 -0.313 284788 -0.337 284789 -0.111 284790 -0.314 284791 -0.272 284792 -0.337 284793 -0.333 284794 -0.350 284795 -0.314 284796 -0.113 284797 -0.331 284798 -0.257 284799 -0.033 284800 -0.343 284801 -0.342 284802 -0.350 284803 -0.254 284804 -0.082 284805 -0.313 284806 0.514 Length: 284807, dtype: float64
#concatenating newly created columns w original df
c = pd.concat([c, scaled_amount.rename('scaled_amount'), scaled_time.rename('scaled_time')], axis=1)
c.sample(5)

#dropping old amount and time columns
c.drop(['Amount', 'Time'], axis=1, inplace=True)
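
The scaling step above relies on StandardScaler, which standardizes a column by subtracting its mean and dividing by its standard deviation. A minimal NumPy sketch of the same arithmetic follows; the `amounts` array holds a few illustrative values, not the real column.

```python
import numpy as np

# Illustrative transaction amounts (an assumption, not the real column).
amounts = np.array([149.62, 2.69, 378.66, 123.50, 69.99])

# Z-score: subtract the mean, divide by the population standard
# deviation (ddof=0), which is the convention StandardScaler follows.
z = (amounts - amounts.mean()) / amounts.std()

print(z.round(3))
print("mean:", round(float(z.mean()), 6), "std:", round(float(z.std()), 6))
```

After the transformation the values have mean 0 and standard deviation 1, so 'Amount' and 'Time' end up on a scale comparable to the PCA features.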
#manual train test split using numpy's random.rand
mask = np.random.rand(len(c)) < 0.9
train = c[mask]
test = c[~mask]
print('Train Shape: {}\nTest Shape: {}'.format(train.shape, test.shape))
Train Shape: (256486, 31) Test Shape: (28321, 31)
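
The random mask above gives roughly a 90/10 split, but it does not control how many of the rare fraud cases land in each part. One alternative, which is not what this notebook does, is a stratified split with scikit-learn's `train_test_split`. The sketch below uses synthetic data as a stand-in for the real frame.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 10,000 rows, 20 of them fraud.
X = rng.normal(size=(10_000, 3))
y = np.zeros(10_000, dtype=int)
y[:20] = 1

# stratify=y keeps the fraud share the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42
)

print("Frauds in train:", int(y_train.sum()))
print("Frauds in test: ", int(y_test.sum()))
```

Fixing `random_state` also makes the split reproducible, which the `np.random.rand` mask is not unless a seed is set beforehand.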
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)

# Creating a subsample data set with balanced class distributions
#how many random samples from normal transactions do we need?
no_of_frauds = train.Class.value_counts()[1]
print('There are {} fraudulent transactions in the train data.'.format(no_of_frauds))
There are 445 fraudulent transactions in the train data.
#randomly selecting as many non-fraudulent transactions as there are frauds in the train data
non_fraud = train[train['Class'] == 0]
fraud = train[train['Class'] == 1]

selected = non_fraud.sample(no_of_frauds)
selected.head()

#concatenating both into a subsample data set with equal class distribution
selected.reset_index(drop=True, inplace=True)
fraud.reset_index(drop=True, inplace=True)
subsample = pd.concat([selected, fraud])
len(subsample)
890
#shuffling our data set
subsample = subsample.sample(frac=1).reset_index(drop=True)
subsample.head(10)

new_counts = subsample.Class.value_counts()
plt.figure(figsize=(8,6))
sns.barplot(x=new_counts.index, y=new_counts)
plt.title('Count of Fraudulent vs. Non-Fraudulent Transactions In Subsample')
plt.ylabel('Count')
plt.xlabel('Class (0:Non-Fraudulent, 1:Fraudulent)')
Text(0.5, 0, 'Class (0:Non-Fraudulent, 1:Fraudulent)')
Image in a Jupyter notebook
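
The undersampling steps above can be folded into one reusable, seeded helper. This is a sketch under assumptions: the function name `undersample`, its parameters, and the toy frame are hypothetical and only mirror the idea of sampling each class down to the size of the smallest one.

```python
import pandas as pd


def undersample(df, label_col="Class", seed=0):
    """Return a shuffled frame with equal numbers of rows per class."""
    n_minority = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_minority, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    # Concatenate the per-class samples, then shuffle the rows.
    balanced = pd.concat(parts).sample(frac=1, random_state=seed)
    return balanced.reset_index(drop=True)


# Toy frame: 5 frauds among 100 rows.
toy = pd.DataFrame({"V1": range(100), "Class": [1] * 5 + [0] * 95})
balanced = undersample(toy)
print(balanced["Class"].value_counts())
```

Passing a seed makes the subsample reproducible between runs, at the cost of always discarding the same majority-class rows.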
#taking a look at correlations once more
corr = subsample.corr()
corr = corr[['Class']]
corr

#negative correlations smaller than -0.5
corr[corr.Class < -0.5]

#positive correlations greater than 0.5
corr[corr.Class > 0.5]

#visualizing the features w high negative correlation
f, axes = plt.subplots(nrows=2, ncols=4, figsize=(26,16))
f.suptitle('Features With High Negative Correlation', size=35)
sns.boxplot(x="Class", y="V3", data=subsample, ax=axes[0,0])
sns.boxplot(x="Class", y="V9", data=subsample, ax=axes[0,1])
sns.boxplot(x="Class", y="V10", data=subsample, ax=axes[0,2])
sns.boxplot(x="Class", y="V12", data=subsample, ax=axes[0,3])
sns.boxplot(x="Class", y="V14", data=subsample, ax=axes[1,0])
sns.boxplot(x="Class", y="V16", data=subsample, ax=axes[1,1])
sns.boxplot(x="Class", y="V17", data=subsample, ax=axes[1,2])
f.delaxes(axes[1,3])
Image in a Jupyter notebook
#visualizing the features w high positive correlation
f, axes = plt.subplots(nrows=1, ncols=2, figsize=(18,9))
f.suptitle('Features With High Positive Correlation', size=20)
sns.boxplot(x="Class", y="V4", data=subsample, ax=axes[0])
sns.boxplot(x="Class", y="V11", data=subsample, ax=axes[1])
<matplotlib.axes._subplots.AxesSubplot at 0x232a32628d0>
Image in a Jupyter notebook
#Only removing extreme outliers
Q1 = subsample.quantile(0.25)
Q3 = subsample.quantile(0.75)
IQR = Q3 - Q1
c2 = subsample[~((subsample < (Q1 - 2.5 * IQR)) | (subsample > (Q3 + 2.5 * IQR))).any(axis=1)]
len_after = len(c2)
len_before = len(subsample)
len_difference = len(subsample) - len(c2)
print('We reduced our data size from {} transactions by {} transactions to {} transactions.'.format(len_before, len_difference, len_after))
We reduced our data size from 890 transactions by 254 transactions to 636 transactions.
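
To make the outlier rule above concrete, here is the same interquartile-range filter applied to a tiny toy frame. The helper name `remove_extreme_outliers` and the toy values are assumptions for illustration; the multiplier 2.5 matches the one used above.

```python
import pandas as pd


def remove_extreme_outliers(df, k=2.5):
    """Drop rows where any column lies more than k * IQR outside Q1..Q3."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    outside = (df < (q1 - k * iqr)) | (df > (q3 + k * iqr))
    return df[~outside.any(axis=1)]


# Toy data: the value 100 in column "a" is far outside the bulk.
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5, 100], "b": [10, 11, 12, 13, 14, 15]})
cleaned = remove_extreme_outliers(toy)
print(cleaned)
```

For column "a" the quartiles are 2.25 and 4.75, so the upper fence is 4.75 + 2.5 × 2.5 = 11 and only the row containing 100 is dropped. Because a row is removed when any single column is extreme, the filter can discard a large share of rows when there are many columns, as the 254 removed transactions above suggest.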
## Dimensionality Reduction
from sklearn.manifold import TSNE

X = c2.drop('Class', axis=1)
y = c2['Class']

#t-SNE
X_reduced_tsne = TSNE(n_components=2, random_state=42).fit_transform(X.values)
# t-SNE scatter plot
import matplotlib.patches as mpatches

f, ax = plt.subplots(figsize=(24,16))
blue_patch = mpatches.Patch(color='#0A0AFF', label='No Fraud')
red_patch = mpatches.Patch(color='#AF0000', label='Fraud')
# one scatter call is enough: with 'coolwarm', Class 0 (no fraud) is drawn in blue and Class 1 (fraud) in red
ax.scatter(X_reduced_tsne[:,0], X_reduced_tsne[:,1], c=y, cmap='coolwarm', linewidths=2)
ax.set_title('t-SNE', fontsize=14)
ax.grid(True)
ax.legend(handles=[blue_patch, red_patch])
<matplotlib.legend.Legend at 0x232a4a771d0>
Image in a Jupyter notebook
## Classification Algorithms
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train = X_train.values
X_validation = X_test.values
y_train = y_train.values
y_validation = y_test.values
print('X_shapes:\n', 'X_train:', 'X_validation:\n', X_train.shape, X_validation.shape, '\n')
print('Y_shapes:\n', 'Y_train:', 'Y_validation:\n', y_train.shape, y_validation.shape)
X_shapes: X_train: X_validation: (508, 30) (128, 30) Y_shapes: Y_train: Y_validation: (508,) (128,)
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
#from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
##Spot-Checking Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('SVM', SVC()))
#models.append(('XGB', XGBClassifier()))
models.append(('RF', RandomForestClassifier()))

#testing models
results = []
names = []
for name, model in models:
    #random_state is omitted: it has no effect without shuffle=True, and recent scikit-learn versions raise an error if it is set
    kfold = KFold(n_splits=10)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='roc_auc')
    results.append(cv_results)
    names.append(name)
    msg = '%s: %f (%f)' % (name, cv_results.mean(), cv_results.std())
    print(msg)

LR: 0.961843 (0.028858)
LDA: 0.954895 (0.030288)
KNN: 0.952829 (0.031013)
CART: 0.887152 (0.027587)
SVM: 0.958068 (0.030751)
RF: 0.945845 (0.028972)
#Compare Algorithms
#create the axes first and draw on them, so the tick labels land on the same axes as the boxplot
fig, ax = plt.subplots(figsize=(12,10))
ax.boxplot(results)
ax.set_xticklabels(names)
ax.set_title('Comparison of Classification Algorithms')
ax.set_xlabel('Algorithm')
ax.set_ylabel('ROC-AUC Score')
plt.show()
Image in a Jupyter notebook
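
The comparison above stops at cross-validated ROC-AUC. Since the stated aim is to catch as many frauds as possible, a natural follow-up is to fit one candidate and inspect its confusion matrix and sensitivity on the held-out validation split. The sketch below does this on synthetic data from `make_classification`, used here as a stand-in for the balanced subsample; logistic regression is picked only because it scored highest above, and `max_iter=1000` is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, roughly balanced data standing in for the subsample.
X, y = make_classification(n_samples=600, n_features=30, random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_validation)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_validation, y_pred)
tn, fp, fn, tp = matrix.ravel()
recall = recall_score(y_validation, y_pred)

print(matrix)
print(f"Sensitivity (recall): {recall:.3f}")
```

On the real data, the same check could also be run against the untouched `test` frame, whose class balance reflects the original distribution rather than the undersampled one.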