GitHub Repository: suyashi29/python-su
Path: blob/master/ML/Notebook/Credit card fraud detection.ipynb

Kernel: Python 3

Problem Statement:

The credit card fraud detection problem involves modeling past credit card transactions, using the knowledge of which ones turned out to be fraudulent. The model is then used to decide whether a new transaction is fraudulent or not. Our aim here is to detect 100% of the fraudulent transactions while minimizing the number of legitimate transactions incorrectly classified as fraud (false positives).

Data Description

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions.
The dataset is highly unbalanced: the positive class (frauds) accounts for 0.173% of all transactions.
The dataset consists of numerical values from 28 Principal Component Analysis (PCA) transformed features, named V1 to V28. No metadata about the original features is provided, so no pre-analysis or feature study could be done.
The ‘Time’ and ‘Amount’ features are not transformed. 'Amount' is the transaction amount; it can be used, for instance, for example-dependent cost-sensitive learning. 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.
There are no missing values in the dataset.
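The imbalance can be checked with simple arithmetic. The sketch below uses only the two counts quoted in the description (the fraud share comes out at 0.173% when rounded to three decimals); the "always predict non-fraud" baseline is a hypothetical illustration, not a model from this notebook.

```python
# Counts quoted in the data description
total_transactions = 284_807
fraud_transactions = 492
normal_transactions = total_transactions - fraud_transactions

# Share of the positive (fraud) class, in percent
fraud_share = 100 * fraud_transactions / total_transactions

# Hypothetical baseline that labels every transaction as non-fraud:
# it is right on every normal transaction and misses every fraud
baseline_accuracy = 100 * normal_transactions / total_transactions
baseline_recall = 0 / fraud_transactions

print('Fraud share: {:.3f}%'.format(fraud_share))
print('Baseline accuracy: {:.3f}%'.format(baseline_accuracy))
print('Baseline recall: {:.1f}'.format(baseline_recall))
```

A baseline that never flags fraud would still score very high on accuracy, which is why accuracy alone says little on data this unbalanced.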

Important terms

True Positive: The fraud cases that the model predicted as ‘fraud.’
False Positive: The non-fraud cases that the model predicted as ‘fraud.’
True Negative: The non-fraud cases that the model predicted as ‘non-fraud.’
False Negative: The fraud cases that the model predicted as ‘non-fraud.’
Threshold Cutoff Probability: The predicted probability above which a transaction is classified as fraud, chosen so that the true positive rate and the true negative rate are both as high as possible. This cutoff tends to be small, which is reasonable because the probability of fraud is low.
Accuracy: The measure of correct predictions made by the model – that is, the ratio of fraud transactions classified as fraud and non-fraud classified as non-fraud to the total transactions in the test data.
Sensitivity: Sensitivity, or True Positive Rate, or Recall, is the ratio of correctly identified fraud cases to total fraud cases.
Specificity: Specificity, or True Negative Rate, is the ratio of correctly identified non-fraud cases to total non-fraud cases.
Precision: Precision is the ratio of correctly predicted fraud cases to total predicted fraud cases.
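The terms above can be tied together in a few lines. The following is a minimal sketch, assuming made-up labels and predicted fraud probabilities; `confusion_counts` and `summary_metrics` are hypothetical helper names, not functions used elsewhere in this notebook.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP, FP, TN, FN after applying a probability cutoff."""
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summary_metrics(tp, fp, tn, fn):
    """Metrics as defined in the list above (1 = fraud, 0 = non-fraud).

    Assumes at least one actual and one predicted case of each class,
    otherwise a denominator would be zero.
    """
    return {
        'accuracy': (tp + tn) / (tp + fp + tn + fn),
        'sensitivity': tp / (tp + fn),   # recall / true positive rate
        'specificity': tn / (tn + fp),   # true negative rate
        'precision': tp / (tp + fp),
    }


# Made-up labels and predicted fraud probabilities, for illustration only
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
toy_probs = [0.9, 0.6, 0.2, 0.7, 0.1, 0.1, 0.3, 0.2, 0.05, 0.4]

cm_counts = confusion_counts(toy_labels, toy_probs, threshold=0.5)
toy_metrics = summary_metrics(*cm_counts)
print(cm_counts)
print(toy_metrics)
```

Lowering `threshold` in this toy example catches the third fraud case (higher sensitivity) at the cost of more false positives (lower specificity and precision), which is the trade-off the threshold cutoff controls.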

In [1]:

#importing packages
%matplotlib inline
import scipy.stats as stats
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

In [2]:

c=pd.read_csv('F:\\ML & Data Visualization\\credit.csv')

In [3]:

#shape
print('This data frame has {} rows and {} columns.'.format(c.shape[0], c.shape[1]))

Out[3]:

This data frame has 284807 rows and 31 columns.

In [5]:

c.head(10)

Out[5]:

In [4]:

c.info()

Out[4]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1        284807 non-null float64
V2        284807 non-null float64
V3        284807 non-null float64
V4        284807 non-null float64
V5        284807 non-null float64
V6        284807 non-null float64
V7        284807 non-null float64
V8        284807 non-null float64
V9        284807 non-null float64
V10       284807 non-null float64
V11       284807 non-null float64
V12       284807 non-null float64
V13       284807 non-null float64
V14       284807 non-null float64
V15       284807 non-null float64
V16       284807 non-null float64
V17       284807 non-null float64
V18       284807 non-null float64
V19       284807 non-null float64
V20       284807 non-null float64
V21       284807 non-null float64
V22       284807 non-null float64
V23       284807 non-null float64
V24       284807 non-null float64
V25       284807 non-null float64
V26       284807 non-null float64
V27       284807 non-null float64
V28       284807 non-null float64
Amount    284807 non-null float64
Class     284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB

In [7]:

#numerical summary -> only non-anonymized columns of interest
#pd.set_option('display.precision', 3)
c.loc[:, ['Time', 'Amount']].describe()

Out[7]:

In [7]:

#visualizations of time and amount
plt.figure(figsize=(10,8))
plt.title('Distribution of Time Feature')
sns.distplot(c.Time)

Out[7]:

C:\Users\HP\Anaconda3\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

<matplotlib.axes._subplots.AxesSubplot at 0x232a17f86d8>

In [8]:

plt.figure(figsize=(10,8))
plt.title('Distribution of Monetary Value Feature')
sns.distplot(c.Amount)

Out[8]:

C:\Users\HP\Anaconda3\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

<matplotlib.axes._subplots.AxesSubplot at 0x232a17f84a8>

Insight

The mean of all credit card transactions in this data set is around 88 dollars. The biggest transaction had a monetary value of around 25,691 dollars.

In [10]:

#fraud vs. normal transactions 
counts = c.Class.value_counts()
normal = counts[0]
print(normal)
fraudulent = counts[1]
print(fraudulent)
perc_normal = (normal/(normal+fraudulent))*100
perc_fraudulent = (fraudulent/(normal+fraudulent))*100
print('There were {} non-fraudulent transactions ({:.3f}%) and {} fraudulent transactions ({:.3f}%).'.format(normal, perc_normal, fraudulent, perc_fraudulent))

Out[10]:

284315
492
0    284315
1       492
Name: Class, dtype: int64
There were 284315 non-fraudulent transactions (99.827%) and 492 fraudulent transactions (0.173%).

In [11]:

plt.figure(figsize=(8,6))
sns.barplot(x=counts.index, y=counts)
plt.title('Count of Fraudulent vs. Non-Fraudulent Transactions')
plt.ylabel('Count')
plt.xlabel('Class (0:Non-Fraudulent, 1:Fraudulent)')
print(counts.index)

Out[11]:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-11-4de0c76d0126> in <module>()
      4 plt.ylabel('Count')
      5 plt.xlabel('Class (0:Non-Fraudulent, 1:Fraudulent)')
----> 6 prints(counts.index)

NameError: name 'prints' is not defined

In [11]:

corr = c.corr()
corr

Out[11]:

In [13]:

#heatmap
corr = c.corr()
plt.figure(figsize=(30,30))
heat = sns.heatmap(data=corr,annot=True)
plt.title('Heatmap of Correlation')

Out[13]:

Text(0.5, 1.0, 'Heatmap of Correlation')

In [13]:

#skewness
skew_ = c.skew()
skew_

Out[13]:

Time      -0.036
V1        -3.281
V2        -4.625
V3        -2.240
V4         0.676
V5        -2.426
V6         1.827
V7         2.554
V8        -8.522
V9         0.555
V10        1.187
V11        0.357
V12       -2.278
V13        0.065
V14       -1.995
V15       -0.308
V16       -1.101
V17       -3.845
V18       -0.260
V19        0.109
V20       -2.037
V21        3.593
V22       -0.213
V23       -5.875
V24       -0.552
V25       -0.416
V26        0.577
V27       -1.170
V28       11.192
Amount    16.978
Class     23.998
dtype: float64
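To make the skewness values easier to read, here is a small sketch of the bias-corrected (adjusted Fisher-Pearson) sample skewness. It is assumed here to be the estimator behind `DataFrame.skew()`; the function name `sample_skewness` and the toy data are made up for illustration.

```python
import math


def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness: sqrt(n(n-1))/(n-2) * m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return math.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 50]   # one large value dominates the right tail

print(sample_skewness(symmetric))
print(sample_skewness(right_tailed))
```

A value near 0 indicates a roughly symmetric column, while a large positive value (as for Amount and Class) indicates a long right tail: most values are small and a few are very large.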

In [31]:

## Scaling Amount and Time

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler2 = StandardScaler()
#scaling time
scaled_time = scaler.fit_transform(c[['Time']])
flat_list1 = [item for sublist in scaled_time.tolist() for item in sublist]
scaled_time = pd.Series(flat_list1)

In [35]:

#scaling the amount column
scaled_amount = scaler2.fit_transform(c[['Amount']])
#flattening the (n, 1) array into a 1-D Series
scaled_amount = pd.Series(scaled_amount.ravel())
print(scaled_amount)

Out[35]:

0         0.245
1        -0.342
2         1.161
3         0.141
4        -0.073
5        -0.339
6        -0.333
7        -0.190
8         0.019
9        -0.339
10       -0.322
11       -0.313
12        0.133
13       -0.243
14       -0.118
15       -0.289
16       -0.301
17       -0.350
18       -0.166
19       -0.333
20        0.573
21       -0.217
22       -0.344
23       -0.262
24       -0.350
25       -0.248
26       -0.186
27       -0.289
28       -0.221
29       -0.301
          ...  
284777   -0.349
284778   -0.033
284779   -0.253
284780   -0.233
284781   -0.301
284782   -0.302
284783   -0.307
284784   -0.193
284785   -0.346
284786   -0.317
284787   -0.313
284788   -0.337
284789   -0.111
284790   -0.314
284791   -0.272
284792   -0.337
284793   -0.333
284794   -0.350
284795   -0.314
284796   -0.113
284797   -0.331
284798   -0.257
284799   -0.033
284800   -0.343
284801   -0.342
284802   -0.350
284803   -0.254
284804   -0.082
284805   -0.313
284806    0.514
Length: 284807, dtype: float64
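What StandardScaler computes for a single column can be sketched with the standard library: subtract the column mean and divide by the population standard deviation (ddof=0). The amounts below are made up, and `standardize` is a hypothetical helper name.

```python
import statistics


def standardize(values):
    """z = (x - mean) / std, using the population std (ddof=0)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


amounts = [2.0, 10.0, 50.0, 150.0, 1000.0]   # made-up values
scaled = standardize(amounts)
print([round(z, 3) for z in scaled])
```

After scaling, the column has mean 0 and standard deviation 1, but its shape is unchanged: a heavily skewed column stays heavily skewed, which is why most scaled amounts above are small negative numbers.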

In [36]:

#concatenating newly created columns w original df
c = pd.concat([c, scaled_amount.rename('scaled_amount'), scaled_time.rename('scaled_time')], axis=1)
c.sample(5)

Out[36]:

In [17]:

#dropping old amount and time columns
c.drop(['Amount', 'Time'], axis=1, inplace=True)

In [18]:

#manual train test split using numpy's random.rand
mask = np.random.rand(len(c)) < 0.9
train = c[mask]
test = c[~mask]
print('Train Shape: {}\nTest Shape: {}'.format(train.shape, test.shape))

Out[18]:

Train Shape: (256486, 31)
Test Shape: (28321, 31)
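The mask-based split above is unseeded, so the shapes change from run to run, and it does not guarantee that the rare fraud class is represented proportionally in both parts. As an alternative technique (not what this notebook does), a seeded, stratified split could look like the sketch below; `stratified_split` is a hypothetical helper and the labels are made up. scikit-learn's `train_test_split` offers the same idea through its `stratify` and `random_state` parameters.

```python
import random


def stratified_split(labels, test_fraction=0.1, seed=42):
    """Return (train_idx, test_idx) keeping the class ratio in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 20 fraud rows among 1,000, made up for illustration
toy_labels = [1] * 20 + [0] * 980
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
print(sum(toy_labels[i] for i in test_idx))
```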

In [19]:

train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)

In [20]:

# Creating a subsample data set with balanced class distributions

#how many random samples from normal transactions do we need?
no_of_frauds = train.Class.value_counts()[1]
print('There are {} fraudulent transactions in the train data.'.format(no_of_frauds))

Out[20]:

There are 445 fraudulent transactions in the train data.

In [21]:

#separating non-fraudulent and fraudulent transactions (an equal number of non-fraudulent ones is sampled next)
non_fraud = train[train['Class'] == 0]
fraud = train[train['Class'] == 1]

In [22]:

selected = non_fraud.sample(no_of_frauds)
selected.head()

Out[22]:

In [23]:

#concatenating both into a subsample data set with equal class distribution
selected.reset_index(drop=True, inplace=True)
fraud.reset_index(drop=True, inplace=True)
subsample = pd.concat([selected, fraud])
len(subsample)

Out[23]:

890
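The random undersampling carried out in these cells can be summarized as a single function. This is a minimal sketch on made-up rows, assuming fraud (class 1) is the minority; `undersample` is a hypothetical helper, and the fixed seed is an addition for reproducibility that the notebook itself does not use.

```python
import random


def undersample(rows, label_of, seed=42):
    """Keep every fraud row plus an equal-sized random draw of non-fraud rows."""
    rng = random.Random(seed)
    fraud = [r for r in rows if label_of(r) == 1]
    normal = [r for r in rows if label_of(r) == 0]
    selected = rng.sample(normal, len(fraud))   # sampling without replacement
    balanced = selected + fraud
    rng.shuffle(balanced)                       # mix the two classes
    return balanced


# Toy rows as (feature, class) tuples: 5 fraud rows among 100
toy_rows = [(i, 1 if i % 20 == 0 else 0) for i in range(100)]
balanced = undersample(toy_rows, label_of=lambda r: r[1])
print(len(balanced))
```

The price of the balanced subsample is that almost all non-fraud rows are discarded, so anything learned from it should be checked against data with the original class ratio.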

In [24]:

#shuffling our data set
subsample = subsample.sample(frac=1).reset_index(drop=True)
subsample.head(10)

Out[24]:

In [25]:

new_counts = subsample.Class.value_counts()
plt.figure(figsize=(8,6))
sns.barplot(x=new_counts.index, y=new_counts)
plt.title('Count of Fraudulent vs. Non-Fraudulent Transactions In Subsample')
plt.ylabel('Count')
plt.xlabel('Class (0:Non-Fraudulent, 1:Fraudulent)')

Out[25]:

Text(0.5, 0, 'Class (0:Non-Fraudulent, 1:Fraudulent)')

In [26]:

#taking a look at correlations once more
corr = subsample.corr()
corr = corr[['Class']]
corr

Out[26]:

In [27]:

#negative correlations smaller than -0.5
corr[corr.Class < -0.5]

Out[27]:

In [28]:

#positive correlations greater than 0.5
corr[corr.Class > 0.5]

Out[28]:
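The ±0.5 cutoffs are applied to the Pearson correlation between each feature and the binary Class column (Pearson is the default method of `DataFrame.corr()`). A minimal sketch of that quantity on a made-up balanced sample follows; `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up balanced toy sample: the feature tends to be lower for fraud (1)
toy_class = [0, 0, 0, 0, 1, 1, 1, 1]
toy_feature = [0.5, 0.1, -0.2, 0.3, -4.0, -6.5, -3.2, -5.1]

r = pearson_r(toy_feature, toy_class)
print(round(r, 3))
print(r < -0.5)
```

With a 0/1 target, a strongly negative coefficient means the feature's values are clearly lower for fraud than for non-fraud, and a strongly positive one means they are clearly higher.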

In [29]:

#visualizing the features w high negative correlation
f, axes = plt.subplots(nrows=2, ncols=4, figsize=(26,16))

f.suptitle('Features With High Negative Correlation', size=35)
sns.boxplot(x="Class", y="V3", data=subsample, ax=axes[0,0])
sns.boxplot(x="Class", y="V9", data=subsample, ax=axes[0,1])
sns.boxplot(x="Class", y="V10", data=subsample, ax=axes[0,2])
sns.boxplot(x="Class", y="V12", data=subsample, ax=axes[0,3])
sns.boxplot(x="Class", y="V14", data=subsample, ax=axes[1,0])
sns.boxplot(x="Class", y="V16", data=subsample, ax=axes[1,1])
sns.boxplot(x="Class", y="V17", data=subsample, ax=axes[1,2])
f.delaxes(axes[1,3])

Out[29]:

In [30]:

#visualizing the features w high positive correlation
f, axes = plt.subplots(nrows=1, ncols=2, figsize=(18,9))

f.suptitle('Features With High Positive Correlation', size=20)
sns.boxplot(x="Class", y="V4", data=subsample, ax=axes[0])
sns.boxplot(x="Class", y="V11", data=subsample, ax=axes[1])

Out[30]:

<matplotlib.axes._subplots.AxesSubplot at 0x232a32628d0>

In [33]:

#Only removing extreme outliers
Q1 = subsample.quantile(0.25)
Q3 = subsample.quantile(0.75)
IQR = Q3 - Q1

c2 = subsample[~((subsample < (Q1 - 2.5 * IQR)) |(subsample > (Q3 + 2.5 * IQR))).any(axis=1)]
len_after = len(c2)
len_before = len(subsample)
len_difference = len(subsample) - len(c2)
print('We reduced our data size from {} transactions by {} transactions to {} transactions.'.format(len_before, len_difference, len_after))

Out[33]:

We reduced our data size from 890 transactions by 254 transactions to 636 transactions.
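The fence rule used above can be illustrated on a single made-up column. The sketch assumes linear-interpolation quartiles (the `method='inclusive'` option of `statistics.quantiles`, which should correspond to the pandas default); `iqr_fences` is a hypothetical helper name.

```python
import statistics


def iqr_fences(values, k=2.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


toy_column = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # made-up column with one extreme value
low, high = iqr_fences(toy_column)
kept = [v for v in toy_column if low <= v <= high]
print(low, high)
print(kept)
```

In the notebook a row is dropped as soon as any one of its columns falls outside that column's fences, so with many columns the share of removed rows can be much larger than for a single column.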

In [34]:

## Dimensionality Reduction

from sklearn.manifold import TSNE

X = c2.drop('Class', axis=1)
y = c2['Class']
#t-SNE
X_reduced_tsne = TSNE(n_components=2, random_state=42).fit_transform(X.values)

In [35]:

# t-SNE scatter plot
import matplotlib.patches as mpatches

f, ax = plt.subplots(figsize=(24,16))


blue_patch = mpatches.Patch(color='#0A0AFF', label='No Fraud')
red_patch = mpatches.Patch(color='#AF0000', label='Fraud')

#plotting each class separately so that the colors match the legend patches
is_fraud = (y == 1).values
ax.scatter(X_reduced_tsne[~is_fraud,0], X_reduced_tsne[~is_fraud,1], c='#0A0AFF', label='No Fraud', linewidths=2)
ax.scatter(X_reduced_tsne[is_fraud,0], X_reduced_tsne[is_fraud,1], c='#AF0000', label='Fraud', linewidths=2)
ax.set_title('t-SNE', fontsize=14)

ax.grid(True)

ax.legend(handles=[blue_patch, red_patch])

Out[35]:

<matplotlib.legend.Legend at 0x232a4a771d0>

In [36]:

## Classification Algorithms


def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [37]:

# train test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [38]:

X_train = X_train.values
X_validation = X_test.values
y_train = y_train.values
y_validation = y_test.values
print('X_shapes:\n', 'X_train:', 'X_validation:\n', X_train.shape, X_validation.shape, '\n')
print('Y_shapes:\n', 'Y_train:', 'Y_validation:\n', y_train.shape, y_validation.shape)

Out[38]:

X_shapes:
 X_train: X_validation:
 (508, 30) (128, 30) 

Y_shapes:
 Y_train: Y_validation:
 (508,) (128,)

In [40]:

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
#from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

In [41]:

##Spot-Checking Algorithms

models = []

models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('SVM', SVC()))
#models.append(('XGB', XGBClassifier()))
models.append(('RF', RandomForestClassifier()))

#testing models

results = []
names = []

for name, model in models:
    #random_state only has an effect with shuffle=True (newer scikit-learn raises an error otherwise), so it is omitted
    kfold = KFold(n_splits=10)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='roc_auc')
    results.append(cv_results)
    names.append(name)
    msg = '%s: %f (%f)' % (name, cv_results.mean(), cv_results.std())
    print(msg)

Out[41]:

LR: 0.961843 (0.028858)
LDA: 0.954895 (0.030288)
KNN: 0.952829 (0.031013)
CART: 0.887152 (0.027587)
SVM: 0.958068 (0.030751)
RF: 0.945845 (0.028972)
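Each score above is the mean (and standard deviation) over 10 folds. How an unshuffled 10-fold split partitions the 508 training rows can be sketched as follows; `kfold_indices` is a hypothetical stand-in written for illustration, not scikit-learn's implementation.

```python
def kfold_indices(n_samples, n_splits=10):
    """Yield (train_idx, val_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)   # first `extra` folds get one more row
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


fold_sizes = [len(val) for _, val in kfold_indices(508, n_splits=10)]
print(fold_sizes)
```

Every row is used for validation exactly once and for training nine times, so with roughly 50 validation rows per fold the individual fold scores are noisy, and differences of one or two hundredths between models should be read with that in mind.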

In [42]:

#Compare Algorithms

fig = plt.figure(figsize=(12,10))
#creating the axes first so that the boxplot and the tick labels share it
ax = fig.add_subplot(111)
ax.set_title('Comparison of Classification Algorithms')
ax.set_xlabel('Algorithm')
ax.set_ylabel('ROC-AUC Score')
ax.boxplot(results)
ax.set_xticklabels(names)
plt.show()

Out[42]:
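The models are compared on ROC-AUC, which can be read as the probability that a randomly chosen fraud case receives a higher score than a randomly chosen non-fraud case. The sketch below computes it by brute force over all such pairs, using made-up labels and scores; `roc_auc` is a hypothetical helper, not the scikit-learn function.

```python
def roc_auc(y_true, scores):
    """Fraction of (fraud, non-fraud) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.05]   # made-up model scores
print(roc_auc(toy_labels, toy_scores))
```

In this toy example 11 of the 12 fraud/non-fraud pairs are ordered correctly. A score of 0.5 corresponds to random ranking and 1.0 to a perfect one, independent of any particular probability cutoff.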
