Kernel: R (SageMath)

Week 1 Homework

Question 1:

A good application of classifiers is "employee retention".

Companies looking to retain their employees might use a classifier that identifies employees likely to leave the company within the next 6 months. Good predictors for training this classifier might include:

  • number of promotions within last 18 months

  • difference between the employee's salary and the industry-average salary for their position

  • overall years of experience

  • years of tenure at the company

  • marital status
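As a sketch of how such a retention classifier could be trained (everything here — the data frame, column names, and effect sizes — is synthetic and purely illustrative, not part of the assignment), a logistic regression in R:

```r
# Synthetic employee data; all names and values are invented for illustration.
set.seed(42)
n <- 200
employees <- data.frame(
  promotions_18mo  = rpois(n, 1),            # promotions in last 18 months
  salary_gap       = rnorm(n, 0, 5000),      # salary minus industry average
  years_experience = runif(n, 0, 30),
  tenure_years     = runif(n, 0, 15),
  married          = rbinom(n, 1, 0.5)
)
# Synthetic label: fewer promotions and lower relative pay -> more likely to leave.
logit <- -0.5 - 0.8 * employees$promotions_18mo - 2e-4 * employees$salary_gap
employees$left <- rbinom(n, 1, plogis(logit))

# Fit a logistic-regression classifier on the five predictors above.
fit <- glm(left ~ ., data = employees, family = binomial)
coef(fit)
```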

Question 2.1:

The classifier's linear equation is in the format of:

a0 + a1*A1 + a2*A2 + a3*A3 + a4*A8 + a5*A9 + a6*A10 + a7*A11 + a8*A12 + a9*A14 + a10*A15 = 0

where {A1, ..., A15} are the feature columns from credit_card_data-headers.txt (only the ten columns A1–A3, A8–A12, A14, and A15 appear in the model).

The coefficients a1...a10 and the intercept a0 are computed in the cells below. This classifier had an accuracy of 86.39% on the overall dataset.

library(kernlab)

data <- read.table('https://d37djvu3ytnwxt.cloudfront.net/assets/courseware/v1/e39a3df780dacd5503df6a8322d72cd2/asset-v1:GTx+ISYE6501x+2T2017+type@asset+block/credit_card_data-headers.txt', sep="\t", header = TRUE)
model <- ksvm(as.matrix(data[,1:10]), as.factor(data[,11]), type="C-svc", kernel="vanilladot", C=100, scaled=TRUE)
Setting default kernel parameters
# calculate a1...am
a <- colSums(data[model@SVindex,1:10] * model@coef[[1]])
a
A1    -0.000466036176627327
A2    -0.0140534983606244
A3    -0.0081688661743442
A8     0.0101292226736795
A9     0.501609468692229
A10   -0.00140343386065389
A11    0.00129121684002342
A12   -0.000266898857269382
A14   -0.206754961642446
A15  558.33559056503
# calculate a0
a0 <- sum(a*data[1,1:10]) - model@b
a0
-41.6013574057321
# see what the model predicts
pred <- predict(model, data[,1:10])
# see what fraction of the model's predictions match the actual classification
sum(pred == data[,11]) / nrow(data)
0.863914373088685
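To spell out the decision rule behind the accuracy above: the SVM predicts class 1 when a0 + sum(a_i * A_i) >= 0, and class 0 otherwise. A sketch with the coefficients printed above hard-coded (rounded), applied to a hypothetical feature vector — the actual model scales its inputs, so the specific numbers here are illustrative only:

```r
# Coefficients copied (rounded) from the ksvm output above.
a <- c(A1 = -0.000466, A2 = -0.014053, A3 = -0.008169, A8 = 0.010129,
       A9 = 0.501609, A10 = -0.001403, A11 = 0.001291, A12 = -0.000267,
       A14 = -0.206755, A15 = 558.335591)
a0 <- -41.601357

# Hypothetical (already-scaled) feature vector for one applicant.
x <- c(1, 0.5, 0.2, 1, 1, 0, 1, 0, 100, 0.1)

decision <- a0 + sum(a * x)          # which side of the hyperplane?
predicted_class <- ifelse(decision >= 0, 1, 0)
predicted_class                      # 0 for this hypothetical point
```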

Question 2.2:

With K = 12, the kknn classifier correctly classified 85.3211% of the samples in the dataset, measured by leave-one-out cross-validation: each sample is held out in turn, the model is fit on the remaining data, and the fitted model is then evaluated on the held-out sample.

library(kknn)
k_results <- rep(0, times = (100*nrow(data)))
dim(k_results) <- c(nrow(data), 100)
for (kn in 1:100) {
  for (i in 1:nrow(data)) {
    k_results[i,kn] <- round(fitted(kknn(R1~., data[-i,], data[i,], k = kn, scale = TRUE)), 0)
  }
}
predict <- colSums(k_results == data[,11]) / nrow(data)
bestK <- which.max(predict)
perf <- max(predict)
print(bestK)
print(perf)
[1] 12
[1] 0.853211
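For reference, train.kknn from the same package performs this leave-one-out search internally over k = 1..kmax. A sketch on R's built-in iris data (the credit-card file is remote; kmax = 30 is an arbitrary choice):

```r
library(kknn)

# train.kknn runs leave-one-out cross-validation for each k up to kmax
# and records the k with the lowest misclassification rate.
fit <- train.kknn(Species ~ ., data = iris, kmax = 30, scale = TRUE)
fit$best.parameters$k
```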

Question 3a

Randomly split the original credit card dataset into the following three subsets:

  • Training dataset of 60% of the data

  • Evaluation dataset of 20% of the data

  • Test dataset of 20% of the data

The best kknn model, trained on the training dataset and evaluated on the evaluation dataset, used the default parameters with K = 29 and achieved a classification accuracy of 90.4%.
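The hard-coded cut points (400/525/653) can also be derived from the proportions themselves. A sketch using only the dataset's known size of 653 rows (index vectors only, no data file needed):

```r
set.seed(1)
n <- 653                                               # rows in the credit-card data
idx <- sample(n)                                       # random permutation of row indices
train_idx <- idx[1:floor(0.6 * n)]                     # 60% -> 391 rows
eval_idx  <- idx[(floor(0.6 * n) + 1):floor(0.8 * n)]  # 20% -> 131 rows
test_idx  <- idx[(floor(0.8 * n) + 1):n]               # 20% -> 131 rows
c(length(train_idx), length(eval_idx), length(test_idx))
```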

# Create a random permutation of the row indices.
shuffled_data <- sample(1:653, 653)
# ~60% for training.
train <- data[shuffled_data[1:400], ]
# ~20% for model selection (evaluation).
eval <- data[shuffled_data[401:525], ]
# ~20% held out for the final test.
test <- data[shuffled_data[526:653], ]
# Prepare a data frame to receive looped results.
eval_results <- data.frame()
for (kvalue in seq(1, 199, by = 1)) {
  model <- train.kknn(R1~., data = train, ks = kvalue, scale = TRUE)
  predict <- round(predict(model, eval[, -11]))
  # Compare the predictions with the actual values:
  testresult <- sum(predict == eval[, 11])
  resultpct <- testresult / length(predict)
  eval_results[kvalue, 1] <- kvalue
  eval_results[kvalue, 2] <- resultpct
}
# Pick the k with the highest evaluation accuracy.
bestkval <- eval_results[which.max(eval_results$V2), ]
bestkval
   V1    V2
29 29 0.904

Question 3b

Assessed on the test dataset, the kknn model with K = 29 has an accuracy of 85.9375%.

# Get classifier performance on the held-out test data.
model <- train.kknn(R1~., data = train, ks = 29, scale = TRUE)
predict <- round(predict(model, test[, -11]))
testresult <- sum(predict == test[, 11])
testpct <- testresult / length(predict)
testpct
0.859375