Kernel: R (SageMath)

Week 1 Homework

Question 1:

A good application of classifiers is "employee retention".

Companies looking to retain their employees might use a classifier that identifies employees likely to leave the company within the next 6 months. Good predictors for training this classifier might include:

  • number of promotions within last 18 months

  • difference between the employee's salary and the industry-average salary for their position

  • overall years of experience

  • years of tenure at the company

  • marital status
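As a sketch of how such a retention classifier could be trained (everything here — the data frame, column names, and effect sizes — is synthetic and purely illustrative, not part of the assignment), a logistic regression in R:

```r
# Synthetic employee data; all names and values are invented for illustration.
set.seed(42)
n <- 200
employees <- data.frame(
  promotions_18mo  = rpois(n, 1),            # promotions in last 18 months
  salary_gap       = rnorm(n, 0, 5000),      # salary minus industry average
  years_experience = runif(n, 0, 30),
  tenure_years     = runif(n, 0, 15),
  married          = rbinom(n, 1, 0.5)
)
# Synthetic label: fewer promotions and lower relative pay -> more likely to leave.
logit <- -0.5 - 0.8 * employees$promotions_18mo - 2e-4 * employees$salary_gap
employees$left <- rbinom(n, 1, plogis(logit))

# Fit a logistic-regression classifier on the five predictors above.
fit <- glm(left ~ ., data = employees, family = binomial)
coef(fit)
```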

Question 2.1:

The classifier's linear equation is in the format of:

a0 + a1*A1 + a2*A2 + a3*A3 + a4*A8 + a5*A9 + a6*A10 + a7*A11 + a8*A12 + a9*A14 + a10*A15 = 0

where {A1, ..., A15} are the feature columns from credit_card_data-headers.txt (only the ten columns A1–A3, A8–A12, A14, and A15 appear in the model).

The coefficients a1...a10 and the intercept a0 are computed in the cells below. This classifier had an accuracy of 86.39% on the overall dataset.

library(kernlab)

data <- read.table('https://d37djvu3ytnwxt.cloudfront.net/assets/courseware/v1/e39a3df780dacd5503df6a8322d72cd2/asset-v1:GTx+ISYE6501x+2T2017+type@asset+block/credit_card_data-headers.txt', sep="\t", header = TRUE)
model <- ksvm(as.matrix(data[,1:10]), as.factor(data[,11]), type="C-svc", kernel="vanilladot", C=100, scaled=TRUE)
Setting default kernel parameters
# calculate a1...am
a <- colSums(data[model@SVindex,1:10] * model@coef[[1]])
a
A1    -0.000466036176627327
A2    -0.0140534983606244
A3    -0.0081688661743442
A8     0.0101292226736795
A9     0.501609468692229
A10   -0.00140343386065389
A11    0.00129121684002342
A12   -0.000266898857269382
A14   -0.206754961642446
A15  558.33559056503
# calculate a0
a0 <- sum(a*data[1,1:10]) - model@b
a0
-41.6013574057321
# see what the model predicts
pred <- predict(model, data[,1:10])
# see what fraction of the model's predictions match the actual classification
sum(pred == data[,11]) / nrow(data)
0.863914373088685
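To spell out the decision rule behind the accuracy above: the SVM predicts class 1 when a0 + sum(a_i * A_i) >= 0, and class 0 otherwise. A sketch with the coefficients printed above hard-coded (rounded), applied to a hypothetical feature vector — the actual model scales its inputs, so the specific numbers here are illustrative only:

```r
# Coefficients copied (rounded) from the ksvm output above.
a <- c(A1 = -0.000466, A2 = -0.014053, A3 = -0.008169, A8 = 0.010129,
       A9 = 0.501609, A10 = -0.001403, A11 = 0.001291, A12 = -0.000267,
       A14 = -0.206755, A15 = 558.335591)
a0 <- -41.601357

# Hypothetical (already-scaled) feature vector for one applicant.
x <- c(1, 0.5, 0.2, 1, 1, 0, 1, 0, 100, 0.1)

decision <- a0 + sum(a * x)          # which side of the hyperplane?
predicted_class <- ifelse(decision >= 0, 1, 0)
predicted_class                      # 0 for this hypothetical point
```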

Question 2.2:

With K = 12, the kknn classifier correctly classified 85.3211% of the samples in the dataset, measured by leave-one-out cross-validation: each sample is held out in turn, the model is fit on the remaining data, and the fitted model is then evaluated on the held-out sample.

library(kknn)
k_results <- rep(0, times = (100*nrow(data)))
dim(k_results) <- c(nrow(data), 100)
for (kn in 1:100) {
  for (i in 1:nrow(data)) {
    k_results[i,kn] <- round(fitted(kknn(R1~., data[-i,], data[i,], k = kn, scale = TRUE)), 0)
  }
}
predict <- colSums(k_results == data[,11]) / nrow(data)
bestK <- which.max(predict)
perf <- max(predict)
print(bestK)
print(perf)
[1] 12
[1] 0.853211
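For reference, train.kknn from the same package performs this leave-one-out search internally over k = 1..kmax. A sketch on R's built-in iris data (the credit-card file is remote; kmax = 30 is an arbitrary choice):

```r
library(kknn)

# train.kknn runs leave-one-out cross-validation for each k up to kmax
# and records the k with the lowest misclassification rate.
fit <- train.kknn(Species ~ ., data = iris, kmax = 30, scale = TRUE)
fit$best.parameters$k
```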

Question 3a

Randomly split the original credit card dataset into the following three subsets:

  • Training dataset of 60% of the data

  • Evaluation dataset of 20% of the data

  • Test dataset of 20% of the data

The best kknn model, trained on the training dataset and evaluated on the evaluation dataset, used the default parameters with K = 29 and achieved a classification accuracy of 90.4%.
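The hard-coded cut points (400/525/653) can also be derived from the proportions themselves. A sketch using only the dataset's known size of 653 rows (index vectors only, no data file needed):

```r
set.seed(1)
n <- 653                                               # rows in the credit-card data
idx <- sample(n)                                       # random permutation of row indices
train_idx <- idx[1:floor(0.6 * n)]                     # 60% -> 391 rows
eval_idx  <- idx[(floor(0.6 * n) + 1):floor(0.8 * n)]  # 20% -> 131 rows
test_idx  <- idx[(floor(0.8 * n) + 1):n]               # 20% -> 131 rows
c(length(train_idx), length(eval_idx), length(test_idx))
```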

# Create a random permutation of the row indices.
shuffled_data <- sample(1:653, 653)
# ~60% for training.
train <- data[shuffled_data[1:400], ]
# ~20% for model selection (evaluation).
eval <- data[shuffled_data[401:525], ]
# ~20% held out for the final test.
test <- data[shuffled_data[526:653], ]
# Prepare a data frame to receive looped results.
eval_results <- data.frame()
for (kvalue in seq(1, 199, by = 1)) {
  model <- train.kknn(R1~., data = train, ks = kvalue, scale = TRUE)
  predict <- round(predict(model, eval[, -11]))
  # Compare the predictions with the actual values:
  testresult <- sum(predict == eval[, 11])
  resultpct <- testresult / length(predict)
  eval_results[kvalue, 1] <- kvalue
  eval_results[kvalue, 2] <- resultpct
}
# Pick the k with the highest evaluation accuracy.
bestkval <- eval_results[which.max(eval_results$V2), ]
bestkval
   V1    V2
29 29 0.904

Question 3b

Assessed on the test dataset, the kknn model with K = 29 has an accuracy of 85.9375%.

# Get classifier performance on the held-out test data.
model <- train.kknn(R1~., data = train, ks = 29, scale = TRUE)
predict <- round(predict(model, test[, -11]))
testresult <- sum(predict == test[, 11])
testpct <- testresult / length(predict)
testpct
0.859375