Path: blob/master/unbalanced/unbalanced_code/unbalanced.R
# logistic regression

# environment setting
library(ROCR)
library(grid)
library(broom)
library(caret)
library(tidyr)
library(dplyr)
library(scales)
library(ggplot2)
library(ggthemr)
library(ggthemes)
library(gridExtra)
library(data.table)
setwd("/Users/mingyuliu/personal/machine-learning/unbalanced")

# read in the HR dataset
data <- fread( list.files( "data", full.names = TRUE )[2] )
str(data)

# use summary to check if any columns contain missing values such as NAs
summary(data)

# find highly correlated columns to exclude from the model
findCorrelation( cor(data), cutoff = .75, names = TRUE )

# from this probability table we can see that 16 percent of
# our employees have left
prop.table( table(data$left) )


# -------------------------------------------------------------------------
# Model Training
# -------------------------------------------------------------------------

# convert the Newborn column to a factor variable
data[ , Newborn := as.factor(Newborn) ]

# split the dataset into two parts : 80 percent of the dataset will be used to
# train the model, while the rest will be used to evaluate the accuracy of the
# model, i.e. the out of sample error
set.seed(4321)
test <- createDataPartition( data$left, p = .2, list = FALSE )
data_train <- data[ -test, ]
data_test  <- data[ test, ]
rm(data)

model_glm <- glm( left ~ . , data = data_train, family = binomial(logit) )
summary_glm <- summary(model_glm)

# p-values and pseudo r squared
list( summary_glm$coefficient,
	  1 - ( summary_glm$deviance / summary_glm$null.deviance ) )
# the p-values of all the coefficients indicate significance


# -------------------------------------------------------------------------
# Predicting and Assessing the Model
# -------------------------------------------------------------------------

# obtain the predicted probability that an employee will leave in the future
# on the train and test set, then perform a quick evaluation using a
# double density plot
data_train$prediction <- predict( model_glm, newdata = data_train, type = "response" )
data_test$prediction  <- predict( model_glm, newdata = data_test , type = "response" )

# given that our model's final objective is to classify new instances
# into one of two categories, whether the employee will leave or not,
# we want the model to give high scores to positive
# instances ( 1 : employee left ) and low scores ( 0 : employee stayed ).

# distribution of the prediction score grouped by known outcome
ggplot( data_train, aes( prediction, color = as.factor(left) ) ) +
geom_density( size = 1 ) +
ggtitle( "Training Set's Predicted Score" ) +
scale_color_economist( name = "data", labels = c( "negative", "positive" ) ) +
theme_economist()

# Ideally you want the two score distributions to be well separated,
# with the scores of the negative instances on the left and the scores of the
# positive instances on the right.
# In the current case, both distributions are slightly skewed to the left.
# Not only is the predicted probability for the negative outcomes low, but
# the probability for the positive outcomes is also lower than it should be.
# The reason is that positive instances ( employees that left ) make up only
# 16 percent of our dataset, so the predicted scores get pulled toward
# lower numbers because the majority of the data consists of negative instances.
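
# a quick numeric check of the skew described above ( a minimal sketch using
# only the columns already created in this script ) : a logistic regression
# fitted with an intercept has a mean fitted probability equal to the positive
# class proportion, so the average training score sits near the .16 base rate
mean(data_train$prediction)                             # close to the .16 base rate
tapply( data_train$prediction, data_train$left, mean )  # per-class average score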

# A slight digression : when developing models for prediction, we all know that
# we want the model to be as accurate as possible, or in other words, to do a
# good job predicting the target variable on out of sample observations.

# Our plot, however, can actually tell us a very important thing :
# accuracy will not be a suitable measurement for this model.

# We'll show why below :

# Since the prediction of a logistic regression model is a probability,
# in order to use it as a classifier we'll have to choose a cutoff value,
# also called a threshold, where scores above this value are classified as
# positive and those below as negative. We'll be using the term cutoff for
# the rest of the documentation.

# Here we'll use a function to loop through several cutoff values and
# compute the model's accuracy on both the training and testing set
source("unbalanced_code/unbalanced_functions.R")

accuracy_info <- AccuracyCutoffInfo( train = data_train, test = data_test,
									 predict = "prediction", actual = "left" )
# define the theme for the next plot
ggthemr("light")
accuracy_info$plot


# from the output, you can see that starting from a cutoff value of .6,
# the accuracy for both the training and testing set keeps growing, showing
# no sign of decreasing at all.
# we'll visualize the confusion matrix of the test set to see what's causing this
cm_info <- ConfusionMatrixInfo( data = data_test, predict = "prediction",
								actual = "left", cutoff = .6 )
ggthemr("flat")
cm_info$plot

# wiki : https://en.wikipedia.org/wiki/Sensitivity_and_specificity#Worked_example
# The above plot depicts the tradeoff we face upon choosing a reasonable cutoff :
# if we increase the cutoff value, the number of true negatives (TN) increases
# and the number of true positives (TP) decreases. Equivalently, the number of
# false positives (FP) is lowered, while the number of false negatives (FN) rises.
# Because we have very few positive instances, our model is less likely to make
# a false negative mistake, so if we keep raising the cutoff value we'll
# actually increase our model's accuracy, since we have a higher chance of
# turning false positives into true negatives.

# e.g. simply predicting every test set outcome as 0 is already quite "accurate"
prop.table( table( data_test$left ) )

# Section conclusion :
# accuracy is not a suitable indicator for a model trained on data with an
# unbalanced class distribution or unequal error costs

# -------------------------------------------------------------------------
# Choosing the Suitable Cutoff Value
# -------------------------------------------------------------------------

# use the roc curve to determine the cutoff;
# it plots the false positive rate (FPR) on the x-axis and the
# true positive rate (TPR) on the y-axis
print(cm_info$data)

# reset to the default ggplot theme
ggthemr_reset()
# different costs for false negatives and false positives
cost_fp <- 100
cost_fn <- 200
roc_info <- ROCInfo( data = cm_info$data, predict = "predict",
					 actual = "actual", cost.fp = cost_fp, cost.fn = cost_fn )
grid.draw(roc_info$plot)


# replot the confusion matrix using the chosen cutoff
cm_info <- ConfusionMatrixInfo( data = data_test, predict = "prediction",
								actual = "left", cutoff = roc_info$cutoff )
ggthemr("flat")
cm_info$plot
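
# a minimal sketch of what a cost-based cutoff search boils down to, assuming
# only the data_test columns created above ( ROCInfo's actual internals live in
# unbalanced_functions.R and may differ ) : sweep the cutoff and pick the value
# minimizing the total misclassification cost
cutoffs <- seq( .01, .99, by = .01 )
total_cost <- sapply( cutoffs, function(threshold) {
	predicted <- as.numeric( data_test$prediction >= threshold )
	fp <- sum( predicted == 1 & data_test$left == 0 )
	fn <- sum( predicted == 0 & data_test$left == 1 )
	cost_fp * fp + cost_fn * fn
})
cutoffs[ which.min(total_cost) ]  # should land near roc_info$cutoff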

# -------------------------------------------------------------------------
# Interpretation
# -------------------------------------------------------------------------

# tidy from the broom package
coefficient <- tidy(model_glm)[ , c( "term", "estimate", "statistic" ) ]

# exponentiate the coefficients to obtain odds ratios
coefficient$estimate <- exp( coefficient$estimate )

# for a one unit increase in satisfaction (S), the odds of leaving the company
# (versus not leaving) change by a factor of
coefficient[ coefficient$term == "S", "estimate" ]

# use the model to predict data with unknown outcomes, "HR_unknown.csv"
# specify the columns' classes
col_class <- sapply( data_test, class )[1:6]
data <- read.csv( list.files( "data", full.names = TRUE )[1], colClasses = col_class )
data$prediction <- predict( model_glm, newdata = data, type = "response" )

# keep only the employees whose score exceeds the chosen cutoff
data <- data[ data$prediction >= roc_info$cutoff, ]

# time spent in the company
median_tic <- data %>% group_by(TIC) %>%
			  summarise( prediction = median(prediction), count = n() )
ggthemr("fresh")
ggplot( median_tic, aes( TIC, prediction, size = count ) ) +
geom_point() + theme( legend.position = "none" ) +
labs( title = "Time and Employee Attrition", y = "Attrition Probability",
	  x = "Time Spent in the Company" )

# last project evaluation
data$LPECUT <- cut( data$LPE, breaks = quantile(data$LPE), include.lowest = TRUE )
median_lpe <- data %>% group_by(LPECUT) %>%
			  summarise( prediction = median(prediction), count = n() )

ggplot( median_lpe, aes( LPECUT, prediction ) ) +
geom_point( aes( size = count ), color = "royalblue3" ) +
theme( legend.position = "none" ) +
labs( title = "Last Project's Evaluation and Employee Attrition",
	  y = "Attrition Probability", x = "Last Project's Evaluation by Client" )

# This is probably an indication that it'll be worth trying out other
# classification algorithms, since logistic regression assumes a monotonic
# relationship ( either entirely increasing or decreasing ) between the input
# parameters and the outcome ( the same is true for linear regression ) :
# if more of a quantity is good, then much more of that quantity is better.
# This is often not the case in the real world.

# given this probability, we can prioritize our actions by factoring in how
# much we wish to retain each of these employees. Recall that our dataset
# contains the performance information of the employees ( last project
# evaluation, LPE ).
# given this table, we can easily create a visualization to tell the story
ggplot( data, aes( prediction, LPE ) ) +
geom_point() +
ggtitle( "Performance vs. Probability to Leave" )

# first we have the employees that are underperforming; we should probably
# improve their performance, or you could say we can't wait for them to leave....
# for employees that are not likely to leave, we should manage them as usual.
# in the short run, we should focus on those that perform well but
# also have a high probability of leaving.

# the next thing we can do is to quantify our priority by
# multiplying the probability of leaving with the performance.
# we'll also use the row names of the data.frame
# to serve as imaginary employee ids.
# We then obtain a priority score, where the score will be high for
# the employees we wish to act upon as soon as possible, and low for the others
result <- data %>%
		  mutate( priority = prediction * LPE ) %>%
		  mutate( id = rownames(data) ) %>%
		  arrange( desc(priority) )

# after obtaining this result, we can schedule a face to face interview with
# the employees at the top of the list.
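
# a quick look at the top of the priority list ( illustrative; the column
# subset here is chosen for readability, not taken from the source )
head( result[ , c( "id", "LPE", "prediction", "priority" ) ] )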

# using classification in this example enabled us to detect events that will
# happen in the future, namely which employees are more likely to leave the
# company. Based on this information, we can come up with a more efficient
# strategy to cope with the matter at hand.


# ----------------------------------------------------------------------
# document later, strange statistic test
# http://www.r-bloggers.com/evaluating-logistic-regression-models/
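
# as a hedged preview of the link above ( which walks through further ways of
# evaluating logistic regression models ), one such test is a likelihood ratio
# test of the fitted model against an intercept-only null model; that this is
# the test meant there is an assumption, not something stated in this script
null_glm <- glm( left ~ 1, data = data_train, family = binomial(logit) )
anova( null_glm, model_glm, test = "Chisq" )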