Data imputation tutorial:

Real-world data is rarely clean, and it often has missing observations. Rather than discarding the incomplete examples, it's important to know how to handle the missing values. This tutorial walks through some simple imputation techniques.

First we need to read in the data.

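# biocLite() has been retired; Bioconductor packages are now installed with BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("impute")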
library(ggplot2)
library(impute)
source("helper_functions.R")
# read in data 
dataset <- read.csv('data.csv')
# look at first 6 lines
head(dataset)
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave.points_mean texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave.points_worst symmetry_worst fractal_dimension_worst X
842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 NA
842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 NA
84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 NA
84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 NA
84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 NA
843786 M 12.45 15.70 82.57 477.1 0.12780 0.17000 0.1578 0.08089 23.75 103.40 741.6 0.1791 0.5249 0.5355 0.1741 0.3985 0.12440 NA

Next we are going to introduce missing values to the feature "area_mean".

# convert 10% of area_mean feature into NAs
dataset_na <- generate_NAs(dataset, 'area_mean', 0.1)
head(dataset_na[,"area_mean"], 10)
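
The generate_NAs helper comes from helper_functions.R, which is not shown in this tutorial. The sketch below illustrates what such a helper might do, assuming it replaces a random fraction of one column with NA. The name generate_NAs_sketch, the seed argument, and the toy data frame are all hypothetical; the real helper may behave differently.

```r
# Hypothetical stand-in for generate_NAs(); the real helper may differ.
generate_NAs_sketch <- function(dataset, feature, fraction, seed = 1) {
    set.seed(seed)                                # make the example reproducible
    n_missing <- round(nrow(dataset) * fraction)  # how many values to blank out
    na_rows <- sample(nrow(dataset), n_missing)   # rows chosen without replacement
    dataset[na_rows, feature] <- NA
    dataset
}

# toy data frame standing in for data.csv
toy <- data.frame(id = 1:20, area_mean = seq(100, 2000, by = 100))
toy_na <- generate_NAs_sketch(toy, "area_mean", 0.1)
sum(is.na(toy_na$area_mean))   # 2 of the 20 values are now NA
```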

Mean imputation:

The first imputation strategy we are going to implement is mean imputation. This involves estimating the missing values using the mean of the observed values. Starter code for this function has been provided below:

# impute missing values using mean imputation
mean_imputation <- function(dataset, feature) {
    # using the dataset and the feature name, impute missing values using mean imputation
    # mean() calculates the mean of a vector (use na.rm = TRUE to ignore NAs)
    # is.na() returns a logical vector that is TRUE wherever the input is NA
}
# impute missing values
dataset_mi <- mean_imputation(dataset_na, 'area_mean')
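
If you get stuck, here is one possible way to complete the starter code, shown on a toy data frame so it runs without data.csv. The name mean_imputation_sketch and the toy values are hypothetical, chosen so the block does not overwrite your own mean_imputation().

```r
# One possible approach, not the only valid solution.
mean_imputation_sketch <- function(dataset, feature) {
    values <- dataset[[feature]]
    missing <- is.na(values)                       # TRUE wherever a value is NA
    values[missing] <- mean(values, na.rm = TRUE)  # mean of the observed values only
    dataset[[feature]] <- values
    dataset
}

toy_na <- data.frame(area_mean = c(400, NA, 600, 800, NA))
toy_mi <- mean_imputation_sketch(toy_na, "area_mean")
toy_mi$area_mean   # both NAs become 600, the mean of 400, 600 and 800
```

Note that every missing entry receives the same value, so mean imputation shrinks the variance of the feature.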

Next we want to visualize how well this method works compared to the real values. To do this, we want to generate a scatterplot of the imputed values vs the real values and calculate the correlation between the two. A function has been written to do this for you. You can use it as below or challenge yourself and generate your own plot!

plot_scatterplot(real = dataset$area_mean, imputed = dataset_mi$area_mean)
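
plot_scatterplot is also defined in helper_functions.R and is not shown here. If you want to build your own comparison, the numeric part might look like the sketch below, assuming a Pearson correlation computed with cor(). The function name imputation_correlation and the toy vectors are hypothetical.

```r
# Hypothetical helper: correlation between the real and imputed values
imputation_correlation <- function(real, imputed) {
    cor(real, imputed)   # Pearson correlation by default
}

real    <- c(400, 500, 600, 800, 700)
imputed <- c(400, 600, 600, 800, 600)   # positions 2 and 5 were imputed
imputation_correlation(real, imputed)

# a basic scatterplot of the same vectors could be drawn with:
# plot(real, imputed); abline(0, 1)
```

Because the observed values are identical in both vectors, they fall exactly on the diagonal; only the imputed positions pull the correlation below 1.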

Random imputation:

Next we are going to implement random imputation. This method randomly samples from the observed data to fill in the missing values.

# impute missing values using random imputation
random_imputation <- function(dataset, feature) {
    # using the data and feature name, impute missing values using random imputation
    # sample() takes a vector and randomly draws the number of values given by size; the replace
    # argument says whether to sample with or without replacement
    # consider three steps:
    # 1) first figure out how many NAs you need to impute
    # 2) randomly sample the number of NAs you need from the observed data
    # 3) replace the NAs with the randomly sampled values and return the imputed dataset
}
# impute missing values
dataset_ri <- random_imputation(dataset_na, 'area_mean')
plot_scatterplot(real = dataset$area_mean, imputed = dataset_ri$area_mean)
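
The three steps in the starter code could be sketched as follows, again on a toy data frame. The name random_imputation_sketch is hypothetical, and sampling with replacement is an assumption; sampling without replacement would also fit the description as long as there are enough observed values.

```r
# One possible approach, not the only valid solution.
random_imputation_sketch <- function(dataset, feature) {
    values <- dataset[[feature]]
    missing <- is.na(values)
    n_missing <- sum(missing)                  # 1) how many NAs need imputing
    observed <- values[!missing]
    # 2) draw from the observed values; sampling indices avoids sample()'s
    #    special handling of a length-one numeric vector
    draws <- observed[sample(length(observed), n_missing, replace = TRUE)]
    values[missing] <- draws                   # 3) fill in the NAs
    dataset[[feature]] <- values
    dataset
}

set.seed(42)   # make the random draws reproducible
toy_na <- data.frame(area_mean = c(400, NA, 600, 800, NA))
toy_ri <- random_imputation_sketch(toy_na, "area_mean")
toy_ri$area_mean
```

Unlike mean imputation, this preserves the spread of the observed values, but each imputed value is unrelated to the rest of its row.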

K-Nearest Neighbours:

A third approach to imputation is to use a machine learning model to predict the missing values. The feature with missing data is treated as the response, and the known values of the other features serve as predictors. One widely used method is k-nearest neighbours (KNN). For each missing value, the algorithm finds the k data points closest to that observation and assigns either the majority class (if the feature is categorical) or the average of the neighbours' values (if the feature is continuous).
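
To make the idea concrete, here is a from-scratch sketch of KNN imputation for a continuous feature in base R. It is an illustration only and is not how impute.knn or caret work internally. The function name knn_impute_sketch, its predictors argument, and the toy data are hypothetical, and the sketch assumes the predictor columns are numeric with no missing values.

```r
# From-scratch illustration; this is not impute.knn() from the impute package.
knn_impute_sketch <- function(dataset, feature, predictors, k) {
    target <- dataset[[feature]]
    missing <- which(is.na(target))
    observed <- which(!is.na(target))
    # scale the predictors so no single feature dominates the distance
    x <- scale(as.matrix(dataset[, predictors, drop = FALSE]))
    for (i in missing) {
        # Euclidean distance from row i to every row with an observed target
        diffs <- sweep(x[observed, , drop = FALSE], 2, x[i, ])
        d <- sqrt(rowSums(diffs^2))
        nearest <- observed[order(d)[seq_len(min(k, length(observed)))]]
        target[i] <- mean(target[nearest])   # average of the k nearest neighbours
    }
    dataset[[feature]] <- target
    dataset
}

toy_na <- data.frame(
    radius_mean = c(10, 11, 12, 20, 21, 22),
    area_mean   = c(300, NA, 450, 1250, NA, 1500)
)
toy_knn <- knn_impute_sketch(toy_na, "area_mean", "radius_mean", k = 2)
toy_knn$area_mean   # the NAs become 375 and 1375, the means of their two nearest neighbours
```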

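For another implementation, check out the documentation for caret's preProcess function, which supports KNN imputation through method = "knnImpute": https://www.rdocumentation.org/packages/caret/versions/6.0-79/topics/preProcess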

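# impute missing values using KNN
knn_imputation <- function(dataset, k) {
    # see function impute.knn from the impute package
    # impute.knn expects a numeric matrix, so drop non-numeric columns such as diagnosis first
    # it returns a list; the imputed matrix is stored in the $data element
}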

Try running with several different values of k and see how that changes the correlation of the imputed values with the real values.

dataset_knn <- knn_imputation(dataset_na, k = 10)
plot_scatterplot(real = dataset$area_mean, imputed = dataset_knn$area_mean)