CoCalc -- data_imputation.ipynb

Kernel: R (R-Project)

Data imputation tutorial:

Real-world data is rarely "clean": some observations will have missing values. Rather than discarding those examples, it is important to know how to handle missing data. This tutorial walks through three simple imputation techniques: mean imputation, random imputation, and k-nearest neighbours.

First we need to read in the data.

In [0]:

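# install the Bioconductor package "impute"
# (biocLite() has been retired; BiocManager replaces it on current R versions)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("impute")
library(ggplot2)
library(impute)
# helper_functions.R is expected to define generate_NAs() and plot_scatterplot()
source("helper_functions.R")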

In [1]:

# read in data 
dataset <- read.csv('data.csv')
# look at first 6 lines
head(dataset)

id	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	concave.points_mean	⋯	texture_worst	perimeter_worst	area_worst	smoothness_worst	compactness_worst	concavity_worst	concave.points_worst	symmetry_worst	fractal_dimension_worst	X
842302	M	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.3001	0.14710	⋯	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890	NA
842517	M	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.0869	0.07017	⋯	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902	NA
84300903	M	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.1974	0.12790	⋯	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758	NA
84348301	M	11.42	20.38	77.58	386.1	0.14250	0.28390	0.2414	0.10520	⋯	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300	NA
84358402	M	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.1980	0.10430	⋯	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678	NA
843786	M	12.45	15.70	82.57	477.1	0.12780	0.17000	0.1578	0.08089	⋯	23.75	103.40	741.6	0.1791	0.5249	0.5355	0.1741	0.3985	0.12440	NA

Next we are going to introduce missing values to the feature "area_mean".

In [0]:

# convert 10% of area_mean feature into NAs
dataset_na <- generate_NAs(dataset, 'area_mean', 0.1)
head(dataset_na[,"area_mean"], 10)
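
The body of generate_NAs() lives in helper_functions.R and is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch. The name generate_NAs_sketch, its seed argument, and the toy data frame are assumptions made for illustration, not the actual helper.

```r
# Hypothetical stand-in for generate_NAs(); the real helper is not shown here.
generate_NAs_sketch <- function(dataset, feature, proportion, seed = 1) {
    set.seed(seed)  # fixed seed so the same rows are blanked on every run
    n_missing <- round(nrow(dataset) * proportion)
    # pick distinct row positions and overwrite the feature with NA there
    na_rows <- sample(nrow(dataset), n_missing)
    dataset[na_rows, feature] <- NA
    dataset
}

# toy data: the six area_mean values printed by head(dataset) above
toy <- data.frame(area_mean = c(1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1))
toy_na <- generate_NAs_sketch(toy, "area_mean", 0.5)
sum(is.na(toy_na$area_mean))  # number of values that were blanked
```

Keeping the original dataset untouched, as the notebook does, is what later makes it possible to compare imputed values against the real ones.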

Mean imputation:

The first imputation strategy we are going to implement is mean imputation. This involves estimating the missing values using the mean of the observed values. Starter code for this function has been provided below:

In [0]:

# impute missing values using mean imputation 
mean_imputation <- function(dataset, feature) {
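    # using the dataset and the feature name, impute missing values using mean imputation
    # mean() calculates the mean of a vector (use na.rm = TRUE to ignore NAs)
    # is.na() returns a logical vector that is TRUE wherever the value is NA;
    # which(is.na(x)) gives the corresponding indices
    # remember to return the imputed dataset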
}
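
If you get stuck, here is one possible way to fill in the starter code. It is only a sketch: the name mean_imputation_sketch and the toy data frame are made up for illustration, and it assumes the feature is a numeric column.

```r
# One possible mean imputation, assuming `feature` names a numeric column.
mean_imputation_sketch <- function(dataset, feature) {
    values <- dataset[[feature]]
    missing <- is.na(values)                        # TRUE where the value is NA
    values[missing] <- mean(values, na.rm = TRUE)   # mean of the observed values only
    dataset[[feature]] <- values
    dataset
}

toy <- data.frame(area_mean = c(10, NA, 30, NA, 50))
mean_imputation_sketch(toy, "area_mean")$area_mean
# 10 30 30 30 50
```

Every missing entry receives the same constant, so the imputed points will sit on a single horizontal line in a scatterplot against the real values, and the variance of the feature shrinks.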

In [0]:

# impute missing values
dataset_mi <- mean_imputation(dataset_na, 'area_mean')

Next we want to visualize how well this method works compared to the real values. To do this, we want to generate a scatterplot of the imputed values vs the real values and calculate the correlation between the two. A function has been written to do this for you. You can use it as below or challenge yourself and generate your own plot!

In [0]:

plot_scatterplot(real = dataset$area_mean, imputed = dataset_mi$area_mean)
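
plot_scatterplot() comes from helper_functions.R, so exactly how it computes the correlation is not visible here. If you want to compute it yourself, the sketch below shows one way. The function name imputation_correlation, the was_missing argument, and the toy vectors are assumptions for illustration.

```r
# Correlation between real and imputed values, optionally restricted to the
# positions that were actually imputed (hypothetical helper, not from the notebook).
imputation_correlation <- function(real, imputed, was_missing = NULL) {
    if (!is.null(was_missing)) {
        real <- real[was_missing]
        imputed <- imputed[was_missing]
    }
    cor(real, imputed)  # Pearson correlation
}

real        <- c(10, 20, 30, 40, 50, 60)
imputed     <- c(10, 25, 30, 35, 50, 70)   # positions 2, 4 and 6 were imputed
was_missing <- c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)

all_rows     <- imputation_correlation(real, imputed)
imputed_only <- imputation_correlation(real, imputed, was_missing)
```

Values that were never missing match the real data exactly, so a correlation over all rows looks better than the imputation alone deserves. In the notebook, was_missing would be is.na(dataset_na$area_mean). Note that for mean imputation the restricted correlation is undefined, because all imputed values are identical.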

Random imputation:

Next we are going to implement random imputation. This method randomly samples from the observed data to fill in the missing values.

In [0]:

# impute missing values using random imputation 
random_imputation <- function(dataset, feature) {
    # using the data and feature name, impute missing values using random imputation
    # sample() takes a vector and randomly draws `size` values from it; the replace
    # argument says whether to sample with or without replacement
    # consider three steps:
    # 1) first figure out how many NAs you need to impute
    # 2) randomly sample that many values from the observed data
    # 3) replace the NAs with the randomly sampled values and return the imputed dataset
}
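
The three steps in the starter code could be implemented along the following lines. This is a sketch rather than the reference solution: the name random_imputation_sketch and the toy data are illustrative, and sampling with replacement is a choice made here, not something the notebook prescribes.

```r
# One possible random imputation, assuming `feature` names a numeric column.
random_imputation_sketch <- function(dataset, feature) {
    values <- dataset[[feature]]
    missing <- is.na(values)
    n_missing <- sum(missing)              # step 1: how many NAs to fill
    observed <- values[!missing]
    # step 2: draw positions rather than values, because sample(x, ...) treats a
    # single number x as the range 1:x
    draws <- observed[sample.int(length(observed), n_missing, replace = TRUE)]
    values[missing] <- draws               # step 3: fill in and return
    dataset[[feature]] <- values
    dataset
}

set.seed(42)  # for a reproducible example
toy <- data.frame(area_mean = c(10, NA, 30, NA, 50))
toy_ri <- random_imputation_sketch(toy, "area_mean")
```

Unlike mean imputation, this keeps the spread of the feature roughly intact, but each imputed value is unrelated to the row it lands in, so the agreement with the real values is usually weak.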

In [0]:

# impute missing values
dataset_ri <- random_imputation(dataset_na, 'area_mean')

In [0]:

plot_scatterplot(real = dataset$area_mean, imputed = dataset_ri$area_mean)

K-Nearest Neighbours:

A third approach to data imputation is to use a machine learning model to predict the missing values. The feature with missing data is treated as the response, and the known values of the other features are used as predictors. One widely used method is k-nearest neighbours (KNN). For each observation with a missing value, the algorithm finds the k observations closest to it (based on the other features) and fills in the missing value with either the majority class, if the feature is categorical, or the average of the neighbours' values, if the feature is continuous.

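The starter code below points to impute.knn from the impute package. The caret package offers an alternative through preProcess (method = "knnImpute"); check out its documentation: https://www.rdocumentation.org/packages/caret/versions/6.0-79/topics/preProcess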
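
To make the mechanics of KNN imputation concrete, here is a small sketch in base R for a continuous feature. It is not how impute.knn or caret implement it. The function name knn_impute_feature and the exclude argument are made up for illustration, and the sketch assumes that all other numeric columns are fully observed and non-constant.

```r
# Illustrative KNN imputation for one continuous feature (base R only).
# Assumes every predictor column is numeric, complete, and non-constant.
knn_impute_feature <- function(dataset, feature, k, exclude = character(0)) {
    numeric_cols <- names(dataset)[sapply(dataset, is.numeric)]
    predictors <- setdiff(numeric_cols, c(feature, exclude))
    # standardise predictors so no single large-scale column dominates the distance
    x <- scale(as.matrix(dataset[predictors]))
    target <- dataset[[feature]]
    missing <- which(is.na(target))
    observed <- which(!is.na(target))
    for (i in missing) {
        # Euclidean distance from row i to every row with an observed target
        diffs <- sweep(x[observed, , drop = FALSE], 2, x[i, ])
        d <- sqrt(rowSums(diffs^2))
        nearest <- observed[order(d)[seq_len(min(k, length(observed)))]]
        # continuous feature: average the neighbours' observed values
        target[i] <- mean(dataset[[feature]][nearest])
    }
    dataset[[feature]] <- target
    dataset
}

toy <- data.frame(radius = c(1, 2, 3, 10, 11, 12),
                  area   = c(3, 12, NA, 300, 360, NA))
knn_impute_feature(toy, "area", k = 2)$area
# 3.0 12.0 7.5 300.0 360.0 330.0
```

Row 3 borrows from its two closest rows by radius (rows 1 and 2), and row 6 from rows 4 and 5. On the tutorial dataset you would probably want to pass exclude = "id" so that the identifier is not treated as a feature.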

In [0]:

# impute missing values using knn
knn_imputation <- function(dataset, k) {
    # impute missing values using the k nearest neighbours
    # see function impute.knn from the impute package
    # note: impute.knn expects a numeric matrix, so drop non-numeric columns first
}
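
One possible shape for this function is sketched below. It relies on impute.knn(data, k = ...) returning a list whose data element holds the imputed matrix. The name knn_imputation_sketch, the exclude argument, and the toy data are assumptions for illustration. The sketch drops non-numeric columns (such as diagnosis) and columns that are entirely NA (such as X in the output above), since impute.knn needs a numeric matrix and stops when a column is mostly missing.

```r
# Sketch of KNN imputation via impute::impute.knn (hypothetical wrapper).
knn_imputation_sketch <- function(dataset, k, exclude = c("id")) {
    numeric_cols <- names(dataset)[sapply(dataset, is.numeric)]
    numeric_cols <- setdiff(numeric_cols, exclude)
    # drop columns with no observed values at all
    has_data <- colSums(!is.na(dataset[numeric_cols])) > 0
    numeric_cols <- numeric_cols[has_data]
    # impute.knn looks for neighbours among the rows of the matrix
    imputed <- impute::impute.knn(as.matrix(dataset[numeric_cols]), k = k)$data
    dataset[numeric_cols] <- as.data.frame(imputed)
    dataset
}

# only run the toy example when the impute package is installed
if (requireNamespace("impute", quietly = TRUE)) {
    toy <- data.frame(id = 1:8,
                      radius_mean = c(1, 2, 3, 4, 10, 11, 12, 13),
                      area_mean   = c(3, 12, NA, 50, 300, 360, NA, 530),
                      X = NA)
    toy_knn <- knn_imputation_sketch(toy, k = 3)
}
```

Distances here are computed on the raw feature scales, so columns with large values can dominate the choice of neighbours. Standardising the columns before imputation is a common refinement, though this notebook does not ask for it.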

Try running with several different values of k and see how that changes the correlation of the imputed values with the real values.

In [0]:

dataset_knn <- knn_imputation(dataset_na, k = 10)

In [0]:

plot_scatterplot(real = dataset$area_mean, imputed = dataset_knn$area_mean)