Data imputation tutorial:
Real-world data is rarely clean, and it is often missing observations. Rather than simply discarding incomplete examples, it is important to know how to deal with missing data. This tutorial walks through some simple imputation techniques.
First we need to read in the data.
id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave.points_mean | ⋯ | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave.points_worst | symmetry_worst | fractal_dimension_worst | X |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ⋯ | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | NA |
842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ⋯ | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | NA |
84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | ⋯ | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | NA |
84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | ⋯ | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | NA |
84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | ⋯ | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | NA |
843786 | M | 12.45 | 15.70 | 82.57 | 477.1 | 0.12780 | 0.17000 | 0.1578 | 0.08089 | ⋯ | 23.75 | 103.40 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.12440 | NA |
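As a rough sketch of the read-in step, the snippet below parses a few of the rows shown above from an inline string so that it runs without any file. The variable names `csv_text` and `cancer` are assumptions, and only four of the columns are included; with the real data set you would pass the path of your CSV file to `read.csv` instead.

```r
# A few rows and columns copied from the table above (not the full data set).
csv_text <- "id,diagnosis,radius_mean,area_mean
842302,M,17.99,1001.0
842517,M,20.57,1326.0
84300903,M,19.69,1203.0"

# read.csv() accepts inline text through the `text` argument.
cancer <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(cancer)    # column types
head(cancer)   # first rows
```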
Next, we introduce missing values into the feature "area_mean".
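One way this step might look is sketched below. It uses a small sample built from the rows shown above, and the seed, the number of hidden values, and the names `true_area` and `missing_idx` are all assumptions. The key idea is to keep a copy of the real values before overwriting them with `NA`, so that each imputation method can be evaluated later.

```r
# Small illustrative sample built from the rows shown above (not the full data set).
cancer <- data.frame(
  id          = c(842302, 842517, 84300903, 84348301, 84358402, 843786),
  radius_mean = c(17.99, 20.57, 19.69, 11.42, 20.29, 12.45),
  area_mean   = c(1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1)
)

set.seed(42)                                    # assumed seed, for reproducibility
true_area   <- cancer$area_mean                 # keep the real values for later comparison
missing_idx <- sample(nrow(cancer), size = 2)   # assumption: hide 2 of the 6 values
cancer$area_mean[missing_idx] <- NA

sum(is.na(cancer$area_mean))                    # number of missing values introduced
```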
Mean imputation:
The first imputation strategy we are going to implement is mean imputation. This involves estimating the missing values using the mean of the observed values. Starter code for this function has been provided below:
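A minimal sketch of what such a function might look like is given here. It is not the tutorial's own starter code, and the name `impute_mean` is an assumption. The example vector uses "area_mean" values from the table above with two of them hidden.

```r
# Hypothetical helper: replace every NA in a numeric vector with the mean
# of the observed (non-missing) values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# area_mean values from the table above, with two of them set to NA
area <- c(1001.0, NA, 1203.0, 386.1, NA, 477.1)
impute_mean(area)
```

Both missing entries receive the same value, the mean of the four observed ones, which is why mean imputation shrinks the variance of the feature.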
Next we want to visualize how well this method works compared to the real values. To do this, we want to generate a scatterplot of the imputed values vs the real values and calculate the correlation between the two. A function has been written to do this for you. You can use it as below or challenge yourself and generate your own plot!
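If you would like to write your own version, a sketch along these lines may be a starting point. The helper name `plot_imputation` and the example vectors are assumptions (the "imputed" values are made up purely to exercise the function), and it uses base R graphics and Pearson correlation.

```r
# Hypothetical helper: scatterplot of imputed vs. real values,
# returning their Pearson correlation invisibly.
plot_imputation <- function(real, imputed, main = "Imputed vs. real values") {
  r <- suppressWarnings(cor(real, imputed))   # NA if either vector is constant
  plot(real, imputed, xlab = "Real value", ylab = "Imputed value", main = main)
  abline(a = 0, b = 1, lty = 2)               # perfect imputation would lie on this line
  invisible(r)
}

real    <- c(1326.0, 1297.0, 386.1, 477.1)    # real area_mean values from the table
imputed <- c(1203.0, 1001.0, 477.1, 386.1)    # made-up imputed values, for illustration only
r <- plot_imputation(real, imputed)
r
```

Note that if only the imputed entries are compared, mean imputation gives every one of them the same value, so the correlation is undefined and `cor` returns `NA`.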
Random imputation:
Next we are going to implement random imputation. This method randomly samples from the observed data to fill in the missing values.
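A minimal sketch of this idea follows, assuming sampling with replacement; the function name `impute_random` and the seed are assumptions.

```r
# Hypothetical helper: fill each NA with a value drawn at random
# (with replacement) from the observed values of the same vector.
impute_random <- function(x) {
  observed  <- x[!is.na(x)]
  n_missing <- sum(is.na(x))
  # Sample positions rather than calling sample(observed, ...) directly,
  # because sample() treats a single number n as the range 1:n.
  x[is.na(x)] <- observed[sample.int(length(observed), n_missing, replace = TRUE)]
  x
}

set.seed(1)                                       # assumed seed, for reproducibility
area   <- c(1001.0, NA, 1203.0, 386.1, NA, 477.1) # area_mean values from the table, two hidden
filled <- impute_random(area)
filled
```

Unlike mean imputation, this keeps the spread of the feature roughly intact, but each individual imputed value is unrelated to the row it fills.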
K-Nearest Neighbours:
A third approach to data imputation is to use a machine learning model to predict the missing values. This is done by treating the feature with missing data as the response and using the known values of the other features as predictors. One effective method is k-nearest neighbours (KNN). For each missing value, the algorithm finds the k data points nearest to that observation and assigns the majority class among them if the feature is categorical, or the average of their values if the feature is continuous.
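To make the mechanics concrete, here is a hand-rolled base R sketch for a continuous feature. It is not caret's `preProcess` implementation. The helper name `impute_knn`, the choice of Euclidean distance on standardized predictors, and the choice of "radius_mean" and "perimeter_mean" as predictors are all assumptions.

```r
# Hypothetical helper: impute NAs in `target` with the mean of the k nearest
# rows, using Euclidean distance on the standardized predictor columns.
impute_knn <- function(data, target, predictors, k = 3) {
  X <- scale(as.matrix(data[, predictors, drop = FALSE]))  # common scale for all predictors
  y <- data[[target]]
  observed <- which(!is.na(y))
  for (i in which(is.na(y))) {
    # Distance from row i to every row whose target value is known
    d <- sqrt(rowSums(sweep(X[observed, , drop = FALSE], 2, X[i, ])^2))
    neighbours <- observed[order(d)[seq_len(min(k, length(observed)))]]
    y[i] <- mean(data[[target]][neighbours])   # continuous feature: average the neighbours
  }
  y
}

# Rows from the table above, with area_mean hidden in rows 2 and 4
cancer <- data.frame(
  radius_mean    = c(17.99, 20.57, 19.69, 11.42, 20.29, 12.45),
  perimeter_mean = c(122.80, 132.90, 130.00, 77.58, 135.10, 82.57),
  area_mean      = c(1001.0, NA, 1203.0, NA, 1297.0, 477.1)
)
filled <- impute_knn(cancer, "area_mean", c("radius_mean", "perimeter_mean"), k = 2)
filled
```

For a categorical feature, the averaging line would be replaced by a majority vote over the neighbours' classes.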
Check out the following documentation: https://www.rdocumentation.org/packages/caret/versions/6.0-79/topics/preProcess
Try running the imputation with several different values of k and see how that changes the correlation between the imputed values and the real values.
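One way to organize such an experiment is sketched below on synthetic data, where the hidden values are known exactly. Everything here is an assumption made for illustration: the seed, the simulated predictor and response, the candidate values of k, and the compact one-predictor helper `knn_predict`. It does not use caret.

```r
set.seed(123)                              # assumed seed, for reproducibility
n     <- 200
x     <- runif(n, 5, 30)                   # synthetic predictor
truth <- pi * x^2 + rnorm(n, sd = 50)      # synthetic response with noise
miss  <- sample(n, 40)                     # hide 40 of the 200 responses
obs   <- setdiff(seq_len(n), miss)

# Average of the k observed responses whose predictor is closest to x0
knn_predict <- function(x0, k) {
  mean(truth[obs][order(abs(x[obs] - x0))[seq_len(k)]])
}

ks   <- c(1, 3, 5, 10, 25, 50)
cors <- sapply(ks, function(k) {
  imputed <- sapply(x[miss], knn_predict, k = k)
  cor(truth[miss], imputed)                # correlation of imputed vs. real values
})
data.frame(k = ks, correlation = round(cors, 3))
```

Very small k tends to follow the noise in individual neighbours, while very large k averages over points that are no longer similar, so it is worth scanning a range rather than fixing one value.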