Introduction
In some datasets, we have truncated or qualified values. These may arise for a few reasons:
One might be that the measurement device has an upper- or lower-bound limit of detection.
Another might be protocol reasons, such as not subjecting an animal to a condition beyond a pre-defined "ethical" limit.
This results in data for which a subset of values are actual measurements, while the complementary set are recorded as the upper-bound or lower-bound value.
For machine learning purposes, how do we deal with these bounds? One approach might be to treat it as a two-stage ML problem:
In the first stage, predict whether the value is beyond our bounds or not.
In the second stage, predict the actual value for those real-valued measurements.
However, this means we lose the rich information stored in real-valued numbers. Perhaps there could be another way of approaching the problem?
Qualified Imputation
In this notebook, I want to explore what imputation of qualified values would look like. In particular, I am choosing a parametric strategy, in which I impose a prior distribution on the data, optimize the parameters of that distribution to best fit the data, and finally draw numbers from the fitted distribution to impute the truncated values, so that we remain in a regression setting.
Generate Data
First off, let's start with simulated data drawn from a standard normal distribution.
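A minimal sketch of one way to simulate this (the sample size and seed here are my choices, not necessarily the original ones):

```python
import numpy as np

# Draw from a standard Normal distribution.
rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=1000)
```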
Let us now truncate the data such that any value above 2 is set to 2.
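One way to apply that truncation:

```python
# Any value above the truncation point of 2 gets recorded as exactly 2.
truncation_point = 2.0
data_trunc = np.where(data > truncation_point, truncation_point, data)
```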
Here, data_trunc represents our actual measured data. The data come from a standard Normal distribution, but there is a truncation point (which is usually known in practice), and hence an inflation of values exactly at that point. We can model this inflation using a mixture model.
Inferring Distributional Parameters
Let us see if we are able to infer the standard Normal distribution parameters from data_trunc.
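One way to set this model up in PyMC is sketched below: the likelihood factors into the count of observations sitting exactly at the bound (a Binomial with probability 1 - p) and the continuous values below it (a Normal truncated above at the bound). The priors, sampler settings, and variable names are my assumptions, not necessarily what was used originally.

```python
import arviz as az
import pymc as pm

at_bound = data_trunc == truncation_point   # the inflated spike at the bound
values = data_trunc[~at_bound]              # the real-valued measurements

with pm.Model() as normal_model:
    mu = pm.Normal("mu", mu=0.0, sigma=5.0)
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    # p: proportion of values drawn from the truncated standard Normal.
    p = pm.Beta("p", alpha=1.0, beta=1.0)

    # Count of observations recorded exactly at the truncation point.
    pm.Binomial("n_at_bound", n=len(data_trunc), p=1 - p, observed=at_bound.sum())

    # The remaining values follow a Normal truncated above at the bound.
    pm.TruncatedNormal(
        "obs", mu=mu, sigma=sigma, upper=truncation_point, observed=values
    )

    # Store pointwise log-likelihoods so we can compute WAIC later.
    idata_normal = pm.sample(
        2000, tune=1000, random_seed=42, idata_kwargs={"log_likelihood": True}
    )

az.summary(idata_normal, var_names=["mu", "sigma", "p"])
```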
Here, with a very obvious spike at the upper truncation point, we are able to accurately recover mu, come very close on sigma, and are very accurate on p, the proportion of values drawn from the truncated standard Normal distribution.
We can also try other distributions with (-inf, +inf) support and perform model comparison between the candidates.
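As a sketch of what that might look like, here is a StudentT alternative with the same mixture structure, using pm.Truncated (available in recent PyMC versions); the priors are again my assumptions, and WAIC is computed on the continuous part of the likelihood:

```python
with pm.Model() as studentt_model:
    nu = pm.Exponential("nu", lam=1 / 10)
    mu = pm.Normal("mu", mu=0.0, sigma=5.0)
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    p = pm.Beta("p", alpha=1.0, beta=1.0)

    pm.Binomial("n_at_bound", n=len(data_trunc), p=1 - p, observed=at_bound.sum())

    # Same structure as before, but the continuous part uses a StudentT
    # likelihood truncated at the bound.
    pm.Truncated(
        "obs",
        pm.StudentT.dist(nu=nu, mu=mu, sigma=sigma),
        upper=truncation_point,
        observed=values,
    )

    idata_studentt = pm.sample(
        2000, tune=1000, random_seed=42, idata_kwargs={"log_likelihood": True}
    )

# Compare the two likelihoods on the continuous part of the data via WAIC.
waic_normal = az.waic(idata_normal, var_name="obs")
waic_studentt = az.waic(idata_studentt, var_name="obs")
```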
Through model comparison, we see that the Normal distribution's WAIC is lower than the StudentT's, but only marginally. For all practical purposes, I would adjudicate both to be equally good. Because the Normal distribution has fewer parameters to worry about, I would select it as the imputation distribution.
Create a new array with imputed values
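One way to do the imputation, using posterior point estimates and drawing replacements for the bound values from the fitted Normal restricted to the region beyond the truncation point (a modelling choice I am assuming here):

```python
from scipy import stats

# Use posterior means as point estimates for the imputation distribution.
mu_hat = idata_normal.posterior["mu"].mean().item()
sigma_hat = idata_normal.posterior["sigma"].mean().item()

# Values at the bound are replaced with draws from the fitted Normal,
# restricted to the region beyond the truncation point.
a = (truncation_point - mu_hat) / sigma_hat  # truncnorm uses standardized bounds
tail = stats.truncnorm(a=a, b=np.inf, loc=mu_hat, scale=sigma_hat)

data_imputed = data_trunc.copy()
data_imputed[at_bound] = tail.rvs(size=at_bound.sum(), random_state=42)
```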
We don't recover the exact distribution, but something close enough for imputation purposes.
Now, we can try our machine learning task on the imputed data, and compare how a model performs seeing imputed vs. original data.
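The details of the downstream regression task aren't specified above, so the sketch below uses a placeholder feature matrix and a generic scikit-learn regressor purely to illustrate the comparison; the features, model, and evaluation choices are all assumptions.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Placeholder feature matrix; in the real task this would be the measured
# covariates that accompany the truncated variable.
X = rng.normal(size=(len(data_trunc), 5))

model = RandomForestRegressor(n_estimators=100, random_state=42)

# Variance explained (R^2) when the model sees imputed vs. original targets.
r2_imputed = cross_val_score(model, X, data_imputed, cv=5, scoring="r2").mean()
r2_original = cross_val_score(model, X, data, cv=5, scoring="r2").mean()
```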
A variance explained of 0.7 means that the residuals account for only about 30% of the variance in the data. For a model fit to synthetic data, that is poor, but for biological data, that would be pretty darn amazing.
In a practical setting where any predicted number above the truncation point carries only a single meaning ("beyond the truncation point"), this method allows us to stay within a regression context.
One has to take care when interpreting predicted values beyond the truncation point. Because of the imputation, any predicted value beyond the truncation point can only be treated as "possibly beyond the truncation point", and it carries the same meaning as any other data point beyond the truncation point - "high".