Path: blob/master/incubator/hierarchical-linreg-vectorized.ipynb
Introduction
We have two learning tasks that involve the same kind of input data, but whose samples are not aligned: each task has a different number of i.i.d. samples, and the sample sets do not necessarily overlap. One other assumption baked into this model is that each task gets its own set of weights, but those weights are drawn from a shared parental prior. Hence there is parameter sharing amongst the learning tasks, though not in our usual "classical" sense.
By appending zero-padding, we should be able to generalize this to multi-task neural network learning with non-overlapping samples. Thomas Wiecki has a great blog post on how to do it, though he didn't deal with the unequal-number-of-samples issue, which I have tried to add here.
These are the true weights of the system.
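The original weight-generating cell isn't reproduced here; a minimal sketch of the idea, with `n_tasks`, `n_feats`, and the noise scales as assumed illustrative values, might look like:

```python
import numpy as np

np.random.seed(42)

n_tasks, n_feats = 2, 3  # assumed sizes for illustration

# Parental (population-level) weights shared across tasks.
w_parent = np.random.normal(loc=0.0, scale=1.0, size=n_feats)

# Per-task weights drawn around the parental weights.
true_w = w_parent + np.random.normal(loc=0.0, scale=0.3, size=(n_tasks, n_feats))
```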
We are now going to attempt to learn them in a Bayesian fashion.
Let's now generate the $y$s. As long as they are set to zero wherever the inputs are also set to zero, we should be in an OK regime.
By the definition of the math at hand, the predictions for those rows will be zero, because there is no information to propagate forward (the inputs are set to zero), so in this simulated setting we are OK.
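Concretely, a sketch of what that padded data generation could look like, continuing the names from the sketch above (`n_samples` and `noise_sd` are assumed values):

```python
# Unequal sample counts per task; pad every task up to the longest.
n_samples = [50, 80]          # assumed per-task sample sizes
max_n = max(n_samples)

X = np.zeros((n_tasks, max_n, n_feats))
y = np.zeros((n_tasks, max_n))
noise_sd = 0.1                # assumed observation noise

for t, n in enumerate(n_samples):
    X[t, :n, :] = np.random.normal(size=(n, n_feats))
    y[t, :n] = X[t, :n, :] @ true_w[t] + np.random.normal(scale=noise_sd, size=n)

# Rows beyond each task's n stay all-zero in both X and y.
```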
We are now going to write a hierarchical linear regression model that handles this particular case of imbalanced sample counts.
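A hedged sketch of how such a model might look, assuming PyMC3 and the arrays from the sketches above (the variable names are mine). The key point is that the all-zero padded rows contribute only a weight-independent term to the likelihood, which is why the padding is harmless to the weight estimates:

```python
import pymc3 as pm

with pm.Model() as model:
    # Parental (population-level) prior from which each task's weights are drawn.
    w_pop = pm.Normal("w_pop", mu=0.0, sd=1.0, shape=n_feats)
    sd_pop = pm.HalfCauchy("sd_pop", beta=1.0)

    # One weight vector per task, tied together through the parental prior.
    w = pm.Normal("w", mu=w_pop, sd=sd_pop, shape=(n_tasks, n_feats))

    # Vectorized prediction across all tasks at once:
    # (n_tasks, max_n, n_feats) * (n_tasks, 1, n_feats), summed over features.
    y_est = (X * w[:, None, :]).sum(axis=-1)

    sigma = pm.HalfCauchy("sigma", beta=1.0)
    pm.Normal("y_obs", mu=y_est, sd=sigma, observed=y)

    trace = pm.sample(2000, tune=1000)
```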
If we are able to recover the original weights, then zero-padding could be a very powerful technique for handling multiple learning tasks with unequal numbers of samples and non-overlapping sample indices.
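One way to check the recovery, assuming the `trace` and `true_w` from the sketches above:

```python
# Compare per-task posterior mean weights against the true weights.
w_post = trace["w"].mean(axis=0)   # shape: (n_tasks, n_feats)
print(np.round(w_post, 2))
print(np.round(true_w, 2))
```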
We're close!
OK! I think this works. Setting null values to zero in both the input and output sets guarantees that no information propagates forwards or backwards through those rows, and helps us get around the problem of sparsity in the input dimensions. The tensorification of the multiple linear regression tasks makes this fast, and the hierarchical structure binds them together.
Now, in this simulated situation, we had a pretty confident prior on the model structure. Naturally, in a real data science setting, we don't expect this to be the case, but given related tasks and a standardized featurization of the inputs, this should be a pretty good prior.