GitHub Repository: AllenDowney/bayesian-analysis-recipes
Path: blob/master/incubator/hierarchical-linreg-vectorized.ipynb
Kernel: bayesian

Introduction

We have two (or more) learning tasks that involve the same kind of input data but whose samples are not aligned: each task has a different number of i.i.d. samples, and the samples do not necessarily overlap across tasks. One further assumption baked into the model is that each task gets its own set of regression weights, but those weights are all drawn from a shared parental prior. There is therefore parameter sharing across the learning tasks, though not in the usual "classical" sense of identical weights.
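Written out explicitly (my notation, not from the notebook), the generative model being assumed is roughly:

w_parent ~ Normal(mu_0, sigma_0)             # shared parental prior
w_t ~ Normal(w_parent, sigma_w)              # one weight vector per task t
y_{t,i} ~ Normal(x_{t,i} . w_t, sigma)       # for each sample i in task t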

By zero-padding, we should be able to generalize this to multi-task (neural network) learning with non-overlapping samples. Thomas Wiecki has a great blog post on how to do the hierarchical part, though he didn't deal with the "number of samples" issue, which is what I try to add here; a sketch of how the padding itself might look follows below.
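For real data, the padding step might look like the sketch below (zero_pad_tasks is a hypothetical helper, not part of this notebook): each task's design matrix and targets, with their own sample counts, get stacked into dense tensors, with the missing rows filled with zeros in both X and y.

import numpy as np

def zero_pad_tasks(task_Xs, task_ys):
    # task_Xs: list of (n_i, n_weights) design matrices, one per task,
    # each with its own number of samples n_i; task_ys: list of (n_i,) targets.
    n_tasks = len(task_Xs)
    n_weights = task_Xs[0].shape[1]
    max_n = max(X_t.shape[0] for X_t in task_Xs)
    X_pad = np.zeros((n_tasks, max_n, n_weights))
    y_pad = np.zeros((n_tasks, max_n))
    for t, (X_t, y_t) in enumerate(zip(task_Xs, task_ys)):
        X_pad[t, : len(y_t), :] = X_t
        y_pad[t, : len(y_t)] = y_t
    return X_pad, y_pad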

import pymc3 as pm
import theano.tensor as tt
import numpy as np
# We start with parental weights of length 4, one for each feature.
parental_weights = np.random.normal(loc=10, scale=3, size=(4,))
parental_weights
# We'll now generate new weights with location=parental_weights and scale=3,
# one set per learning task (here there are 50 tasks in total).
n_samps = 100
n_tasks = 50
n_weights = 4
child_weights = np.random.normal(loc=parental_weights, scale=3, size=(n_tasks, n_weights))
child_weights

These are the true weights of the system.

We are now going to attempt to learn them in a Bayesian fashion.

data = np.random.normal(loc=3, scale=4, size=(n_tasks, n_samps, n_weights))
# Apply a mask that nulls out about 70% of the sample rows, replacing them
# with zeros. The mask has shape (n_samps, n_weights) and broadcasts over the
# task axis, so the same sample positions are zeroed in every task.
null_mask = np.repeat(np.random.binomial(n=1, p=0.3, size=(n_samps, 1)), n_weights, axis=1)
data = data * null_mask
data.shape, child_weights.shape

Let's now generate the y values. As long as they are set to zero wherever the inputs are also set to zero, we should be in an OK regime.

By the definition of the dot product, the noiseless part of y is automatically zero at the masked positions, because there is no X information to propagate forward (those inputs are zero). The observation noise added below means the observed values there are not exactly zero, but since the corresponding predictions are identically zero regardless of the weights, those slots can only inform the noise scale, not the weights, so in this simulation setting we are OK.

# y[i, j] = dot(data[i, j, :], child_weights[i, :]): each task i uses its own weights.
y = np.einsum("ijk, ik -> ij", data, child_weights)
y = y + np.random.normal(loc=0, scale=3, size=y.shape)
y.shape
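As a quick check (my addition, not in the original notebook), we can verify that the noiseless predictions really do vanish at the masked sample positions:

mu_true = np.einsum("ijk, ik -> ij", data, child_weights)
# Every column corresponding to a masked sample should be identically zero.
assert np.allclose(mu_true[:, null_mask[:, 0] == 0], 0)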

We are now going to write a hierarchical linear regression model that handles this particular case of imbalanced numbers of samples.

If we can recover the original weights, then zero-padding could be a very powerful technique for handling multiple learning tasks that have unequal numbers of samples and non-overlapping sample indices.

# Sanity checks on shapes before building the model.
data.shape
child_weights.shape
tt.batched_dot(data, child_weights)
y
with pm.Model() as hierarchical_linear_model:
    w_parent = pm.Normal("w_parent", mu=0, sd=1, shape=(n_weights,))
    # Broadcasting mu=w_parent against shape=(n_tasks, n_weights) gives each
    # task its own set of n_weights child weights, centered on w_parent.
    w_child = pm.Normal("w_child", mu=w_parent, sd=1, shape=(n_tasks, n_weights))
    sd = pm.HalfCauchy("sd", beta=10)
    # mu[i, j] = dot(data[i, j, :], w_child[i, :]); batched_dot performs the
    # per-task dot products in one vectorized operation.
    mu = pm.Deterministic("mu", tt.batched_dot(data, w_child))
    like = pm.Normal("like", mu=mu, sd=sd, observed=y)
with hierarchical_linear_model:
    # trace = pm.sample(2000, cores=1)
    approx = pm.fit(100000)
    trace = approx.sample(2000)
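One optional check (my addition): pm.fit returns an approximation object whose hist attribute records the ADVI objective over iterations, so we can eyeball convergence before trusting the posterior samples.

import matplotlib.pyplot as plt

plt.plot(approx.hist)
plt.xlabel("iteration")
plt.ylabel("negative ELBO")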
trace["w_child"]
trace["w_parent"].mean(axis=0)
parental_weights

We're close!
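To put a number on "close", here's a small check I've added comparing the posterior-mean parent weights against the true ones:

np.abs(trace["w_parent"].mean(axis=0) - parental_weights)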

trace["w_child"].mean(axis=0)
child_weights
import matplotlib.pyplot as plt

plt.scatter(child_weights.ravel(), trace["w_child"].mean(axis=0).ravel())
plt.xlabel("actual weight value")
plt.ylabel("fitted weight value")
trace["w_child"].std(axis=0)

OK! I think that this works. Setting null values to zero in both the inputs and the outputs guarantees that no information propagates forward or backward through those slots, and it gets us around the problem of sparsity in the input dimensions. The tensorification of the multiple linear regression tasks makes this fast, and the hierarchical prior binds the tasks together.

Now, in this simulated situation we knew the model structure with high confidence. Naturally, in a real data science setting we don't expect that to be the case, but given related tasks and a standardized featurization of the inputs, this should still be a pretty good prior.