Gaussian Mixtures
Sometimes, our data look like they are generated by a "mixture" model. What do we mean by that? In statistics land, it means we believe that there are "mixtures" of subpopulations generating the data that we observe. A common activity, then, is to estimate the subpopulation parameters.
Let's illustrate the point by generating some simulated data.
We will start by generating a mixture distribution composed of unit-width Gaussians (i.e. $\sigma = 1$) that are slightly overlapping.
Just to reiterate the point, one of the Gaussian distributions has a mean at 0, and the other has a mean at 3. Both subpopulations are present in equal proportions in the larger population, i.e. they have equal weighting.
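As a rough sketch of what that simulation might look like (the sample sizes and random seed below are my own choices, not necessarily what the original code cells used):

```python
import numpy as np

# Assumed illustration: equal numbers of draws from two unit-width Gaussians,
# one centered at 0 and the other at 3 (sample sizes and seed are arbitrary).
np.random.seed(42)
comp1 = np.random.normal(loc=0, scale=1, size=1000)
comp2 = np.random.normal(loc=3, scale=1, size=1000)
data = np.concatenate([comp1, comp2])
np.random.shuffle(data)
```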
Let's see if we can use PyMC3 to recover those parameters. Since we know that there are two mixture components, we can encode this in the model.
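A minimal sketch of such a model, using PyMC3's `NormalMixture` on the simulated `data` from above (the priors and variable names here are my assumptions, not necessarily those in the original notebook):

```python
import numpy as np
import pymc3 as pm

with pm.Model() as gaussian_mixture:
    # Mixture weights for the two components.
    w = pm.Dirichlet("w", a=np.ones(2))
    # Priors on the component means and standard deviations.
    mu = pm.Normal("mu", mu=0, sigma=5, shape=2)
    sigma = pm.HalfNormal("sigma", sigma=2, shape=2)
    # Two-component Gaussian mixture likelihood over the simulated data.
    obs = pm.NormalMixture("obs", w=w, mu=mu, sigma=sigma, observed=data)
    trace = pm.sample(2000)
```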
Now, sometimes, in our final population, one sub-population is present at a lower frequency than the other sub-population. Let's try to simulate that.
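For example, the simulation might be skewed like this (the 80/20 split below is my own choice for illustration; the original may have used a different ratio):

```python
# Assumed illustration: the second component contributes only ~20% of the data.
np.random.seed(42)
comp1 = np.random.normal(loc=0, scale=1, size=1600)
comp2 = np.random.normal(loc=3, scale=1, size=400)
data_unbalanced = np.concatenate([comp1, comp2])
np.random.shuffle(data_unbalanced)
```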
This is really good. We have fewer samples for the group with the smaller mixture weight, which means that we are much less confident about the values of its $\mu$ and $\sigma$. What's neat is that we are nonetheless just as confident about the relative weighting of the two groups: one is much smaller in proportion than the other!
Generalized Mixtures
We used Gaussian (a.k.a. Normal) distributions for generating the data. However, what if the data didn't come from a Gaussian distribution, but instead came from two Poissons?
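The Poisson data might be simulated along these lines (the rates and sample sizes below are made up for this sketch):

```python
# Assumed illustration: two Poisson subpopulations with different rates.
np.random.seed(42)
pois1 = np.random.poisson(lam=2, size=500)
pois2 = np.random.poisson(lam=10, size=500)
poisson_data = np.concatenate([pois1, pois2])
np.random.shuffle(poisson_data)
```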
It worked! There was one minor detail that I had to learn from Junpeng Lao, who answered my question on the PyMC3 discourse site: we have to use the `pm.Poisson.dist(...)` syntax, rather than the `pm.Poisson(...)` syntax, when specifying the mixture components.
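To make that distinction concrete, here is a sketch of how the mixture likelihood might be written (the priors and variable names are my assumptions), with the components passed as unnamed `.dist(...)` objects:

```python
import numpy as np
import pymc3 as pm

with pm.Model() as poisson_mixture:
    # Mixture weights for the two components.
    w = pm.Dirichlet("w", a=np.ones(2))
    # Priors on the two Poisson rates.
    lam = pm.Exponential("lam", lam=0.1, shape=2)
    # The components are *unnamed* distribution objects built with .dist(...),
    # not pm.Poisson(...) random variables registered in the model.
    components = [pm.Poisson.dist(mu=lam[0]), pm.Poisson.dist(mu=lam[1])]
    obs = pm.Mixture("obs", w=w, comp_dists=components, observed=poisson_data)
    trace = pm.sample(2000)
```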
Now, what if we had far fewer data points? How would our confidence levels change?
At roughly 100 data points, it's still not too hard to tell. What if we had even fewer data points?
Model identifiability problems come into play: the `lam` parameters are very hard to estimate with little data. Moral of the story: get more data 😃
Let's try Weibull distribution mixtures.
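A rough sketch of how this could look, again using `pm.Mixture` with `.dist(...)` components (the simulated shape/scale values, priors, and variable names below are my own assumptions for illustration):

```python
import numpy as np
import pymc3 as pm

# Assumed illustration: simulate two Weibull subpopulations
# (numpy's weibull draws have scale 1, so multiply by a scale factor).
np.random.seed(42)
weibull_data = np.concatenate([
    np.random.weibull(a=1.5, size=500) * 2.0,  # shape 1.5, scale 2
    np.random.weibull(a=5.0, size=500) * 6.0,  # shape 5, scale 6
])

with pm.Model() as weibull_mixture:
    w = pm.Dirichlet("w", a=np.ones(2))
    # Priors on the Weibull shape (alpha) and scale (beta) parameters.
    alpha = pm.HalfNormal("alpha", sigma=3, shape=2)
    beta = pm.HalfNormal("beta", sigma=5, shape=2)
    components = [
        pm.Weibull.dist(alpha=alpha[0], beta=beta[0]),
        pm.Weibull.dist(alpha=alpha[1], beta=beta[1]),
    ]
    obs = pm.Mixture("obs", w=w, comp_dists=components, observed=weibull_data)
```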
Parameters are recovered! It was hard-won, though: sampling with NUTS takes a long time (but we get very good samples), and I had to experiment with ADVI init vs. auto init before finding empirically that ADVI init works better.
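For reference, requesting the ADVI initialization is just a matter of the `init` argument to `pm.sample` (shown here against the hypothetical `weibull_mixture` model sketched above; the draw and tune counts are illustrative):

```python
with weibull_mixture:
    # Initialize NUTS from ADVI instead of the default "auto"
    # (which resolves to jitter+adapt_diag in PyMC3).
    trace = pm.sample(2000, tune=2000, init="advi")
```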