Path: blob/master/incubator/dirichlet-multinomial-sequencing-read-counts.ipynb
411 views
A Simple Statistical Model for Picking Most Abundant Sequences
by Eric J. Ma and Arkadij Kummer
Introduction
In DNA sequencing, when we look at the raw sequencing data, we might see that a sample contains a multiple of sequences, each with different read counts. If we are to use this data in machine learning, one modelling choice we usually need make is to pick one of those sequences as the representative one, or else omit the data from our downstream model if we deem it to be of insufficient quality.
By insufficient quality, we usually refer to one of the following scenarios:
Low read counts.
Mixture of sequences with very similar read counts.
Both of the above.
A common practice in the DNA sequencing world is to pick a threshold (e.g. discard everything with less than 10 read counts). We wondered whether there might be a statistically-principled way for us to identify for which samples we could "just pick the highest value", and for which samples we need to omit from downstream ML purposes.
Example DNA Sequencing Data
As an example, we may have sequencing data that look like the following:
Sequence | Count |
---|---|
AGGAT... | 17 |
AGGTT... | 3 |
ACGAT... | 121 |
This data can be cast in an "urn problem". From a generative model standpoint, we consider sequences to be present at a range of proportions in an urn, and the DNA sequencer samples them, giving rise to the read count data that we observe. The number of categories (i.e. unique sequences) varies per urn (i.e. a DNA sample sent for sequencing). An appropriate statistical model here for each DNA sample is the Dirichlet-Multinomial model, which has a number of nice properties that we can take advantage of:
The multinomial distribution story gives sample counts from groups of data.
The Dirichlet distribution is the conjugate prior for the multinomial distribution probability parameter , meaning that simple addition arithmetic can be used to compute the Dirichlet posterior.
Dirichlet distribution marginals for any of the categories is Beta-distributed, which is also analytically computable.
Before we go on to the problem, let's take a very quick and cursory look that powers this, just as a reminder.
Dirichlet-Multinomial Bayesian Updating
Very briefly, if we have the following prior distribution
which models the probability of observing each of groups, and we observe actual counts for the groups
Then the posterior distribution of the s is
Moreover, the marginal distribution of each of the probabilities is given by a Beta distribution
Taking advantage of conjugacy means we can avoid doing MCMC, and simply turn to arithmetic instead.
Here are a few examples of posterior updating with a prior distribution.
How might we devise a rule for whether or not to pick the top sequence?
We could look at a particular percentage interval of the posterior distribution of the top count, and ask if it overlaps with the rest of the posterior distributions on the same percentage interval.
For example, let's see if the 99% interval of the scenario overlaps (if it's not obvious from above, it should).