Kullback-Leibler Divergence
In this post we're going to take a look at a way of comparing two probability distributions called Kullback-Leibler divergence (a.k.a. KL divergence). Very often in machine learning, we'll replace observed data or a complex distribution with a simpler, approximating distribution. KL divergence helps us measure just how much information we lose when we choose an approximation, so we can even use it as our objective function to pick which approximation works best for the problem at hand.
Let's look at an example. (The example here is borrowed from the following blog post: Kullback-Leibler Divergence Explained.)
Suppose we're a group of scientists visiting space and we discover some space worms. These space worms have varying numbers of teeth. After a decent amount of collecting, we arrive at the following empirical probability distribution of the number of teeth per worm:
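To make this concrete, here's a minimal sketch in Python of what such an empirical distribution might look like. The probabilities below are made up purely for illustration (they are not the actual observed counts); we only assume the worms have between 0 and 10 teeth.

```python
import numpy as np

# Hypothetical empirical distribution over the number of teeth (0 through 10).
# These probabilities are illustrative only, not real observations.
teeth = np.arange(11)
p_true = np.array([0.02, 0.03, 0.05, 0.14, 0.16, 0.15, 0.12, 0.08, 0.10, 0.08, 0.07])

assert np.isclose(p_true.sum(), 1.0)  # a valid probability distribution sums to 1
```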
Now we need to send this information back to earth. But the problem is that sending information from space to earth is expensive, so we want to represent the distribution with a minimal amount of information, perhaps just one or two parameters. One option for representing the distribution of teeth in worms is a uniform distribution.
Another option is to use a binomial distribution.
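Continuing from the hypothetical distribution above, here's a sketch of how the two candidate approximations could be built. The choice of the binomial's success probability by matching the mean of the observed data is an assumption made here for illustration (a method-of-moments style estimate).

```python
from scipy import stats

# Uniform approximation: every tooth count from 0 to 10 is equally likely.
q_uniform = np.full(11, 1 / 11)

# Binomial approximation: n = 10 trials, with the success probability chosen
# so that the binomial mean (n * p) matches the mean of the observed data.
n = 10
p_param = (teeth * p_true).sum() / n
q_binomial = stats.binom.pmf(teeth, n, p_param)
```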
Comparing each of our models with our original data, we can see that neither one is a perfect match, but the question now becomes: which one is better?
Given these two distributions that we are using to approximate the original distribution, we need a quantitative way to measure which one does the job better. This is where Kullback-Leibler (KL) Divergence comes in.
KL divergence has its origins in information theory. The primary goal of information theory is to quantify how much information is in our data. To recap, one of the most important metrics in information theory is called entropy, which we will denote as $H$. The entropy of a probability distribution $p$ is defined as:

$$H = -\sum_{i=1}^{N} p(x_i) \log p(x_i)$$
If we use $\log_2$ for our calculation, we can interpret entropy as the minimum number of bits it would take us to encode events drawn from the distribution $p$. Now that we have a way to quantify how much information is in our data, we can extend it to quantify how much information is lost when we replace our observed distribution with a parameterized approximation.
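As a quick sketch, the entropy formula translates directly into a few lines of NumPy (using log base 2 so the result is in bits; zero-probability entries are skipped since they contribute nothing):

```python
def entropy(p):
    """H(p) = -sum_i p(x_i) * log2(p(x_i)), ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# e.g. entropy(p_true) gives the minimum average number of bits per event
# for the hypothetical distribution defined earlier.
```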
The formula for Kullback-Leibler divergence is a slight modification of entropy. Rather than just having our probability distribution $p$, we add in our approximating distribution $q$, then we look at the difference of the log values for each:

$$D_{KL}(p||q) = \sum_{i=1}^{N} p(x_i) \big(\log p(x_i) - \log q(x_i)\big)$$
Essentially, what we're looking at with KL divergence is the expectation of the log difference between the probability of data under the original distribution and under the approximating distribution. Because we're weighting the difference between the two distributions by $p(x_i)$, matching areas where the original distribution has higher probability matters more than matching areas where it has lower probability. Again, if we think in terms of $\log_2$, we can interpret this as the number of extra bits of information we need to encode events drawn from the true distribution $p$ when using an optimal code built from the distribution $q$ rather than $p$.
The more common way to see KL divergence written is as follows:

$$D_{KL}(p||q) = \sum_{i=1}^{N} p(x_i) \log \frac{p(x_i)}{q(x_i)}$$

since $\log a - \log b = \log \frac{a}{b}$.
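A direct translation of this formula into NumPy might look like the sketch below (again in bits; terms where $p(x_i) = 0$ are skipped, since they contribute nothing to the sum). Note that scipy.stats.entropy(p, q, base=2) computes the same quantity.

```python
def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p(x_i) * log2(p(x_i) / q(x_i))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # 0 * log(0 / q) is treated as 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))
```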
If the two distributions $p$ and $q$ match perfectly, then $D_{KL}(p||q) = 0$; otherwise, the lower the KL divergence value, the better we have matched the true distribution with our approximation.
Side Note: If you're interested in understanding the relationship between entropy, cross entropy and KL divergence, the following links are good places to start. Maybe they will clear up some of the hand-wavy explanations of these concepts ... Youtube: A Short Introduction to Entropy, Cross-Entropy and KL-Divergence and StackExchange: Why do we use Kullback-Leibler divergence rather than cross entropy in the t-SNE objective function?
Given this information, we can go ahead and calculate the KL divergence for our two approximating distributions.
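Using the hypothetical distributions and the kl_divergence helper sketched above, the comparison boils down to two function calls. The exact numbers depend on the made-up probabilities, but with the illustrative values chosen earlier the binomial approximation does come out worse, which mirrors the conclusion below.

```python
# Divergence of each approximation from the (hypothetical) true distribution.
print('uniform :', kl_divergence(p_true, q_uniform))
print('binomial:', kl_divergence(p_true, q_binomial))
```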
As we can see, the information lost by using the binomial approximation is greater than that lost by using the uniform approximation. If we have to choose one to represent our observations, we're better off sticking with the uniform approximation.
To close this discussion, we used KL divergence to determine which of our approximating distributions more closely reflects the true distribution. One caveat to note is that it may be tempting to think of KL divergence as a way of measuring distance; however, we do not categorize it as a distance metric, because it is asymmetric. In other words, $D_{KL}(p||q) \neq D_{KL}(q||p)$.
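A tiny numerical check of this asymmetry, reusing the kl_divergence sketch from earlier on two made-up three-outcome distributions:

```python
# Two arbitrary, illustrative distributions over three outcomes.
p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.15, 0.05])

print(kl_divergence(p, q))  # roughly 1.93 bits
print(kl_divergence(q, p))  # roughly 2.02 bits, a different value
```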