Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place.
Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place.
Gaussian Probabilities
Introduction
The last chapter ended by discussing some of the drawbacks of the Discrete Bayesian filter. For many tracking and filtering problems our desire is to have a filter that is unimodal and continuous. That is, we want to model our system using floating point math (continuous) and to have only one belief represented (unimodal). For example, we want to say an aircraft is at (12.34, -95.54, 2389.5) where that is latitude, longitude, and altitude. We do not want our filter to tell us "it might be at (1.65, -78.01, 2100.45) or it might be at (34.36, -98.23, 2543.79)." That doesn't match our physical intuition of how the world works, and as we discussed, it can be prohibitively expensive to compute the multimodal case.
So we desire a unimodal, continuous way to represent probabilities that models how the real world works, and that is very computationally efficient to calculate. As you might guess from the chapter name, Gaussian distributions provide all of these features.
Mean, Variance, and Standard Deviations
Random Variables
To understand Gaussians we first need to understand a few simple mathematical computations. We start with a random variable x. A random variable is a variable whose value depends on some random process. If you flip a coin, you could have a variable , and assign it the value 1 for heads, and 0 for tails. That a random value. It can be the height of the students in a class. That may not seem random to you, but chances are you cannot predict the height of the student Reem Nassar because her height is not deterministically determined. For a specific classroom perhaps the heights are
Another example of a random variable would be the result of rolling a die. A less obvious example would be the position of an aircraft - the aircraft does deterministically respond to the control inputs, but it is also buffeted by random winds and travels through randomly distributed pressure gradients.
The coin toss and die roll are examples of discrete random variables. That is, the outcome of any given event comes from a discrete set of values. The roll of a six sided die can never produce a value of 7 or 3.24, for example. In contrast, the student heights are continuous; they can take on any value within biological limits. For example, heights of 1.7, 1.71, 1.711, 1.7111, 1.71111,.... are all possible. Do not confuse the measurement of the random variable with the actual value. If we can only measure the height of a person to 0.1 meters we would only record values from 0.1, 0.2, 0.3...2.7, yielding 27 discrete choices. Nonetheless a person's height can vary between any arbitrary real value between those ranges, and so height is a continuous random variable.
The Mean of a Random Variable
We want to know the average height of the students. We all know how to find the average, but let me belabor the point so I can introduce more formal notation and terminology. Another word for average is the mean. We compute the mean by summing the values and dividing by the number of values. In this case we have
In statistics we use the symbol (mu) to denote the mean, so we could write
We can formalize this computation with the equation
NumPy provides numpy.mean()
for computing the mean.
Standard Deviation of a Random Variable
The computation above tells us the average height of the students, but it doesn't tell us everything we might want to know. For example, suppose a second class has students with these heights:
the mean of these heights is also 1.8 meters, but notice that there is a much greater amount of variation in the heights in this class. Suppose a third class has heights
In this third class the average height is again 1.8 meters, but here there is no variation in the height between students.
All three classes have the same mean height of 1.8 meters. So the mean tells us something about the data, but it does not tell the whole story. We want to be able to specify how much variation there is between the heights of the students. You can imagine a number of reasons for this. Perhaps a school district needs to order 5,000 desks, and they want to be sure they buy sizes that accommodate the range of heights of the students.
Statistics has formalized this concept of measuring variation into the notion of standard deviation and variance:
The standard deviation is defined as the square root of the average of the squared differences from the mean.
That's a mouthful; as an equation this is stated as
where is the notation for the standard deviation and is the mean.
If this is the first time you have seen this it may not have a lot of meaning for you. But let's work through that with the data from the three classes to be sure we understand the formula. We subtract the mean of x from each value of x, square it, take the average of those, and then take the square root of the result. The mean of is 1.8, so we compute the standard deviation as
We can verify this computation with the NumPy method numpy.std()
which computes the standard deviation. 'std' is a common abbreviation for standard deviation.
What does the standard deviation signify? It tells us "how much" the heights vary amongst themseves. How much is not a mathematical term. We will be able to define it much more precisely once we introduce the concept of a Gaussian in the next section. For now I'll say that for many things in nature, including the height of people, 68% of all values lie within one standard deviation of the mean. In other words we can conclude that for a random class 68% of the students will have heights between 1.66 (1.8-0.1414) meters and 1.94 (1.8+0.1414) meters.
We can look at this in a plot:
For only 5 students we obviously will not get exactly 68% within one standard deviation. We do see that 3 out of 5 students are within , or 60%, which is as close as you can get to 68% with only 5 samples. I haven't yet introduced enough math or Python for you to fully understand the next bit of code, but let's look at the results for a class with 100 students.
We write one standard deviation as , which is pronounced "one standard deviation", not "one sigma". Two standard deviations is , and so on.
We can see by eye that roughly 68% of the heights lie within of the mean 1.8.
We'll discuss this in greater depth soon. For now let's compute the standard deviation for
The mean of y is m, so
We will verify that with NumPy with
This corresponds with what we would expect. There is more variation in the heights for y, and the standard deviation is larger.
Finally, let's compute the standard deviation for
There is no variation in the values, so we would expect the standard deviation to be zero. We show this to be true with
Variance of a Random Variable
Finally, the variance is defined as the square of the standard deviation. Some texts define this in the opposite way, which gives the definitions
The variance is the average of the squared differences from the mean.
The standard deviation is the square root of the variance.
Both ways of thinking about it are equivalent. We use the notation for the variance, and the equation for the variance is
To make sure we understand this let's compute the variance for :
We previously computed , and indeed the square of 0.1414 is 0.02. We can verify this computation with the NumPy function numpy.var
:
Many texts alternatively use VAR(x) to denote the variance of x.
Why the Square of the Differences
As an aside, why are we taking the square of the difference? I could go into a lot of math, but let's look at this in a simple way. Here is a chart of the values of x plotted against the mean for
If we didn't take the square of the differences the signs would cancel everything out:
This is clearly incorrect, as there is more than 0 variance in the data.
Maybe we can use the absolute value? We can see by inspection that the result is which is certainly correct - each value varies by 3 from the mean. But what if we change ? In this case we get . is clearly more spread out than in the last example, but we get the same variance, so this cannot be correct. If we use the correct formula we get a variance of 3.5, which reflects the larger variation in .
This is not a proof of correctness. Indeed, Carl Friedrich Gauss, the inventor of the technique, recognized that is is somewhat arbitrary. If there are outliers then squaring the difference gives disproportionate weight to that term. For example, let's see what happens if we have .
Is this correct? You tell me. Without the outlier of 100 we get , which accurately reflects how is varying absent the outlier. The one outlier swamps the computation. I will not continue down this path; if you are interested you might want to look at the work that James Berger has done on this problem, in a field called Bayesian robustness, or the excellent publications on robust statistics by Peter J. Huber [3]. For Kalman filters we can prove that this computation produces optimal results within certain limits. More about that soon.
Gaussians
We are now ready to learn about Gaussians. Let's remind ourselves of the motivation for this chapter.
Let's look at a graph of a Gaussian distribution to get a sense of what we are talking about.
I explain how to plot Gaussians, and much more, in the Notebook Computing_and_Plotting_PDFs in the Supporting_Notebooks folder. You can also read it online here[1]
Probably this is immediately recognizable to you as a 'bell curve'. This curve is ubiquitous because under real world conditions many observations are distributed in such a manner. In fact, this is the curve for the student heights given earlier. I will not use the term 'bell curve' to refer to a Gaussian because several probability distributions have a similar bell curve shape. Non-mathematical sources might not be so precise, so be judicious in what you conclude when you see the term used without definition.
This curve is not unique to heights - a vast amount of natural phenomena exhibits this sort of distribution, including the sensors that we use in filtering problems. As we will see, it also has all the attributes that we are looking for - it represents a unimodal belief or value as a probability, it is continuous, and it is computationally efficient. We will soon discover that it also other desirable qualities which we may not realize we desire.
To further motivate you, recall the shapes of the probability distributions in the Discrete Bayes chapter. They were not perfect Gaussian curves, but they were similar, as in the plot below. We will be using Gaussians to replace the discrete probabilities used in that chapter!
Nomenclature
A bit of nomenclature before we continue - this chart depicts the probability density of of a random variable having any value between (. What does that mean? Imagine we take an infinite number of infinitely precise measurements of the speed of automobiles on a section of highway. We could then plot the results by showing the relative number of cars going at any given speed. If the average was 120 kph, it might look like this:
The y-axis depicts the probability density - the relative amount of cars that are going the speed at the corresponding x-axis.
You may object that human heights or automobile speeds cannot be less than zero, let alone or . This is true, but this is a common limitation of mathematical modeling. "The map is not the territory" is a common expression, and it is true for Bayesian filtering and statistics. The Gaussian distribution above somewhat closely models the distribution of the measured automobile speeds, but being a model it is necessarily imperfect. The difference between model and reality will come up again and again in these filters. Gaussians are used in many branches of mathematics, not because they perfectly model reality, but because they are easier to use than any other choice. Even in this book Gaussians will fail to model reality, forcing us to computationally expensive alternative.
You will see these distributions called Gaussian distributions or normal distributions. Gaussian and normal both mean the same thing in this context, and are used interchangeably. I will use both throughout this book as different sources will use either term, so I want you to be used to seeing both. Finally, as in this paragraph, it is typical to shorten the name and talk about a Gaussian or normal - these are both typical shortcut names for the Gaussian distribution.
Gaussian Distributions
So let us explore how Gaussians work. A Gaussian is a continuous probability distribution that is completely described with two parameters, the mean () and the variance (). It is defined as:
is notation for ; we avoid using superscripts in print so that the fonts are larger and more readable.
Don't be dissuaded by the equation if you haven't seen it before; you will not need to memorize or manipulate it. The computation of this function is stored in `stats.py` with the function `gaussian(x, mean, var)`.
Optional: Let's remind ourselves how to look at a function stored in a file by using the %load magic. If you type %load -s gaussian stats.py into a code cell and then press CTRL-Enter, the notebook will create a new input cell and load the function into it.
We will plot a Gaussian with a mean of 22 , with a variance of 4 , and then discuss what this means.
So what does this curve mean? Assume for a moment that we have a thermometer, which reads 22$,^{\circ}C^{\circ}C,^{\circ}C$ is 20%? That is not quite accurate mathematically.
Recall that we said that the distribution is continuous. Think of an infinitely long straight line - what is the probability that a point you pick randomly is at, say, 2.0. Clearly 0%, as there is an infinite number of choices to choose from. The same is true for normal distributions; in the graph above the probability of being exactly 22$^{\circ}C$ is 0% because there are an infinite number of values the reading can take.
So what then is this curve? It is something we call the probability density function. The area under the curve at any region gives you the probability of those values. So, for example, if you compute the area under the curve between 20 and 22 the resulting area will be the probability of the temperature reading being between those two temperatures.
We can think of this in Bayesian terms or frequentist terms. As a Bayesian, if the thermometer reads exactly 22$,^{\circ}C,^{\circ}C$, then a histogram of the measurements would look like this curve.
So how do you compute the probability, or area under the curve? Well, you integrate the equation for the Gaussian
I wrote filterpy.stats.norm_cdf
which computes the integral for you. So, for example, we can compute
So the mean () is what it sounds like - the average of all possible probabilities. Because of the symmetric shape of the curve it is also the tallest part of the curve. The thermometer reads , so that is what we used for the mean.
Important: I will repeat what I wrote at the top of this section: "A Gaussian...is completely described with two parameters"
The standard notation for a normal distribution for a random variable is where means distributed according to. This means I can express the temperature reading of our thermometer as
This is an extremely important result. Gaussians allow me to capture an infinite number of possible values with only two numbers! With the values and I can compute the distribution of measurements for over any range.
The Variance
Since this is a probability density distribution it is required that the area under the curve always equals one. This should be intuitively clear - the area under the curve represents all possible outcomes, something happened, and the probability of something happening is one, so the density must sum to one. We can prove this ourselves with a bit of code. (If you are mathematically inclined, integrate the Gaussian equation from to )
This leads to an important insight. If the variance is small the curve will be narrow. this is because the variance is a measure of how much the samples vary from the mean. To keep the area equal to 1, the curve must also be tall. On the other hand if the variance is large the curve will be wide, and thus it will also have to be short to make the area equal to 1.
Let's look at that graphically:
So what is this telling us? The blue gaussian is very narrow. It is saying that we believe , and that we are very sure about that. In contrast, the red gaussian also believes that , but we are much less sure about that. Our believe that is lower, and so our belief about the likely possible values for is spread out - we think it is quite likely that or , for example. The blue gaussian has almost completely eliminated or as possible value - their probability is almost , whereas the red curve considers them nearly as likely as .
If we think back to the thermometer, we can consider these three curves as representing the readings from three different thermometers. The blue curve represents a very accurate thermometer, and the black one represents a fairly inaccurate one. Green of course represents one in between the two others. Note the very powerful property the Gaussian distribution affords us - we can entirely represent both the reading and the error of a thermometer with only two numbers - the mean and the variance.
It is worth spending a few words on standard deviation now. The standard deviation is a measure of how much variation from the mean exists. For Gaussian distributions, 68% of all the data falls within one standard deviation () of the mean, 95% falls within two standard deviations (), and 99.7% within three (). This is often called the 68-95-99.7 rule. So if you were told that the average test score in a class was 71 with a standard deviation of 9.4, you could conclude that 95% of the students received a score between 52.2 and 89.8 if the distribution is normal (that is calculated with ).
The following graph depicts the relationship between the standard deviation and the normal distribution.
An equivalent formation for a Gaussian is where is the mean and the precision. Here ; it is the reciprocal of the variance. While we do not use this formulation in this book, it underscores that the variance is a measure of how precise our data is. A small variance yields large precision - our measurement is very precise. Conversely, a large variance yields low precision - our belief is spread out across a large area. You should become comfortable with thinking about Gaussians in these equivalent forms. For a Bayesian Gaussians reflect our belief about a measurement, they express the precision of the measurement, and they express how much variance there is in the measurements. These are all different ways of stating the same fact.
Interactive Gaussians
For those that are reading this in IPython Notebook, here is an interactive version of the Gaussian plots. Use the sliders to modify and . Adjusting will move the graph to the left and right because you are adjusting the mean, and adjusting will make the bell curve thicker and thinner.
Finally, if you are reading this in an IPython Notebook, here is an animation of a Gaussian. First, the mean is being shifted to the right. Then the mean is centered at and the variance is modified.
Computational Properties of the Gaussian
Recall how our discrete Bayesian filter worked. We had a vector implemented as a NumPy array representing our belief at a certain moment in time. When we integrated another measurement into our belief using the update()
function we had to multiply probabilities together, and when we performed the motion step using the predict()
function we had to shift and add probabilities. I've promised you that the Kalman filter uses essentially the same process, and that it uses Gaussians instead of histograms, so you might reasonable expect that we will be multiplying, adding, and shifting Gaussians in the Kalman filter.
A typical textbook would directly launch into a multi-page proof of the behavior of Gaussians under these operations, but I don't see the value in that right now. I think the math will be much more intuitive and clear if we just start developing a Kalman filter using Gaussians. I will provide the equations for multiplying and shifting Gaussians at the appropriate time. You will then be able to develop a physical intuition for what these operations do, rather than be forced to digest a lot of fairly abstract math.
The key point, which I will only assert for now, is that all the operations are very simple, and that they preserve the properties of the Gaussian. This is somewhat remarkable, in that the Gaussian is a nonlinear function, and typically if you multiply a nonlinear equation with itself you end up with a different equation. For example, the shape of sin(x)sin(x)
is very different from sin(x)
. But the result of multiplying two Gaussians is yet another Gaussian. This is a fundamental property, and a key reason why Kalman filters are computationally feasible. Said another way, Kalman filters use Gaussians because they are so computationally nice.
Computing Probabilities with scipy.stats
In this chapter I have used custom code from FilterPy for computing Gaussians, plotting, and so on. I chose to do that to give you a chance to look at the code and see how these functions are implemented. However, Python comes with "batteries included" as the saying goes, and it comes with a wide range of statistics functions in the module scipy.stats
. I find the performance of some of the functions rather slow (the scipy.stats
documentation contains a warning to this effect), but this is offset by the fact that this is standard code available to everyone, and it is well tested. So let's walk through how to use scipy.stats to compute statistics and probabilities.
The scipy.stats
module contains a number of objects which you can use to compute attributes of various probability distributions. The full documentation for this module is here: http://http😕/docs.scipy.org/doc/scipy/reference/stats.html. However, we will focus on the norm variable, which implements the normal distribution. Let's look at some code that uses scipy.stats.norm
to compute a Gaussian, and compare its value to the value returned by the gaussian()
function from FilterPy.
The call norm(2, 3)
creates what scipy calls a 'frozen' distribution - it creates and returns an object with a mean of 2 and a standard deviation of 3. You can then use this object multiple times to get the probability density of various values, like so:
If we look at the documentation for scipy.stats.norm
here[2] we see that there are many other functions that norm provides.
For example, we can generate samples from the distribution with the rvs()
function.
We can get the cumulative distribution function (CDF), which is the probability that a randomly drawn value from the distribution is less than or equal to .
We can get various properties of the distribution:
There are many other functions available, and if you are interested I urge you to peruse the documentation. Sometimes the documentation is terse, but with a bit of googling you can find out what a function does and some examples of how to use it. Most of this functionality is not of immediate interest to the book, so I will leave the topic in your hands to explore. The SciPy tutorial [3] is quite approachable, and I suggest starting there.
Fat Tails
Earlier I spoke very briefly about the central limit theorem, which states that under certain conditions the arithmetic sum of any independent random variables will be normally distributed, regardless of how the random variables are distributed. This is extremely important for (at least) two reasons. First, nature is full of distributions which are not normal, but when we apply the central limit theorem over large populations we end up with normal distributions. Second, Gaussians are mathematically tractable. We will see this more as we develop the Kalman filter theory, but there are very nice closed form solutions for operations on Gaussians that allow us to use them analytically.
However, a key part of the proof is "under certain conditions". These conditions often do not hold for the physical world. The resulting distributions are called fat tailed. Tails is a colloquial term for the far left and right side parts of the curve where the probability density is close to zero.
Let's consider a trivial example. We think of things like test scores as being normally distributed. If you have ever had a professor 'grade on a curve' you have been subject to this assumption. But of course test scores cannot follow a normal distribution. This is because the distribution assigns a nonzero probability distribution for any value, no matter how far from the mean. So, for example, say your mean is 90 and the standard deviation is 13. The normal distribution assumes that there is a large chance of somebody getting a 90, and a small chance of somebody getting a 40. However, it also implies that there is a tiny chance of somebody getting a grade of -10, or 150. It assumes that there is a infinitesimal chance of getting a score of -1e300, or 4e50. The tails of a Gaussian distribution are infinite because Gaussians are continuous functions.
But for a test we know this is not true. Ignoring extra credit, you cannot get less than 0, or more than 100. Let's plot this using a normal distribution.
The area under the curve cannot equal 1, so it is not a probability distribution. What actually happens is that a more students than predicted by a normal distribution get scores nearer the upper end of the range (for example), and that tail becomes 'fat'. Also, the test is probably not able to perfectly distinguish incredibly minute differences in skill in the students, so the distribution to the left of the mean is also probably a bit bunched up in places. The resulting distribution is called a fat tail distribution.
Kalman filters use sensors to measure the world. The errors in sensor's measurements are rarely truly Gaussian. It is far too early to be talking about the difficulties that this presents to the Kalman filter designer. It is worth keeping in the back of your mind the fact that the Kalman filter math is based on a somewhat idealized model of the world. For now I will present a bit of code that I will be using later in the book to form fat tail distributions to simulate various processes and sensors. This distribution is called the student's t distribution.
Let's say I want to model a sensor that has some noise in the output. For simplicity, let's say the signal is a constant 10, and the standard deviation of the noise is 2. We can use the function numpy.random.randn()
to get a random number with a mean of 0 and a standard deviation of 1. So I could simulate this sensor with
Let's plot that signal and see what it looks like.
That looks like I would expect. The signal is centered around 10. A standard deviation of 2 means that 68% of the measurements will be within 2 of 10, and 99% will be within 6 of 10, and that looks like what is happening.
Now let's look at a fat tailed distribution. There are many choices, I will use the Student's T distribution. I will not go into the math, but just give you the source code for it and then plot a distribution using it.
We can see from the plot that while the output is similar to the normal distribution there are outliers that go far more than 3 standard deviations from the mean. This is what causes the 'fat tail'.
It is unlikely that the Student's T distribution is an accurate model of how your sensor (say, a GPS or Doppler) performs, and this is not a book on how to model physical systems. However, it does produce reasonable data to test your filter's performance when presented with real world noise. We will be using distributions like these throughout the rest of the book in our simulations and tests.
This is not an idle concern. The Kalman filter equations assume the noise is normally distributed, and perform sub-optimally if this is not true. Designers for mission critical filters, such as the filters on spacecraft, need to master a lot of theory and emperical knowledge about the performance of the sensors on their spacecraft.
The code for rand_student_t is included in filterpy.stats
. You may use it with
Summary and Key Points
The following points must be understood by you before we continue:
Normal distributions occur throughout nature
They express a continuous probability distribution
They are completely described by two parameters: the mean () and variance ()
is the average of all possible values
The variance represents how much our measurements vary from the mean
The standard deviation () is the square root of the variance ()
References
[1] http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html#scipy.stats.normfor
[2] http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html
[3] Huber, Peter J. Robust Statistical Procedures, Second Edition. Society for Industrial and Applied Mathematics, 1996.