Multivariate Gaussians - Modeling Uncertainty in Multiple Dimensions
Introduction
The techniques in the last chapter are very powerful, but they only work with one variable or dimension. Gaussians represent a mean and variance that are scalars - real numbers. They provide no way to represent multidimensional data, such as the position of a dog in a field. You may retort that you could use two Kalman filters from the last chapter. One would track the x coordinate and the other the y coordinate. That does work, but suppose we want to track position, velocity, acceleration, and attitude. These values are related to each other, and as we learned in the g-h chapter we should never throw away information. Through one key insight we will achieve markedly better filter performance than was possible with the equations from the last chapter.
In this chapter I will introduce you to multivariate Gaussians - Gaussians for more than one variable - and the key insight mentioned above. Then, in the next chapter we will use the math from this chapter to write a complete filter in just a few lines of code.
Multivariate Normal Distributions
In the last two chapters we used Gaussians for a scalar (one dimensional) variable, expressed as $\mathcal{N}(\mu,\, \sigma^2)$. A more formal term for this is univariate normal, where univariate means 'one variable'. The probability distribution of the Gaussian is known as the univariate normal distribution.
What might a multivariate normal distribution be? Multivariate means multiple variables. Our goal is to be able to represent a normal distribution across multiple dimensions. I don't necessarily mean spatial dimensions - it could be position, velocity, and acceleration. Consider a two dimensional case. Let's say we believe that $x = 2$ and $y = 17$. This might be the x and y coordinates for the position of our dog, it might be the position and velocity of our dog on the x-axis, or the temperature and wind speed at our weather station. It doesn't really matter. We can see that for $N$ dimensions, we need $N$ means, which we will arrange in a column matrix (vector) like so:

$$\mu = \begin{bmatrix}\mu_1 \\ \mu_2 \\ \vdots \\ \mu_N\end{bmatrix}$$

Therefore for this example we would have

$$\mu = \begin{bmatrix}2 \\ 17\end{bmatrix}$$
The next step is representing our variances. At first blush we might think we would also need N variances for N dimensions. We might want to say the variance for x is 10 and the variance for y is 4, like so.

$$\sigma^2 = \begin{bmatrix}10 \\ 4\end{bmatrix}$$
This is incorrect because it does not consider the more general case. For example, suppose we were tracking house prices versus the total square footage of the floor plan. These numbers are correlated. It is not an exact correlation, but in general houses in the same neighborhood are more expensive if they have a larger floor plan. We want a way to express not only what we think the variance is in the price and in the square footage, but also the degree to which they are correlated. The covariance describes how two variables are correlated. Covariance is short for correlated variances.
We use a covariance matrix to denote covariances with multivariate normal distributions, and it looks like this:

$$\Sigma = \begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{n1} & \sigma_{n2} & \cdots & \sigma_n^2
\end{bmatrix}$$
If you haven't seen this before it is probably a bit confusing. Instead of starting with the mathematical definition I will build your intuition with thought experiments. At this point, note that the diagonal contains the variance for each state variable, and that each off-diagonal element (covariance) represents how much the $i$th (horizontal row) and $j$th (vertical column) state variables are linearly correlated with each other. In other words, covariance is a measure of how much two variables change together.
A couple of examples. Generally speaking as the square footage of a house increases the price increases. These variables are correlated. As the temperature of an engine increases its life expectancy lowers. These are inversely correlated. The price of tea and the number of tail wags my dog makes have no relation to each other, and we say they are not correlated - each can change independent of the other.
Correlation implies prediction. If our houses are in the same neighborhood and you have twice the square footage, I can predict that the price is likely to be higher. This is not guaranteed, as there are other factors, such as proximity to garbage dumps, which also affect the price. If my car engine significantly overheats I start planning on replacing it soon. If my dog wags his tail I don't conclude that tea prices will be increasing.
A covariance of 0 indicates no correlation. So, for example, if the variance for x is 10, the variance for y is 4, and there is no linear correlation between x and y, then we would say

$$\Sigma = \begin{bmatrix}10 & 0 \\ 0 & 4\end{bmatrix}$$

If there was a small amount of correlation between x and y we might have

$$\Sigma = \begin{bmatrix}10 & 1.2 \\ 1.2 & 4\end{bmatrix}$$

where 1.2 is the covariance between x and y. Note that this matrix is always symmetric - the covariance between x and y is always equal to the covariance between y and x. That is, $\sigma_{xy} = \sigma_{yx}$ for any x and y.
Now, without explanation, here is the multivariate normal distribution in $n$ dimensions.

$$f(\mathbf{x},\, \mu,\, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}}\, \exp\left[-\frac{1}{2}(\mathbf{x}-\mu)^\mathsf{T}\Sigma^{-1}(\mathbf{x}-\mu)\right]$$
I urge you to not try to remember this function. We will program it in a Python function and then call it if we need to compute a specific value. Plus, the Kalman filter equations compute this for us automatically; we never have to explicitly compute it. However, note that it has the same form as the univariate normal distribution. It uses matrices instead of scalar values, and the root of $\pi$ is scaled by $n$. If you set $n=1$ then it turns into the univariate equation. Here is the univariate equation for reference:

$$f(x,\, \mu,\, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$$
The multivariate version merely replaces the scalars of the univariate equations with matrices. If you are reasonably well-versed in linear algebra this equation should look quite manageable; if not, don't worry! Let's plot it and see what it looks like.
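As a rough illustration of how such a surface can be produced, here is a minimal sketch using SciPy and Matplotlib rather than the book's plotting helpers. The mean and variances match the plot described next; I assume zero correlation for simplicity, and the plotting choices are my own.

```python
# A sketch of plotting this Gaussian as a 3D surface with SciPy and Matplotlib.
# The mean and variances are the values used in the plot described in the text;
# the plotting approach itself is my own, not the book's helper function.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

mu = [2., 17.]
P = [[10., 0.],
     [0., 4.]]

xs, ys = np.meshgrid(np.linspace(-8, 12, 100), np.linspace(10, 24, 100))
zs = multivariate_normal(mean=mu, cov=P).pdf(np.dstack((xs, ys)))

ax = plt.figure().add_subplot(projection='3d')
ax.plot_surface(xs, ys, zs, cmap='viridis')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('probability density')
plt.show()
```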
This is a plot of a two dimensional multivariate Gaussian with a mean of $\mu = \begin{bmatrix}2 \\ 17\end{bmatrix}$ and a covariance with $\sigma_x^2 = 10$ and $\sigma_y^2 = 4$. The three dimensional shape shows the probability density for any value of $(x, y)$ on the z-axis. I have projected the variance for x and y onto the walls of the chart - you can see that they take on the normal Gaussian bell curve shape. The curve for x is wider than the curve for y, which is explained by $\sigma_x^2 = 10$ and $\sigma_y^2 = 4$. The highest point of the curve is centered over (2, 17), the means for x and y.
All multivariate Gaussians form this shape. If we think of this as the Gaussian for the position of a dog, the z-value at each point (x, y) is the probability density of it being at that position. So, he has the highest probability of being near (2, 17), a modest probability of being near (5, 14), and a very low probability of being near (10, 10).
More details are in the Kalman Filter Math chapter. Here we need to understand the following.
The diagonal of the matrix contains the variance for each variable. This is because the covariance between x and itself is the variance of x: $\sigma_{xx} = \sigma_x^2$.
Each off-diagonal element contains $\sigma_{ij}$ - the covariance between i and j. This tells us how much linear correlation there is between the two variables. 0 means no correlation, and as the number gets larger the correlation gets stronger.
$\sigma_{ij} = \sigma_{ji}$: if i gets larger when j gets larger, then it must be true that j gets larger when i gets larger.
This chart only shows a 2 dimensional Gaussian, but the equation works for any number of dimensions > 0.
FilterPy [2] implements the equation with the function filterpy.stats.multivariate_gaussian. I am not showing the code here because I have taken advantage of the linear algebra solving apparatus of NumPy to efficiently compute a solution - the code does not correspond to the equation in a one-to-one manner.
In the last chapter we did not have to explicitly program the univariate equation into our filter. The filter equations were generated by substituting the univariate equation into Bayes' equation. The same is true for the multivariate case. You will not be using this function very often in this book, so I would not spend a lot of time mastering it unless it interests you.
SciPy's stats module implements the multivariate normal equation with multivariate_normal(). It implements a 'frozen' form where you set the mean and covariance once, and then calculate the probability density for any number of values for x over any arbitrary number of calls. This is much more efficient than recomputing everything in each call. So, if you have version 0.14 or later you may want to substitute my function for the built-in version. Use scipy.version.version to get the version number. I named my function multivariate_gaussian() to ensure it is never confused with the SciPy version. I will say that for a single call, where the frozen variables do not matter, mine consistently runs faster as measured by the timeit function.
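If you want to try the comparison yourself, a rough sketch with timeit might look like this; exact results depend on your machine and library versions.

```python
# A rough sketch of comparing a single call of each implementation with timeit.
# Exact timings vary by machine and library version.
import timeit

setup = """
import numpy as np
from filterpy.stats import multivariate_gaussian
from scipy.stats import multivariate_normal
x = np.array([2.5, 7.3])
mu = np.array([2.0, 7.0])
P = np.array([[8., 0.], [0., 4.]])
"""

print(timeit.timeit('multivariate_gaussian(x, mu, P)', setup=setup, number=10000))
print(timeit.timeit('multivariate_normal(mu, P).pdf(x)', setup=setup, number=10000))
```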
The tutorial [1] for the scipy.stats module explains 'freezing' distributions and other very useful features.
I'll demonstrate using it, and then move on to more interesting things.
First, let's find the probability density for our dog being at (2.5, 7.3) if we believe he is at (2, 7) with a variance of 8 for $x$ and a variance of 4 for $y$.
Start by setting $x$ to (2.5, 7.3). You can use a tuple, list, or NumPy array.
Next, we set the mean of our belief:
Finally, we have to define our covariance matrix. In the problem statement we did not mention any correlation between $x$ and $y$, and we will assume there is none. This makes sense; a dog can choose to independently wander in either the $x$ direction or the $y$ direction without affecting the other. If there is no correlation between the values, place the variances in the diagonal and set the off-diagonal elements to zero. I will use the name P. Kalman filters use the name $\mathbf{P}$ for the covariance matrix, and we need to become familiar with the conventions.
Now call the function
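Putting the steps above together, a minimal sketch looks like this. I am assuming the signature multivariate_gaussian(x, mu, cov).

```python
# A minimal sketch of the steps described above. I am assuming the signature
# multivariate_gaussian(x, mu, cov), which returns the probability density at x.
import numpy as np
from filterpy.stats import multivariate_gaussian

x = np.array([2.5, 7.3])    # the point we want the probability density for
mu = np.array([2.0, 7.0])   # the mean of our belief
P = np.array([[8., 0.],
              [0., 4.]])    # no correlation, so the off-diagonal terms are zero

print(multivariate_gaussian(x, mu, P))
```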
These numbers are not easy to interpret. Let's view a plot of it.
The result is clearly a 3D bell shaped curve. We can see that the Gaussian is centered around (2, 7), and that the probability density quickly drops away in all directions. On the sides of the plot I have drawn the Gaussians for $x$ in green and for $y$ in orange.
Let's look at this in a slightly different way. Instead of plotting a surface showing the probability distribution I will generate 1,000 points drawn from the distribution $\mathcal{N}(\mu,\, \mathbf{P})$ defined above.
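A sketch of how such samples could be drawn with NumPy, using the same mean and covariance as above; the book's plot also adds the side contours, which I omit here.

```python
# A sketch of drawing 1,000 samples from the same distribution with NumPy.
import numpy as np
import matplotlib.pyplot as plt

samples = np.random.multivariate_normal(mean=[2.0, 7.0],
                                        cov=[[8., 0.], [0., 4.]],
                                        size=1000)
plt.scatter(samples[:, 0], samples[:, 1], alpha=0.3)
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
```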
We can think of the sampled points as being possible locations for our dog given that particular mean and covariance. The contours on the side show the variance in the points for $x$ and $y$ only. We can see that he is far more likely to be at (2, 7), where there are many points, than at (-5, 5), where there are few.
As beautiful as this is, it is hard to get useful information. For example, it is not easy to tell if and both have the same variance. In most of the book I'll display Gaussians using contour plots. Helper functions in FilterPy plot them for us. If you are interested in linear algebra look at the code used to produce these contours, otherwise feel free to ignore it.
For those of you viewing this online or in IPython Notebook on your computer, here is an animation.
(source: http://git.io/vqxLS)
From a mathematical perspective these display the values that the multivariate Gaussian takes for a specific standard deviation. This is like taking a horizontal slice out of the 3D plot. By default it displays one standard deviation, but you can use the variance parameter to control what is displayed. For example, variance=3**2 would display the 3rd standard deviation, and variance=[1, 4, 9] would display the 1st, 2nd, and 3rd standard deviations as in the chart below. This takes 3 different horizontal slices of the multivariate Gaussian chart and displays them in 2D.
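As a sketch of how that might look in code, assuming FilterPy's plot_covariance_ellipse helper and the variance parameter discussed above:

```python
# A sketch using FilterPy's plotting helper. I am assuming plot_covariance_ellipse
# here; variance=[1, 4, 9] asks for the 1, 2, and 3 standard deviation contours.
import matplotlib.pyplot as plt
from filterpy.stats import plot_covariance_ellipse

plot_covariance_ellipse((2, 7), [[8., 0.], [0., 4.]], variance=[1, 4, 9])
plt.show()
```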
However, the solid colors may suggest that the probability distribution is constant between the standard deviations. This is not true, as you can tell from the 3D plot of the Gaussian. Here is a 2D shaded representation of the probability distribution for this covariance.
Thinking about the physical interpretation of these plots clarifies their meaning. The first plot uses a mean of (2, 7) and the same variance for both x and y.

Let this be our current belief about the position of our dog in a field. In other words, we believe that he is positioned at (2, 7) with equal uncertainty in x and y. The contour plot shows where we believe the dog is located with the '+' in the center of the ellipse. The ellipse shows the boundary for $1\sigma$. As in the univariate case, 68% of the data will fall within this ellipse. Recall from the Gaussians chapter the 68-95-99.7 rule - 68% of all values will fall within one standard deviation ($1\sigma$), 95% within $2\sigma$, and 99.7% within $3\sigma$. This rule applies in any number of dimensions. The dog could be at (356443, 58483), but the chances for values that far away from the mean are infinitesimally small.
A Bayesian way of thinking about this is that the ellipse shows us the amount of error in our belief. A tiny circle would indicate that we have a very small error, and a very large circle indicates a lot of error in our belief. We will use this throughout the rest of the book to display and evaluate the accuracy of our filters at any point in time.
The second plot uses the same mean but a different covariance.

This time we use a different variance for x than for y. The result is a tall and narrow ellipse. We can see that there is a lot more uncertainty in the y value than in the x value. Our belief that the position is (2, 7) is the same in both cases, but the uncertainties are different. In this case the standard deviation for y is larger than the standard deviation for x. This sort of thing happens naturally as we track objects in the world - one sensor has a better view of the object, or is closer, than another sensor, resulting in different uncertainties in each axis.
The third plot shows a mean and covariance with nonzero values in the off-diagonal elements.
This is the first contour that has values in the off-diagonal elements of the covariance, and this is the first contour plot with a slanted ellipse. This is not a coincidence. The two facts are telling us the same thing. A slanted ellipse tells us that the x and y values are somehow correlated. We denote that in the covariance matrix with values off the diagonal.
What does this mean in physical terms? Think of parallel parking a car. You cannot pull up beside the spot and then move sideways into the space because cars cannot drive sideways. x and y are not independent. This is a consequence of the steering mechanism. When the steering wheel is turned the car rotates around its rear axle while moving forward. Or think of a horse attached to a pivoting exercise bar in a corral. The horse can only walk in circles; he cannot vary x and y independently, which means he cannot walk in a straight line or a zig zag. If x changes, y must also change in a defined way.
When we see this ellipse we know that x and y are correlated, and that the correlation is "strong". The size of the ellipse shows how much error we have in each axis, and the slant shows the relative sizes of the variances in x and y. For example, a very long and narrow ellipse tilted almost to the horizontal has a strong correlation between x and y (because the ellipse is narrow), and the variance of x is much larger than that of y (because the ellipse is much longer in x).
Using Correlations to Improve Estimates
Suppose we believe our dog is at position (5, 10) with some given covariance. If the standard deviation in x and y is each 2 meters, but they are strongly correlated, the covariance contour would look something like this.
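Here is a hedged sketch of how such a contour could be drawn. The 2 meter standard deviations come from the text; the off-diagonal value is invented to make the correlation strong.

```python
# A sketch of a strongly correlated covariance contour. The variances (2**2 = 4)
# come from the text; the off-diagonal value 3.9 is invented to illustrate a
# strong correlation, which gives a long, narrow, slanted ellipse.
import matplotlib.pyplot as plt
from filterpy.stats import plot_covariance_ellipse

P = [[4.0, 3.9],
     [3.9, 4.0]]
plot_covariance_ellipse((5, 10), P)
plt.show()
```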
Now suppose I were to tell you that the actual position of the dog in the x-axis is 7.5. What can we infer about his position in the y-axis? The position is extremely likely to lie within the $3\sigma$ covariance ellipse. We can infer the position in y based on the covariance matrix because there is a correlation between x and y. I've roughly illustrated the likely value for y as a blue filled circle.
A word about correlation and independence. If variables are independent they can vary separately. If you walk in an open field, you can move in the x direction (east-west), the y direction (north-south), or any combination thereof. Independent variables are always also uncorrelated. Except in special cases, the reverse does not hold true. Variables can be uncorrelated, but dependent. For example, consider the pair $(x, y)$ where $y = x^2$. Correlation is a linear measurement, so x and y are uncorrelated. However, they are obviously dependent on each other.
Multiplying Multidimensional Gaussians
In the previous chapter we incorporated an uncertain measurement with an uncertain estimate by multiplying their Gaussians together. The result was another Gaussian with a smaller variance. If two pieces of uncertain information corroborate each other we should be more certain in our conclusion. The graphs look like this:
The combination of measurement 1 and 2 yields more certainty, so the new Gaussian is taller and narrower - the variance became smaller. The same happens in multiple dimensions with multivariate Gaussians.
Here are the equations for multiplying multivariate Gaussians. They are generated by plugging the Gaussians for the prior and the estimate into Bayes' Theorem. I gave you the algebra for the univariate case in the last section of the last chapter.

$$\mu = \Sigma_2(\Sigma_1 + \Sigma_2)^{-1}\mu_1 + \Sigma_1(\Sigma_1 + \Sigma_2)^{-1}\mu_2$$

$$\Sigma = \Sigma_1(\Sigma_1 + \Sigma_2)^{-1}\Sigma_2$$

You will not need to remember these equations, as they are computed by the Kalman filter equations that will be presented shortly. This computation is also available in FilterPy using the multivariate_multiply() method, which you can import from filterpy.stats.
To give you some intuition about this, recall the equations for multiplying univariate Gaussians:

$$\mu = \frac{\sigma_1^2\mu_2 + \sigma_2^2\mu_1}{\sigma_1^2 + \sigma_2^2}, \qquad \sigma^2 = \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2 + \sigma_2^2}$$
This looks similar to the multivariate equations. This will be more obvious if you recognize that matrix inversion, denoted by the -1 power, is like division since $\frac{a}{b} = ab^{-1}$. I will rewrite the inversions as divisions - this is not a mathematically correct thing to do, but it does help us see what is going on.

$$\mu \approx \frac{\Sigma_2\mu_1 + \Sigma_1\mu_2}{\Sigma_1 + \Sigma_2}, \qquad \Sigma \approx \frac{\Sigma_1\Sigma_2}{\Sigma_1 + \Sigma_2}$$
In this form we can surmise that these equations are the linear algebra form of the univariate equations.
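As a quick sanity check, here is a sketch (with arbitrary numbers) showing that the multivariate product collapses to the univariate equations when the matrices are 1x1. I am assuming that multivariate_multiply(mean1, cov1, mean2, cov2) returns the product's mean and covariance.

```python
# A sanity check with arbitrary numbers: the multivariate product collapses
# to the univariate equations when the "matrices" are 1x1.
from filterpy.stats import multivariate_multiply

mu1, var1 = 10.0, 0.2**2
mu2, var2 = 11.0, 0.1**2

# univariate product equations from the previous chapter
mu_u = (var1*mu2 + var2*mu1) / (var1 + var2)
var_u = (var1*var2) / (var1 + var2)

# the same product computed with 1x1 matrices
mu_m, cov_m = multivariate_multiply([mu1], [[var1]], [mu2], [[var2]])

print(mu_u, var_u)
print(mu_m, cov_m)
```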
Now let's explore multivariate Gaussians in terms of a concrete example. Suppose that we are tracking an aircraft with two radar systems. I will ignore altitude so I can use two dimensional plots. Radars give us the range and bearing to a target. We start out being uncertain about the position of the aircraft, so the covariance, which is our uncertainty about the position, might look like this. In the language of Bayesian statistics this is our prior.
Now suppose that there is a radar to the lower left of the aircraft. Further suppose that the radar is very accurate in the bearing measurement, but not very accurate at the range. That covariance, which is the uncertainty in the reading, might look like this (plotted in blue):
Recall that Bayesian statistics calls this the evidence. The ellipse points towards the radar. It is very long because the range measurement is inaccurate, and the aircraft could be within a considerable distance of the measured range. It is very narrow because the bearing estimate is very accurate and thus the aircraft must be very close to the bearing estimate.
We want to find the posterior - the mean and covariance that result from incorporating the evidence into the prior. As in every chapter so far we multiply them together. We have the equations for this and could use them directly, but I will use FilterPy's multivariate_multiply method.
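A minimal sketch of that call, with invented numbers; the book's plots use their own values.

```python
# A minimal sketch of multiplying the prior and the radar measurement.
# The numbers below are invented for illustration.
from filterpy.stats import multivariate_multiply

prior_mean, prior_cov = [10., 10.], [[6., 0.], [0., 6.]]
z_mean, z_cov = [10., 10.], [[2., 1.9], [1.9, 2.]]   # long, narrow, slanted ellipse

post_mean, post_cov = multivariate_multiply(prior_mean, prior_cov, z_mean, z_cov)
print(post_mean)
print(post_cov)
```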
Here I have plotted the original estimate (prior) in a very transparent yellow, the radar reading in blue (evidence), and the final estimate (posterior) in yellow.
The Gaussian retained the same shape and position as the radar measurement, but is smaller. We've seen this with one dimensional Gaussians. Multiplying two Gaussians makes the variance smaller because we are incorporating more information, hence we are less uncertain. But the main point I want to make is that the covariance shape reflects the physical layout of the aircraft and the radar system.
Now let's say we get a measurement from a second radar, this one to the lower right, which I will plot in blue against the yellow covariance of our current belief.
Again, to incorporate this new information we will multiply the Gaussians together.
You can see how the multivariate Gaussian's shape reflects the geometry of the problem. The first radar system was at a 45 degree angle to the aircraft, and its error in the bearing measurement was much smaller than the error in the range. This resulted in a long and narrow covariance ellipse whose major axis was aligned with the angle to the radar system. The next radar system was also at a 45 degree angle, but to the right, so the two measurements were orthogonal to each other. This allowed us to triangulate on the aircraft, resulting in a very accurate estimate. We didn't explicitly write any code to perform triangulation; it was a natural outcome of multiplying the Gaussians of each measurement together.
To make sure you understand this, what would the Gaussian look like if we only had one radar station, and we received several measurements from it over a short period of time? Clearly the Gaussian would remain elongated in the axis of the bearing angle. Without a second radar station no information would be provided to reduce the error on that axis, so it would remain quite large. As the aircraft moves the bearing will typically change by a small amount, so over time some of the error will be reduced, but it will never be reduced as much as a second radar station would provide.
To round this out let's quickly redo this example, but with the first radar system in a different position. I will position it directly to the left of the aircraft. The only change I need to make is to the Gaussian for the measurement from the radar. In the previous example I used the measurement covariance plotted above.
Why did this result in a 45 degree ellipse? Think about that before reading on. It was 45 degrees because the values in the diagonal were identical. So if x=10 then y=10, and so on. We can alter the angle by making the variance for x or y different. The radar is now to the left of the aircraft, so I model the measurement with a covariance whose variance along x (the range direction) is much larger than the variance along y. Incidentally, I invented the specific values. We haven't learned how to transform a matrix from one coordinate system to another.
In the next graph I plot the original estimate in a very light yellow, the radar measurement in blue, and the new estimate based on multiplying the two Gaussians together in yellow.
Now we can incorporate the measurement from the second radar system, which we will leave in the same position as before.
Our estimate is not as accurate as the previous example. The two radar stations are no longer orthogonal to each other relative to the aircraft's position so the triangulation is not optimal. Imagine standing on the ground and trying to triangulate on an aircraft in the sky with a transit. If you took a measurement, moved the transit 5 meters and took a second measurement the tiny change in angle between the two measurements would result in a very poor measurement because a very small error in either measurement would give a wildly different result. Think of the measurements as two nearly parallel lines. Changing the angle between them slightly will move the intersection between the two by a large amount. If you were to take the measurements from positions 100 km apart the lines might be nearly perpendicular to each other, in which case a small measurement error would result in a very small shift in the intersection point.
Hidden Variables
You can probably already see why a multivariate Kalman filter can perform better than a univariate one. The last section demonstrated how we can use correlations between variables to significantly improve our estimates. We can take this much further. This section contains the key insight to this chapter, so read carefully.
Let's say we are tracking an aircraft and we get the following data for the $x$ and $y$ coordinates at times $t$ = 1, 2, and 3 seconds. What does your intuition tell you the position will be at time $t$ = 4 seconds?

It appears that the aircraft is flying in a straight line and we know that aircraft cannot turn on a dime. The most reasonable guess is that at $t$ = 4 the aircraft is at (4, 4). I will depict that with a green arrow.
You made this inference because you inferred a constant velocity for the airplane. The reasonable assumption is that the aircraft is moving one unit each in x and y per time step.
Think back to the g-h filter chapter when we were trying to improve the weight predictions of a noisy scale. We incorporated weight gain into the equations because it allowed us to make a better prediction of the weight the next day. The g-h filter uses the g parameter to scale the amount of significance given to the current weight measurement, and the h parameter to scale the amount of significance given to the weight gain.
We are going to do the same thing with our Kalman filter. After all, the Kalman filter is a form of a g-h filter. In this case we are tracking an airplane, so instead of weight and weight gain we need to track position and velocity. Weight gain is the derivative of weight, and of course velocity is the derivative of position. It's impossible to plot and understand the 4D chart that would be needed to plot x and y and their respective velocities, so let's do it for $x$ only, knowing that the math generalizes to more dimensions.
At time 1 we might be fairly certain about the position (x=0) but have no idea about the velocity. We can plot that with a covariance matrix like this. The narrow width expresses our relative certainty about position, and the tall height expresses our lack of knowledge about velocity.
Now after one second we get a position update of x=5.
This implies that our velocity is roughly 5 m/s. But of course position and velocity are correlated. If the velocity is 5 m/s the position would be 5, but if the velocity was 10 m/s the position would be 10. So let's draw a velocity covariance matrix in red.
This superposition of the two covariances is where the magic happens. The only reasonable estimate at time t=1 (where position=5) is roughly the intersection between the two covariance matrices! More exactly, we can use the math from the last section and multiply the two covariances together. From a Bayesian point of view we multiply the prior with the evidence to get the posterior. If we multiply the position covariance with the velocity covariance using the Bayesian equations we get the result shown in the next chart.
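Here is a sketch of that multiplication in code, with invented numbers: one Gaussian is confident about position but not velocity, the other encodes the correlation implied by the prediction, and their product is much smaller than either.

```python
# A sketch of the idea with invented numbers. One Gaussian is confident about
# position but not velocity; the other encodes the position-velocity correlation
# implied by the prediction. Their product is much smaller than either.
from filterpy.stats import multivariate_multiply

# state is [position, velocity]
meas_mean, meas_cov = [5., 0.], [[0.5, 0.], [0., 40.]]    # position known, velocity not
pred_mean, pred_cov = [5., 5.], [[20., 18.], [18., 20.]]  # strongly correlated

post_mean, post_cov = multivariate_multiply(meas_mean, meas_cov, pred_mean, pred_cov)
print(post_mean)
print(post_cov)   # much smaller than either input covariance
```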
We can see that the new covariance (the posterior) lies at the intersection of the position covariance and the velocity covariance. It is slightly tilted, showing that there is some correlation between the position and velocity. Far more importantly, it is much smaller than either the position or velocity covariances. In the previous chapter our variance would get smaller each time we performed an update() because the previous estimate was multiplied by the new measurement. The same thing happens here. However, the amount by which the covariance gets smaller is much larger in this chapter. This is because we are using two different kinds of information which are nevertheless correlated. Knowing the velocity approximately and the position approximately allows us to very quickly hone in on the correct answer.
This is a key point in Kalman filters, so read carefully! Our sensor is only detecting the position of the aircraft (how doesn't matter). This is called an observed variable. It does not have a sensor that provides velocity. But based on the position estimates we can compute velocity. In Kalman filters we would call the velocity a hidden variable. Hidden means what it sounds like - there is no sensor that is measuring velocity, thus its value is hidden from us. We are able to use the correlation between position and velocity to infer its value very accurately.
To round out the terminology there are also unobserved variables. For example, the aircraft's state includes things such as heading, engine RPM, weight, color, the first name of the pilot, and so on. We cannot sense these directly using the position sensor so they are not observed. There is no way to infer them from the sensor measurements and correlations (red planes don't go faster than white planes), so they are not hidden. Instead, they are unobservable. If you include an unobserved variable in your filter state the estimate for that variable will be nonsense.
What makes this possible? Imagine for a moment that we superimposed the velocity from a different airplane over the position graph. Clearly the two are not related, and there is no way that combining the two could possibly yield any additional information. In contrast, the velocity of this airplane tells us something very important - the direction and speed of travel. So long as the aircraft does not alter its velocity, the velocity allows us to predict where the next position will be. Once the assumed velocity is off by even a relatively small amount, the probability that it is a good match with the measured position becomes very small. Think about it - if you suddenly change direction your position is also going to change a lot. If the measured position is not in the direction implied by the velocity it is very unlikely to be true. The two are correlated, so if the velocity changes so must the position, and in a predictable way.
It is important to understand that we are taking advantage of the fact that velocity and position are correlated. We get a rough estimate of velocity from the distance and time between two measurements, and then use Bayes' theorem to produce very accurate estimates after only a few observations. Please reread this section if you have any doubts. If you grasp this point the rest is straightforward. If you do not, you will quickly find it impossible to reason about what you will learn in the rest of this chapter.
In summary we have taken advantage of the geometry and correlations of the system to produce a very accurate estimate. The math does not care whether we are working with two positions, or a position and a correlated velocity, or if these are spatial dimensions. If floor space is correlated to house price you can write a Kalman filter to track house prices. If age is correlated to disease incidence you can write a Kalman filter to track diseases. If the zombie population is inversely correlated with the number of shotguns then you can write a Kalman filter to track zombies. I showed you this in terms of geometry and talked about triangulation. That was just to build your intuition. Get used to thinking of these as Gaussians with correlations. If we can express our uncertainties as a multidimensional Gaussian we can then multiply the prior with the evidence and get a much more accurate result.
References
[1] http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html
[2] FilterPy library. Roger Labbe. https://github.com/rlabbe/filterpy