In the last chapter we discussed the difficulties that nonlinear systems pose. This nonlinearity can appear in two places. It can be in our measurements, such as a radar that is measuring the slant range to an object. Slant range requires you to take a square root to compute the x,y coordinates:
The nonlinearity can also occur in the process model - we may be tracking a ball traveling through the air, where the effects of gravity and air drag lead to highly nonlinear behavior. The standard Kalman filter performs poorly or not at all with these sorts of problems.
In the last chapter I showed you a plot like this. I have altered the equation somewhat to emphasize the effects of nonlinearity.
I generated this by taking 500,000 samples from the input, passing it through the nonlinear transform, and building a histogram of the result. We call these points sigma points. From the output histogram we can compute a mean and standard deviation which would give us an updated, albeit approximated Gaussian.
It has perhaps occurred to you that this sampling process constitutes a solution to our problem. Suppose for every update we generated 500,000 points, passed them through the function, and then computed the mean and variance of the result. This is called a 'Monte Carlo' approach, and it used by some Kalman filter designs, such as the Ensemble filter and particle filter. Sampling requires no specialized knowledge, and does not require a closed form solution. No matter how nonlinear or poorly behaved the function is, as long as we sample with enough sigma points we will build an accurate output distribution.
"Enough points" is the rub. The graph above was created with 500,000 sigma points, and the output is still not smooth. What's worse, this is only for 1 dimension. In general, the number of points required increases by the power of the number of dimensions. If you only needed 500 points for 1 dimension, you'd need 500 squared, or 250,000 points for two dimensions, 500 cubed, or 125,000,000 points for three dimensions, and so on. So while this approach does work, it is very computationally expensive. The Unscented Kalman filter uses a similar technique but reduces the amount of computation needed by a drastic amount by using a deterministic method of choosing the points.
Sigma Points - Sampling from the Distribution
Let's look at the problem in terms of a 2D covariance ellipse. I choose 2D merely because it is easy to plot; this will extend to any number of dimensions. Assuming some arbitrary nonlinear function, we will take random points from the first covariance ellipse, pass them through the nonlinear function, and plot their new position. Then we can compute the the mean and covariance of the transformed points, and use that as our estimate of the mean and probability distribution.
On the left we show an ellipse depicting the distribution of two state variables. The arrows show how several randomly sampled points might be transformed by some arbitrary nonlinear function to a new distribution. The ellipse on the right is drawn semi-transparently to indicate that it is an estimate of the mean and variance of this collection of points - if we were to sample, say, a million points the shape of the points might be very far different than an ellipse.
Let's look at that by running a bunch of points through a nonlinear function. We will write a function to pass randomly generated points with a Gaussian distribution through the system
for the mean and covariance
This plot shows the strong nonlinearity that occurs with this function, and the large error that would result if we linearized the function at (0,0), which is what filters like the Extended Kalman filters do (we will be learning this in the next chapter).
Choosing Sigma Points
I used 10,000 randomly selected sigma points to generate this solution. While the computed mean is quite accurate, computing 10,000 points for every update would cause our filter to be very slow. So, what would be fewest number of sampled points that we can use, and what kinds of constraints does this problem formulation put on the points? We will assume that we have no special knowledge about the nonlinear function as we want to find a generalized algorithm that works for any function. We can visualize this algorithm using the following diagram.
Let's consider the simplest possible case and see if it offers any insight. The simplest possible system is the identity function. In mathematical notation this is . It should be clear that if our algorithm does not work for the identity function then the filter will never converge. In other words, if the input is 1 (for a one dimensional system), the output must also be 1. If the output was different, such as 1.1, then when we fed 1.1 into the transform at the next time step, we'd get out yet another number, maybe 1.23. The filter would run away (diverge).
The fewest number of points that we can use is one per dimension. This is the number that the linear Kalman filter uses. The input to a Kalman filter for the distribution is itself. So while this works for the linear case, it is not a good answer for the nonlinear case.
Perhaps we can use one point per dimension, but altered somehow. However, if we were to pass some value into the identity function it would not converge, so this is not a possible algorithm. We must conclude that a one point sample will not work.
So, what is the next lowest number we can choose? Consider the fact that Gaussians are symmetric, and that we probably want to always have one of our sample points be the mean of the input for the identity function to work. Two points would require us to select the mean, and then one other point. That one other point would introduce an asymmetry in our input that we probably don't want. It would be very difficult to make this work for the identity function .
The next lowest number is 3 points. 3 points allows us to select the mean, and then one point on each side of the mean, as depicted on the chart below.
So we can take those three points, pass them through a nonlinear function f(x), and compute a new mean and variance. We can compute the mean as the average of the 3 points, but that is not very general. For example, for a very nonlinear problem we might want to weight the center point much higher than the outside points, or we might want to weight the outside points higher if the distribution is not Gaussian. A more general approach is to compute the mean as .
For this to work for identity we will want the sums of the weights for the mean to equal one. We can always come up with counterexamples, but in general if the sum is greater or less than one the sampling will not yield the correct output. Given that, we then have to select sigma points and their corresponding weights so that they compute to the mean and variance of the input Gaussian. It is possible to use different weights for the mean () and for the variance (). So we can write
If we look at this is should be clear that there is no one unique answer - the problem is unconstrained. For example, if you choose a smaller weight for the point at the mean for the input, you could compensate by choosing larger weights for the rest of the , and vice versa. We can use different weights for the mean and variances, or the same weights. Indeed, these equations do not require that any of the points be the mean of the input at all, though it seems 'nice' to do so, so to speak.
But before we go on I want to make sure the idea is clear. We are choosing 3 points for each dimension in our covariances. That choice is entirely deterministic. Below are three different examples for the same covariance ellipse.
The points do not lie along the major and minor axis of the ellipse; nothing in the constraints above require me to do that. Furthermore, in each case I show the points evenly spaced; again, the constraints above do not require that.
We can see that the arrangement and spacing of the sigma points will affect how we sample our distribution. Points that are close together will sample local effects, and thus probably work better for very nonlinear problems. Points that are far apart, or far off the axis of the ellipse will sample non-local effects and non Gaussian behavior. However, by varying the weights used for each point we can mitigate this. If the points are far from the mean but weighted very slightly we will incorporate some of the knowledge about the distribution without allowing the nonlinearity of the problem to create a bad estimate.
Van der Merwe's Scaled Sigma Point Algorithm
There are several published ways for selecting the sigma points for the Unscented Kalman filter, and you are free to invent your own. However, since 2005 or so research and industry have mostly settled on the version published by Rudolph Van der Merwe in his 2004 PhD dissertation [1] because it performs well with a variety of problems and it has a good tradeoff between performance and accuracy. It is a slight reformulation of the Scaled Unscented Transform published by Simon J. Julier [2].
Before we work through the derivation, let's look at an example. I will plot the sigma points on top of a covariance ellipse showing the first and second standard deviations, and size them based on the weights assigned to them.
We can see that the sigma points lie between the first and second deviation, and that the larger spreads the points out further. Furthermore, the larger weighs the mean (center point) higher than the smaller , and weighs the rest of the sigma points less. This should fit our intuition - the further a point is from the mean the less we should weight it. We don't know how these weights and sigma points are selected yet, but the choices look reasonable.
Van der Merwe's formulation uses 3 parameters to control how the sigma points are distributed and weighted: , , and . We will go into detail with these later. For now, is a good choice for Gaussian problems, is a good choice for , and is an appropriate choice for , where a larger value spreads the sigma points further from the mean.
Our first sigma point is always going to be the mean of our input. This is the sigma point displayed in the center of the ellipses in the diagram above. We will call this . So,
For notational convenience we define . Then the remaining sigma points are computed as
In other words, we scale the covariance matrix by a constant, take the square root of it, and then to ensure symmetry both add and subtract if from the mean. We will discuss how you take the square root of a matrix later.
Van der Merwe's forumation uses one set of weights for the means, and another set for the covariance. The weights for the mean of is computed as
The weight for the covariance of is
The weights for the rest of the sigma points are the same for the mean and covariance. They are
It may not be obvious why this is 'correct', and indeed, it cannot be proven that this is ideal for all nonlinear problems. But you can see that we are choosing the sigma points proportional to the square root of the covariance matrix, and the square root of variance is standard deviation. So, the sigma points are spread roughly according to 1 standard deviation. However, there is an term in there - the more dimensions there are the more the points will be spread out and weighed less.
The update step converts the sigmas into measurement space via the h(x)
function.
The mean and covariance of those points is computed with the unscented transform. The residual and Kalman gain is then computed. The cross variance is computed as:
Finally, we compute the new state estimate using the residual and Kalman gain:
and the new covariance is computed as:
This function can be implemented as follows, assuming it is a method of a class that stores the necessary matrices and data.
The computation of the new mean and covariance is called the unscented transform.
The Unscented Transform
The unscented transform is the core of the algorithm yet it is remarkably simple. For each dimension in the state space we deterministically choose 3 sigma points and corresponding weights using the algorithm above. We pass the sigma points through the nonlinear function yielding a transformed set of points
We then compute the new mean and covariance using these equations.
These equations should be familar - they are the constraint equations we developed above. They perhaps don't look like much, but let's see an example of their power.
Earlier we wrote a function that found the mean of a distribution by passing 50,000 points through a nonlinear function. Let's now use 5 sigma points - we will pass them through the nonlinear function, and compute their mean with the unscented transform. This code is below. Under the comment ### create sigma points
I use code from FilterPy to generate the sigma points. It uses functionality which you will learn later in this chapter; pass by it for now. The key points are we generate 5 points deterministically, pass them through the nonlinear function, and then run the unscented transform on them to generate the estimated mean.
I find this result remarkable. Using only 5 points we were able to compute the mean with amazing accuracy. The error in x is only 0.054, and the error in y is 0.408. In contrast, a linearized approach (used by the predominant EKF filter) gave an error of over 43 in y. I told you to ignore the code to generate the sigma points, but if you look at it you'll see that it has no knowledge of the nonlinear function, only of the mean and covariance of our initial distribution. The same 5 sigma points would be generated if we had a completely different nonlinear function.
I will admit to choosing a nonlinear function that makes the performance of the unscented tranform striking compared to the EKF. But the physical world is filled with very nonlinear behavior, and the UKF takes it in stride. You will see in the next chapter how more traditional techniques struggle with strong nonlinearities. This graph is the foundation of why I advise you to use the UKF or similar modern technique whenever possible.
The UKF belongs to a family of filters called sigma point filters. By the end of this chapter you will be prepared to do a literature search and learn about the various sigma point filters that have been invented.
The Unscented Filter
Now we can present the entire Unscented Kalman filter algorithm. As discussed earlier assume that there is a function that performs the state transition for our filter - it predicts the next state given the current state. Also assume there is a measurement function - it takes the current state and computes what measurement that state corresponds to. These are nonlinear forms of the and matrices used by the linear Kalman filter.
Predict Step
As with the linear Kalman filter, the UKF's predict step computes the mean and covariance of the system for the next epoch using the process model. However, the computation is quite different.
First, we generate the weights and sigma points as specified above.
We pass each sigma point through the function f. This projects the sigma points forward in time according to our process model.
Now we compute the predicted mean and covariance using the *unscented transform *on the transformed sigma points. I've dropped the subscript for readability.
This computes the mean and covariance represented by the green ellipse above, and corresponds with the linear Kalman filter equations of
Update Step
Now we can perform the update step of the filter. Recall that Kalman filters perform the update state in measurement space. So, the first thing we do is convert the sigma points from the predict step into measurements using the function that you define.
Now we can compute the mean and covariance of these points using the unscented transform.
The subscript denotes that these are the mean and covariance for the measurements.
All that is left is to compute the residual and Kalman gain. The residual is trivial to compute:
The Kalman gain is not much worse. We have to compute the cross variance of the state and the measurements, which we state without proof to be:
And then the Kalman gain is defined as
Finally, we compute the new state estimate using the residual and Kalman gain:
and the new covariance is computed as:
This step contains a few equations you have to take on faith, but you should be able to see how they relate to the linear Kalman filter equations. We convert the mean and covariance into measurement space, add the measurement error into the measurement covariance, compute the residual and kalman gain, compute the new state estimate as the old estimate plus the residual times the Kalman gain, and then convert both back into state space. The linear algebra is slightly different from the linear Kalman filter, but the algorithm is the same.
Using the UKF
We are now ready to consider implementing an unscented Kalman filter. All of the math is above is already implemented in FilterPy, and you are perhaps a bit lost at this point, so lets launch into solving some problems so you can gain confidence in how easy the UKF actually is. Later we will revisit how the UKF is implemented in Python.
Let's start by solving a problem you already know how to do. Although the UKF was designed for nonlinear problems, it works fine on linear problems. In fact, it is guaranteed to get the same result as the linear Kalman filter for linear problems. We will write a solver for the linear problem of tracking using a constant velocity model in 2D. This will allows us to focus on what is the same (and most is the same!) and what is different with the UKF.
To design a linear Kalman filter you need to design the , , , , and matrices. We have done this many times already so let me present a design to you without a lot of comment. We want a constant velocity model, so we can define to be
With this ordering of state variables we can define our state transition model to be
which implements the Newtonian equation
Our sensors provide position measurements but not velocity, so the measurement function is
Let's say our sensor gives positions in meters with an error of meters in both the x and y coordinates. This gives us a measurement noise matrix of
Finally, let's assume that the process noise can be represented by the discrete white noise model - that is, that over each time period the acceleration is constant. We can use FilterPy
's Q_discrete_white_noise()
method to create this matrix for us, but for review the matrix is
Our implementation might look like this:
This should hold no surprises for you. Now let's implement this filter as an Unscented Kalman filter. Again, this is purely for educational purposes; using a UKF for a linear filter confers no benefit.
The equations of the UKF are implemented for you with the FilterPy
class UnscentedKalmanFilter
; all you have to do is specify the matrices and the nonlinear functions f(x)
and h(x)
. f(x)
implements the state transition function that is implemented by the matrix in the linear filter, and h(x)
implements the measurement function implemented with the matrix . In nonlinear problems these functions are nonlinear, so we cannot use matrices to specify them.
For our linear problem we can define these functions to implement the linear equations. The function f(x)
takes the signature def f(x,dt)
and h(x)
takes the signature def h(x)
. Below is a reasonable implementation of these two functions. Each is expected to return a 1-D NumPy array with the result.
You need to tell the filter how to compute the sigma points and weights. We gave the Van der Merwe's scaled unscented transform version above, but there are many different choices. FilterPy uses a SigmaPoints
class which must implement two methods:
FilterPy provides the class MerweScaledSigmaPoints
, which implements the sigma point algorithm in this chapter.
Other than these two functions, everything else is nearly the same. When you create the UKF you will pass in the two functions and sigma point class object like so:
The rest of the code is the same as for the linear kalman filter. Let's just implement it! I'll use the same measurements as used by the linear Kalman filter, and compute the standard deviation of the difference between the two solution. I will use the batch_filter
method to keep the code succinct; for clarity I put the equivalent iterative loop in a comment. As with all the filters we have done to date you ju
This gave me a standard deviation 0f 0.013 which is quite small.
You can see that there is not much difference in the implementation of the UKF vs linear Kalman filter. We merely replace with the function f(x), and the matrix with the function h(x). The rest of the theory and implementation remains the same. Well, of course under the hood the FilterPy
implementation is quite different than the Kalman filter code, but from a designer's point of view the problem formulation and filter design is very similar.
Tracking a Flying Airplane
First Attempt
Let's tackle our first nonlinear problem. To minimize the number of things that change I will keep the problem formulation very similar to the linear tracking problem above. For this problem we will write a filter to track a flying airplane using a stationary radar as the sensor. To keep the problem as close to the previous one as possible we will track in two dimensions, not three. We will track one dimension on the ground and the altitude of the aircraft. The second dimension on the ground adds no difficulty or different information, so we can do this with no loss of generality.
Radars work by emitting a beam of radio waves and scanning for a return bounce. Anything in the beam's path will reflect some of the signal back to the radar. By timing how long it takes for the reflected signal to get back to the radar the system can compute the slant distance - the straight line distance from the radar installation to the object. We also get the bearing to the target. For this 2D problem that will be the angle above the ground plane. True integration of sensors for applications like military radar and aircraft navigation systems have to take many factors into account which I am not currently interested in trying to cover in this book. So if any practitioners in the the field are reading this they will be rightfully scoffing at my exposition. My only defense is to argue that I am not trying to teach you how to design a military grade radar tracking system, but instead trying to familiarize you with the implementation of the UKF.
For this example we want to take the slant range measurement from the radar and compute the horizontal position (distance of aircraft from the radar measured over the ground) and altitude of the aircraft, as in the diagram below.
We will assume that the aircraft is flying at a constant altitude, so a three variable state vector will work.
Our state transition function is linear
and we can implement that very much like we did for the previous problem.
The next step is to design the measurement function. As in the linear Kalman filter the measurement function converts the filter's state into a measurement. So for this problem we need the position and velocity of the aircraft and convert it to the bearing and range from the radar station.
If we represent the position of the radar station with an (x,y) coordinate computation of the range and bearing is straightforward. To compute the range we use the Pythagorean theorem:
To compute the bearing we need to use the arctangent function.
As with the state transition function we need to define a Python function to compute this for the filter. I'll take advantage of the fact that a function can own a variable to store the radar's position.
Important Note: There is a nonlinearity that we are not considering, the fact that angles are modular. Kalman filters operate by computing the differences between measurements. The difference between and is , but a subtraction of the two values, as implemented by the filter, yields a value of . This is exacerbated by the UKF which computes sums of weighted values in the unscented transform. For now we will place our sensors and targets in positions that avoid these nonlinear regions. Later in the chapter I will show you how to handle this problem.
We need to simulate the Radar station and the movement of the aircraft. By now this should be second nature for you, so I will present the code without much discussion.
Now let's put it all together. A military grade radar system can achieve 1 meter RMS range accuracy, and 1 mrad RMS for bearing [1]. We will assume a more modest 5 meter range accuracy, and 0.5$^\circ$ angular accuracy as this provides a more challenging data set for the filter.
I'll start with the aircraft positioned directly over the radar station, flying away from it at 100 m/s. A typical radar might update only once every 12 seconds and so we will use that for our update period.
This may or may not impress you, but it impresses me! In the Extended Kalman filter chapter we will solve the same problem, but it will take significant amounts of mathematics to handle the same problem.
Tracking Manuevering Aircraft
The previous example produced extremely good results, but it also relied on an assumption of an aircraft flying at a constant speed with no change in altitude. I will relax that assumption by allowing the aircraft to change altitude. First, let's see the performance of the previous code if the aircraft starts climbing after one minute.
the performance is terrible as the filter is completely unable to track the changing altitude. What do we have to change to allow the filter to track the aircraft?
I hope you answered add climb rate to the state, like so:
This requires the following change to the state transition function, which is still linear.
The measurement function stays the same, but we will have to alter Q to account for the state dimensionality change.
We can see that a significant amount of noise has been introduced into the altitude, but we are now accurately tracking the altitude change.
Sensor Fusion
Now let's consider a simple example of sensor fusion. Suppose we have some type of Doppler system that produces a velocity estimate with 2 m/s RMS accuracy. I say "some type" because as with the radar I am not trying to teach you how to create an accurate filter for a Doppler system, where you have to account for the signal to noise ratio, atmospheric effects, the geometry of the system, and so on.
The accuracy of the radar system in the last examples allowed us to estimate velocities to within a m/s or so, so we have to degrade that accuracy to be able to easily see the effect of the sensor fusion. Let's change the range error to 500 meters and then compute the standard deviation of the computed velocity. I'll skip the first several measurements because the filter is converging during that time, causing artificially large deviations.
By incorporating the velocity sensor we were able to reduce the standard deviation from 3.5 m/s to 1.3 m/s.
Sensor fusion is a large topic, and this is a rather simplistic implementation. In a typical navigation problem we have sensors that provide complementary information. For example, a GPS might provide somewhat accurate position updates once a second with poor velocity estimation while an inertial system might provide very accurate velocity updates at 50Hz but terrible position estimates. The strengths and weaknesses of each sensor are orthogonal to each other. This leads to something called the Complementary filter, which uses the high update rate of the inertial sensor with the position accurate but slow estimates of the GPS to produce very high rate yet very accurate position and velocity estimates. This will be the topic of a future chapter.
Multiple Position Sensors
The last sensor fusion problem was somewhat a toy example due to the existence of techniques like the Complementary filter. Let's tackle a problem that is not so toy-like. Before the advent of GPS ships and aircraft navigated via various range and bearing systems such as VOR, LORAN, TACAN, DME, and so on. I do not intend to cover the intricacies of these systems - Wikipedia will fill the basics if you are interested. In general these systems are beacons that allow you to extract either the range, bearing, or range and bearing to the beacon. For example, an aircraft might have two VOR receivers. The pilot tunes each receiver to a different VOR station. Each VOR receiver displays what is called the "radial" - the direction from the VOR station on the ground to the aircraft. Using a chart they can extend these two radials - the intersection point is the position of the aircraft.
That is a very manual and low accuracy approach. We can use a Kalman filter to filter the data and produce far more accurate position estimates. Let's work through that.
The problem is as follows. Assume we have two sensors, each which provides a bearing only measurement to the target, as in the chart below. In the chart the width of the circle is intended to denote a different amount of sensor noise.
We can compute the bearing between a sensor and the target as follows:
So our filter will need to receive a vector of 2 measurements during each update, one for each sensor. We can implement that as:
The design of the measurement function and state transition function can remain the same as nothing has changed that would affect them.
This looks quite good to me. There is a very large error at the beginning of the track, but the filter is able to settle down and start producing good data.
Let's revisit the nonlinearity of the angles. I will position the target between the two sensors with the same y-coordinate. This will cause a nonlinearity in the computation of the sigma means and the residuals because the mean angle will be near zero. As the angle goes below 0 the measurement function will compute a large positive angle of around . The residual between the prediction and measurement will thus be very large, nearly instead of nearly 0. This makes it impossible for the filter to perform accurately, as seen in the example below.
This is untenable behavior for a real world filter. FilterPy
's UKF code allows you to provide it a function to compute the residuals in cases of nonlinear behavior like this, but this requires more knowledge about FilterPy
's implementation than we yet process.
Finally, let's discuss the sensor fusion method that we used. We used the measurement from each bearing and had the Kalman filter's equations compute the world position from the measurements. This is equivalent to doing a weighted least squares solution. In the Kalman Filter Math chapter we discussed the limited accuracy of such a scheme and introduced an Iterative Least Squares (ILS) algorithm to produce greater accuracy. If you wanted to use that scheme you would write an ILS algorithm to solve the geometry of your problem, and pass the result of that calculation into the update()
method as the measurement to be filtered. This imposes on you the need to compute the correct noise matrix for this computed positions, which may not be trivial. Perhaps in a later release of this book I will develop an example, but regardless it will take you a significant amount of research and experiment to design the best method for your application. For example, the ILS is probably the most common algorithm used to turn GPS pseudoranges into a position, but the literature discusses a number of alternatives. Sometimes there are faster methods, sometimes the iterative method does not converge, or does not converge fast enough, or requires too much computation for an embedded system.
Exercise: Track a target moving in a circle
Change the simulated target movement to move in a circle. Avoid angular nonlinearities by putting the sensors well outside the movement range of the target, and avoid the angles and .
Solution
We have a few choices here. First, if we know the movement of the target we can design our filter's state transition function so that it correctly predicts the circular movement. For example, suppose we were tracking a boat race optically - we would want to take track shape into account with our filter. However, in this chapter we have not been talking about such constrained behavior, so I will not build knowledge of the movement into the filter. So my implementation looks like this.
Discussion
The filter tracks the movement of the target, but never really converges on the track. This is because the filter is modeling a constant velocity target, but the target is anything but constant velocity. As mentioned above we could model the circular behavior by defining the fx()
function, but then we would have problems when the target is not moving in a circle. Instead, lets tell the filter we are are less sure about our process model by making larger. Here I have increased the variance from 0.1 to 1.0
The convergence is not perfect, but it is far better.
Exercise: Sensor Position Effects
Is the behavior of the filter invariant for any sensor position? Find a sensor position that produces bad filter behavior.
Solution
We have already discussed the problem of angles being modal, so causing that problem by putting the sensors at y=0
would be a trivial solution. However, let's be more subtle than that. We want to create a situation where there are an infinite number of solutions for the sensor readings. We can achieve that whenever the target lies on the straight line between the two sensors. In that case there is no triangulation possible and there is no unique solution. My solution is as follows.
I put the sensors at the upper left hand side and lower right hand side of the target's movement. We can see how the filter diverges when the target enters the region directly between the two sensors. The state transition always predicts that the target is moving in a straight line. When the target is between the two sensors this straight line movement is well described the bearing measurements from the two sensors so the filter estimate starts to approximate a straight line.
Exercise: Compute Position Errors
The position errors of the filter vary depending on how far away the target is from a sensor. Write a function that computes the distance error due to a bearing error.
Solution
Basic trigonometry gives us this answer.
Discussion
This is an inherent physical limitation that is extremely difficult to deal with when designing filters. We can see that a very small angular error translates into a very large distance error. What is worse, this behavior is nonlinear - the error in the x-axis vs the y-axis will vary depending on the actual bearing. For example, here is a scatter plot that shows the error distribution for a standard deviation in bearing for a bearing.
Exercise: Explain Filter Performance
We can see that for even very small angular errors the (x, y) positional errors are very large. Explain how we got such relatively good performance out of the UKF in the target tracking problems above. Answer this for the one sensor problem as well as the multiple sensor problem.
Solution
This is very important to understand. Try very hard to answer this before reading the answer below. If you cannot answer this you really need to revisit some of the earlier Kalman filter material in the Multidimensional Kalman filter chapter.
There are several factors contributing to our success. First, let's consider the case of having only one sensor. Any single measurement has an extreme range of possible positions. But, our target is moving, and the UKF is taking that into account. Let's plot the results of several measurements taken in a row for a moving target to form an intuition.
We can see that each individual measurement has a very large position error. However, when we plot successive measurements we suddenly have a clear trend - the target is obviously moving towards the upper right. When the Kalman filter (whether a linear KF, an EKF, or UKF) computes the Kalman gain it takes the distribution of errors into account by using the measurement function. In this example the error lies on an approximately degree line, so the filter will discount errors in that direction. On the other hand, there is almost no error in measurement orthogonal to that, and again the Kalman gain will be taking that into account. The end result is the track can be computed even with this sort of noise distribution. Of course, this graph makes it look easy because we have computed 100 possible measurements for each position update and plotted them all. This makes it easy for us to see where the target is moving. In contrast, the Kalman filter only gets 1 measurement per update. Therefore the filter will not be able to generate as good a fit as the dotted green line implies.
The next interaction we must think about is the fact that the bearing gives us no distance information. Suppose in the example above we told the filter that the initial position of the target was 1000km away from the sensor (vs the actual distance of 7.07 km), and that we were highly confident in that guess by setting very small. At that distance a error translates into a positional error of 17.5 km. The KF would never be able to converge onto the actual target position because the filter is incorrectly very certain about its position estimates and because there is no distance information provided in the measurements.
Now let's consider the effect of adding a second sensor. Again, I will start by plotting the measurements so we can form an intuition about the problem.
I placed the sensors nearly orthogonal to the target's initial position so we get these lovely 'x' shape intersections. I then computed the (x, y) coordinate corresponding to the two noisy bearing measurements and plotted them with red dots to show the distribution of the noisy measurements in x and y. We can see how the errors in x and y change as the target moves by the shape the scattered red dots make - as the target gets further away from the sensors, but nearer the y coordinate of sensor B the shape becomes strongly elliptical.
Next I will alter the starting positions and rerun the simulation. Here the shape of the errors in x and y changes radically as the target position changes relative to the two sensors.
Implementation of the UKF
Now let's implement these equations with Python. As you have seen FilterPy already implements this code for you, but it is instructive to learn how to go from equations to code. Plus, if you encounter a problem and need to debug your code you will likely need to step through the code with the debugger.
Implementing the UKF is quite straightforward. First, let's write the code to compute the mean and covariance given the sigma points.
We will store the sigma points and weights in matrices, like so:
In other words, each column contains the sigma points for one dimension in our problem. The sigma point is always the mean, so first row of sigma's contains the mean of each of our dimensions. The second through nth row contains the terms, and the to rows contains the terms. the choice to store the sigmas in row-column vs column row format is somewhat arbitrary; my choice makes the rest of the code a bit easier to code as I can refer to the ith sigma point for all dimensions as sigmas[i]
.
Weights
Computing the weights in numpy is extremely simple. Recall that for the Van der Merwe scaled sigma point implementation
with .
Our code will look something like this.
Sigma Points
The equations for the sigma points are:
The Python for this is not much more difficult once we wrap our heads around the term.
The term has to be a matrix because is a matrix. The subscript is choosing the column vector of the matrix. What is the square root of a matrix? There is no unique definition. A typical definition is that the square root of a matrix is the matrix that, when multiplied by itself, yields .
However there is an alternative definition, and we will chose it because it has numerical properties that makes it much easier for us to compute its value. We can define the square root as the matrix S, which when multiplied by its transpose, returns :
$$\Sigma = SS^\mathsf{T} \\$$This method is frequently chosen in computational linear algebra because this expression is easy to compute using something called the Cholesky decomposition [3]. SciPy provides this with the scipy.linalg.cholesky()
method. If your language of choice is Fortran, C, or C++, libraries such as LAPACK provide this routine. Matlab provides chol()
.
Now let's implement the unscented transform. Recall the equations
We implement the sum of the means with
If you are not a heavy user of NumPy this may look foreign to you. NumPy is not just a library that make linear algebra possible; under the hood it is written in C to achieve much faster speeds than Python can reach. A typical speedup is 100x. To get that speedup we must avoid using for loops, and instead use NumPy's built in functions to perform calculations. So, instead of writing a for loop to compute the sum, we call the built in numpy.dot(x,y)
method. If passed a 1D array and a 2D array it will compute the sum of inner products:
All that is left is to compute
This introduces another feature of NumPy. The state variable is one dimensional, as is , so the difference is also one dimensional. NumPy will not compute the transpose of a 1-D array; it considers the transpose of [1,2,3]
to be [1,2,3]
. So we call the function np.outer(y,y)
which computes the value of for a 1D array .
Predict Step
For the predict step, we will generate the weights and sigma points as specified above. We pass each sigma point through the function f.
Then we compute the predicted mean and covariance using the unscented transform. In the code below you can see that I am assuming that this is a method in a class that stores the various matrices and vectors needed by the filter.
Update Step
The update step converts the sigmas into measurement space via the h(x)
function.
The mean and covariance of those points is computed with the unscented transform. The residual and Kalman gain is then computed. The cross variance is computed as:
Finally, we compute the new state estimate using the residual and Kalman gain:
and the new covariance is computed as:
This function can be implemented as follows, assuming it is a method of a class that stores the necessary matrices and data.
Batch Processing
The Kalman filter is designed as a recursive algorithm - as new measurements come in we immediately create a new estimate. But it is very common to have a set of data that have been already collected which we want to filter. Kalman filters can always be run in a batch mode, where all of the measurements are filtered at once. We have implemented this in UnscentedKalmanFilter.batch_filter()
. Internally, all the function does is loop over the measurements and collect the resulting state and covariance estimates in arrays. It may seem a bit trivial, but you will almost always want to do this if you can for several reasons. First, it will execute a bit quicker than if you implement the loop yourself. Second, the logic in your code will be a bit cleaner, and you will have a reduced chance of bugs. Third, and most importantly, it you batch process your data you can then use an extremely powerful technique to generate far smoother results than you have seen so far. We will present that technique in the next section; here I will show you how to use UnscentedKalmanFilter.batch_filter()
.
All you have to do is collect your measurements into an array or list. Maybe it is in a CSV file, for example.
Or maybe you will generate it using a generator:
Now we call the batch_filter()
method.
The function takes the list/array of measurements, filters it, and returns an array of state estimates (Xs) and covariance matrices (Ps) for the entire data set.
Here is a complete example drawing from the radar tracking problem above.
Smoothing the Results
Briefly, the recursive form of Kalman filtering that we have been using up to now use information from the past to generate the current estimate. Recall the last section, where we used batch processing to filter a collection of measurements. This implies that we know now only the past, but the future! This is a key insight.
Let's assume that we are tracking a car. Suppose we get a noisy measurement that implies that the car is starting to turn to the left, but the state function has predicted that the car is moving straight. The Kalman filter has no choice but to move the state estimate somewhat towards the noisy measurement, as it cannot judge whether this is just a particularly noisy measurement or the true start of a turn.
However, if we have measurements from the future we can often figure out if a turn was made or not. Suppose the subsequent measurements all continue turning left. We can then be sure that the measurement was not very noisy, but instead a true indication that a turn was initiated. On the other hand, if the subsequent measurements continued on in a straight line we would know that the measurement was noisy and should be mostly ignored.
We will not develop the math or algorithm here, I will just show you how to call the algorithm in FilterPy
. The algorithm that we have implemented is called an RTS smoother, after the three inventors of the algorithm: Rauch, Tung, and Striebel.
The routine is UnscentedKalmanFilter.rts_smoother()
. Using it is trivial; we pass in the means and covariances computed from the batch_filter
step, and receive back the smoothed means, covariances, and Kalman gain.
From these charts we can see that the improvement in the position is small, but the improvement in the velocity is good, and spectacular for the altitude. The difference in the position are very small, so I printed the difference between the UKF and the smoothed results for the last 5 points.
Choosing the Sigma Parameters
** author's note: this entire section needs a lot of work. Ignore for now.**
I have found the literature on choosing values for , , and to be rather lacking. Van der Merwe's dissertation contains the most information, but it is not exhaustive. So let's explore what they do.
Van der Merwe suggests using for Gaussian problems, and . So let's start there and vary . I will let minimize the size of the arrays we need to look at and to avoid having to compute the square root of matrices
So what is going on here? We can see that for a mean of 0 the algorithm choose sigma points of 0, 3, and -3, but why? Recall the equation for computing the sigma points:
Here you will appreciate my choice of as it reduces everything to scalars, allowing us to avoid computing the square root of matrices. So, for our values the equation is
So as gets larger the sigma points get more spread out. Let's set it to an absurd value.
We can see that the sigma point spread over 100 standard deviations.If our data was Gaussian we'd be incorporating data many standard deviations away from the mean; for nonlinear problems this is unlikely to produce good results. But suppose our distribution was not Gaussian, but instead had very fat tails? We might need to sample from those tails to get a good estimate, and hence it would make sense to make larger (not 200, which was absurdly large to make the change in the sigma points stark).
With a similar line of reasoning, suppose that our distribution has nearly no tails - the probability distribution looks more like an inverted parabola. In such a case we'd probably want to pull the sigma points in closer to the mean to avoid sampling in regions where there will never be real data.
Now let's look at the change in the weights. When we have the weights were 0.6667 for the mean, and 0.1667 for the two outlying sigma points. On the other hand, when the mean weight shot up to 0.99999 and the outlier weights were set to 0.000004. Recall the equations for the weights:
We can see that as gets larger the fraction for the weight of the mean () approaches 1, and the fraction for the weights of the rest of the sigma points approaches 0. This is invariant on the size of your covariance. So as we sample further and further away from the mean we end up giving less weight to those samples, and if we sampled very close to the mean we'd give very similar weights to all.
However, the advice that Van der Merwe gives is to constrain in the range . He suggests as a good value. Lets try that.
I must admit to not fully understanding this advice.
Robot Localization - A Fully Worked Example
It is time to try a real problem. I warn you that this is far from a simple problem. However, most books choose simple, textbook problems with simple answers, and you are left wondering how to implement a real world solution.
We will consider the problem of robot localization. In this scenario we have a robot that is moving through a landscape with sensors that give range and bearings to various landmarks. This could be a self driving car using computer vision to identify trees, buildings, and other landmarks. Or, it might be one of those small robots that vacuum your house. It could be a search and rescue device meant to go into dangerous areas to search for survivors. It doesn't matter too much.
Our robot is wheeled, which means that it manuevers by turning it's wheels. When it does so, the robot pivots around the rear axle while moving forward. This is nonlinear behavior which we will have to account for. The robot has a sensor that gives it approximate range and bearing to known targets in the landscape. This is nonlinear because computing a position from a range and bearing requires square roots and trigonometry.
Robot Motion Model
At a first approximation n automobile steers by turning the front tires while moving forward. The front of the car moves in the direction that the wheels are pointing while pivoting around the rear tires. This simple description is complicated by issues such as slippage due to friction, the differing behavior of the rubber tires at different speeds, and the need for the outside tire to travel a different radius than the inner tire. Accurately modelling steering requires an ugly set of differential equations. For Kalman filtering, especially for lower speed robotic applications a simpler bicycle model has been found to perform well.
I have depicted this model above. Here we see the front tire is pointing in direction . Over a short time period the car moves forward and the rear wheel ends up further ahead and slightly turned inward, as depicted with the blue shaded tire. Over such a short time frame we can approximate this as a turn around a radius . If you google bicycle model you will find that we can compute the turn angle with
and the turning radius R is given by
where the distance the rear wheel travels given a forward velocity is .
If we let be our current orientation then we can compute the position before the turn starts as
After the move forward for time the new position and orientation of the robot is
Once we substitute in for we get
You don't really need to understand this math in detail, as it is already a simplification of the real motion. The important thing to recognize is that our motion model is nonlinear, and we will need to deal with that with our Kalman filter.
Design the State Variables
For our robot we will maintain the position and orientation of the robot:
I could include velocities into this model, but as you will see the math will already be quite challenging.
Our control input is the velocity and steering angle
Design the System Model
In general we model our system as a nonlinear motion model plus noise.
Using the motion model for a robot that we created above, we can write:
Now we can use this function to implement the function f(x)
which implements the system model.
Design the Measurement Model
Now we need to design our measurement model. For this problem we are assuming that we have a sensor that receives a noisy bearing and range to multiple known locations in the landscape. The measurement model must convert the state into a range and bearing to the landmark. Using be the position of a landmark, the range is
We assume that the sensor provides bearing relative to the orientation of the robot, so we must subtract the robot's orientation from the bearing to get the sensor reading, like so:
Thus our function is
I will not implement this in Python yet as there is a difficulty that will be discussed in the Implementation section below.
Design Measurement Noise
This is quite straightforward as we need to specify measurement noise in measurement space, hence it is linear. It is reasonable to assume that the range and bearing measurement noise is independent, hence
Implementation
Before we begin coding the main loop we have another issue to handle. The residual is notionally computed as but this will not work because our measurement contains an angle in it. Suppose z has a bearing of and has a bearing of . Naively subtracting them would yield a bearing difference of . this will throw off the computation of the Kalman gain because the correct angle difference in this case is . So we will have to write code to correctly compute the bearing residual.
The state vector has the bearing at position 2, but the measurement vector has it at position 1, so we need to write functions to handle each. Furthermore, the function for the residual in the measurement will be passed an array of several measurements, one per landmark. These will be passed into the UnscentedKalmanFilter.__init__()
function.
The modularity of angles kept me from implementing the measurement model. The equation is
The expression can produce a result outside the range , so we should normalize the angle to that range.
The function will be passed an array of landmarks and needs to produce an array of measurements in the form [dist_to_1, bearing_to_1, dist_to_2, bearing_to_2, ...]
.
Our difficulties are not over. The unscented transform computes the average of the state and measurement vectors, but each contains a bearing. There is no unique way to compute the average of a set of angles. For example, what is the average of 359$^\circ^\circ^\circ^\circ$.
One common approach is to take the arctan of the sum of the sins and cosines.
We have not used this feature yet, but the UnscentedKalmanFilter.__init__()
method has an argument x_mean_fn
that allows you to provide a function which compute the mean for the state, and z_mean_fn
for a function which computes the mean of the measurement. We will code these function as:
These functions take advantage of the fact that NumPy's trigometric functions operate on arrays, and dot
performs element-wise multiplication. NumPy is implemented in C and Fortran, so sum(dot(sin(x), w))
is much faster than writing the equivalent loop in Python. Once you are used to seeing it, I think it is as readable as the loop form.
With that done we are now ready to implement the UKF. You've seen this sort of code many times, so I will not describe it in detail. There is one new thing here that is important to discuss. When we construct the sigma points and filter we have to provide it the functions that we have written to compute the residuals and means.
The rest of the code runs the simulation and plots the results, and shouldn't need too much comment by now. I create a variable landmarks
that contains the coordinates of the landmarks. I update the simulated robot position 10 times a second, but run the EKF only once. This is for two reasons. First, we are not using Runge Kutta to integrate the differental equations of motion, so a narrow time step allows our simulation to be more accurate. Second, it is fairly normal in embedded systems to have limited processing speed. This forces you to run your Kalman filter only as frequently as absolutely needed.
The rest of the code runs the simulation and plots the results, and shouldn't need too much comment by now. I create a variable landmarks
that contains the coordinates of the landmarks. I update the simulated robot position 10 times a second, but run the EKF only once. This is for two reasons. First, we are not using Runge Kutta to integrate the differental equations of motion, so a narrow time step allows our simulation to be more accurate. Second, it is fairly normal in embedded systems to have limited processing speed. This forces you to run your Kalman filter only as frequently as absolutely needed.
Steering the Robot
While the robot was turning in the example above, we really can't say it was being driven realistically. The velocity and steering angles never changed, which doesn't pose much of a problem for the Kalman filter. We could implement a complicated PID controlled robot simulation, but I will just generate varying steering commands using NumPy's linspace
method. I'll also add more landmarks as the robot will be travelling much further than in the first example.
You can see that the uncertainty becomes very small very quickly. The covariance ellipses are displaying the covariance, yet the ellipses are so small they are hard to see. We can force some error into the answer by only supplying two landmarks near the start point. When we run this filter the errors increase as the robot gets further from the landmarks.
Discussion
Your impression of this chapter probably depends on how many nonlinear Kalman filters you have implemented in the past. If this is your first exposure perhaps the computation of sigma points and the subsequent writing of the and function struck you as a bit finicky. Indeed, I spent more time than I'd care to admit getting everything working. On the other hand, if you have implemented an EKF or linearized Kalman filter you are perhaps bouncing gleefully in your seat. There is a small amount of tedium in writing the functions for the UKF, but the concepts are very basic. As you will see in the next chapter the EKF for the same problems requires some fairly difficult mathematics. In fact, for many problems we cannot find a closed form solution for the equations of the EKF, and we must retreat to some sort of iterated solution.
However, the advantage of the UKF over the EKF is not only the relative ease of implementation. It is somewhat premature to discuss this because you haven't learned the EKF yet, but the EKF linearizes the problem at one point and passes that point through a linear Kalman filter. In contrast, the UKF takes samples. Therefore the UKF is almost always more accurate than the EKF, especially when the problem is highly nonlinear. While it is not true that the UKF is guaranteed to always outperform the EKF, in practice it has been shown to perform at least as well, and usually much better than the EKF.
Hence my recommendation is to always start by implementing the UKF. If your filter has real world consequences if it diverges (people die, lots of money lost, power plant blows up) of course you will have to engage in a lot of sophisticated analysis and experimentation to chose the best filter. That is beyond the scope of this book, and you should be going to graduate school to learn this theory in much greater detail than this book provides.
I have spoken of the UKF I presented in this chapter as the way to perform sigma point filters. This is not true. The specific version I chose is Julier's scaled unscented filter as parameterized by Van der Merwe in his 2004 dissertation. If you search for Julier, Van der Merwe, Uhlmann, and Wan you will find a family of similar sigma point filters that they developed. Each technique uses a different way of choosing and weighting the sigma points. But the choices don't stop there. For example, there is the SVD Kalman filter that uses singular value decomposition (SVD) to find the approximate mean and covariance of the probability distribution. There are many more. Think of this chapter as an introduction to the sigma point filters, rather than a definitive treatment of how they work. If you have been reading carefully and writing your own code you should be able to read the literature and implement your own filters without needing FilterPy, which does not implement every possible sigma point filter.
References
[1] Rudolph Van der Merwe. "Sigma-Point Kalman Filters for Probabilistic Inference in Dynamic State-Space Models" dissertation (2004).
[2] Simon J. Julier. "The Scaled Unscented Transformation". Proceedings of the American Control Conference 6. IEEE. (2002)
[1] http://www.esdradar.com/brochures/Compact Tracking 37250X.pdf
[2] Julier, Simon J.; Uhlmann, Jeffrey "A New Extension of the Kalman Filter to Nonlinear Systems". Proc. SPIE 3068, Signal Processing, Sensor Fusion, and Target Recognition VI, 182 (July 28, 1997)
[3] Cholesky decomposition. Wikipedia. http://en.wikipedia.org/wiki/Cholesky_decomposition