In Recursive Least Squares, new points appear all the time, and we need to adjust our plane to fit them. (Multiple Linear Regression, MathWorks)

Recursive Least Squares

Exploring Recursive Least Squares (RLS) and using the Sherman-Morrison-Woodbury Formula and Python

--

The mathematics here should be approachable for anyone who has completed an introductory linear algebra course. For those just looking for the code implementation, visit the GitHub repository here.

The Normal Equations for Least Squares are ubiquitous and for good reason. Apart from very large datasets, the Normal Equations are easy to use, easily generalizable to datasets with multiple variables, and easy to remember.

You could be woken by sirens at four in the morning and still be able to recite the Normal Equation flawlessly.

If you are familiar with the Normal Equations but want the full picture of what they mean and how they arise from the idea of vector projections, I wrote the perfect article for you a month ago.

The need for an alternate formula arises when dealing with a dataset that is continuously increasing in size.

The Need For Recursive Least Squares

When solving for x, finding the inverse of A transpose A is an expensive computation. With a large matrix A, this becomes a serious bottleneck and is one of the reasons why the Normal Equations are generally reserved for smaller datasets (on the order of 10³ rows, i.e. fewer than about 10,000 data points).
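For reference, here is roughly what that batch computation looks like in NumPy (a minimal sketch with made-up shapes, not code from the article):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5000, 4))   # m x n data matrix (hypothetical sizes)
b = rng.normal(size=5000)        # m observed outputs

# Normal Equations: x = (A^T A)^{-1} A^T b
# (in practice np.linalg.solve is preferred over forming the inverse explicitly)
AtA_inv = np.linalg.inv(A.T @ A)
x = AtA_inv @ A.T @ b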

Say we’re in a situation where we calculate least squares using the Normal Equations, add one more data point to our dataset, and want to see what our new line is. It would be inconvenient and seemingly redundant to use the Normal Equations again and recalculate that nightmarish A transpose A inverse term, just to include one new data point.

This situation comes up much more than you think.

The concept of online algorithms becomes relevant in dynamic systems, like tracking a satellite. Berkeley Lab

If we’re tracking the movement of something ‘online’, and want to model its movement continuously, we have to continually adjust our model to new data. A common example is that of a satellite. If we are tracking its position using linear regression, we want to continuously update the model as new data points (the satellite’s x, y, z coordinates) flow in at each time interval t, and we want to do it quickly.

How can we build an algorithm that avoids the pesky A transpose A inverse term?

The New Normal Equation

This explanation follows (with a bit more detail added) Justin Romberg’s Georgia Tech notes on Streaming Least Squares. You can see the original notes here; they also include a proof of the Sherman-Morrison-Woodbury formula.

Let’s say we add a new row of data, v, to our data matrix A. That means we also need to add a corresponding output to our answers vector b.

We can create a block matrix and add on our new datapoint, A1, to the bottom of all our old data, A0. I am not specifying if A1 is a vector or not since it is possible that we are updating our equation with more than just one new data point (maybe 2, or 10, or anything — it’s still cheaper than redoing everything!).

We then append the corresponding outputs to b; call this new block b1. If we are only adding a single row, b1 is a single number, but if we are adding multiple rows, it will be a vector.

Figure 1
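In symbols, the stacked system described above (my reconstruction of the block matrices, writing the new outputs as b_1) is:

\begin{bmatrix} A_0 \\ A_1 \end{bmatrix} x = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix}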

From now on, I’ll call the number of data points we add M (alongside the usual m and n for the initial dataset). If we continue with our block matrix example, we end up with the following matrix equation:

Figure 2
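Written out, the least squares solution for the stacked system should read:

x_1 = \left( \begin{bmatrix} A_0 \\ A_1 \end{bmatrix}^T \begin{bmatrix} A_0 \\ A_1 \end{bmatrix} \right)^{-1} \begin{bmatrix} A_0 \\ A_1 \end{bmatrix}^T \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} = \left( A_0^T A_0 + A_1^T A_1 \right)^{-1} \left( A_0^T b_0 + A_1^T b_1 \right)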

We can then divide this problem into a few chunks, starting with the left term. Inside this inverse, we see that the first term deals with the initial A and the second deals with the added part. We can split these up into P’s.

Figure 3
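In this notation (my reconstruction of what Figure 3 defines):

P_0 = A_0^T A_0, \qquad P_1 = P_0 + A_1^T A_1, \qquad x_1 = P_1^{-1} \left( A_0^T b_0 + A_1^T b_1 \right)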

The reason for this funny organization is that we can easily apply the Sherman-Morrison-Woodbury Formula, which speeds up our computation of the P1 inverse term; we can then just multiply that by our right-hand side.

You can imagine that this pattern of P’s would continue when adding even more data — our entire ‘P1’ right now would be stuffed inside another inverse P2 which has an added A2 transpose A2 term. This repetition is the key to the algorithm. But first, the Sherman-Morrison-Woodbury Formula.

The Sherman-Morrison-Woodbury Formula

Sometimes called the Matrix Inversion Lemma, the Sherman-Morrison-Woodbury formula tells us how the inverse of a matrix changes when we change the matrix itself. Most graphics are from the notes I mentioned.

Figure 4 Georgia Tech / Justin Romberg
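For reference, the standard statement of the formula, in the same W, X, Y, Z naming used in the notes:

\left( W + XYZ \right)^{-1} = W^{-1} - W^{-1} X \left( Y^{-1} + Z W^{-1} X \right)^{-1} Z W^{-1}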

Note that we can make X, Y, or Z the identity if we’d like, and keep things a little simpler. The reason why the formula includes an X, Y, and Z is that we can use this to perturb W by any matrix.

I won’t prove it here, or really give much background, as we’re using it as a tool for our application. Here’s a good resource for the proof.

You might know where we’re going with this. Using this formula, we only need to calculate the inverse of W once, and can base all later calculations on perturbations of it via this formula.

In the context of our problem, we can set W to our original A transpose A matrix (that we can’t avoid calculating) and then perturb it by our new entries and see how it changes the overall inverse. Then, we can multiply this by the right non-inverse term, and get our x for that new updated A.

By substituting the terms into the S-M-W equation as follows, we can perturb the matrix and solve for the P1 inverse term using only the inverse of P0.

Figure 5, with all dimensions added for clarity. Georgia Tech / Justin Romberg
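Spelled out, the substitution should be W = P_0, X = A_1^T, Y = I (the M × M identity), and Z = A_1, which gives:

P_1^{-1} = \left( P_0 + A_1^T A_1 \right)^{-1} = P_0^{-1} - P_0^{-1} A_1^T \left( I + A_1 P_0^{-1} A_1^T \right)^{-1} A_1 P_0^{-1}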

This might look a lot more convoluted than just re-calculating A transpose A inverse with the new matrix A, but it’s actually a lot cheaper to compute. The A1’s inside the inverse will usually be much smaller than the original A’s (since A1 is just the new batch, often only a single row).

If we are just adding a new row, the dimensions work out so that the A1 P0⁻¹ A1ᵀ term in the brackets becomes a scalar (and the identity matrix is just the number 1). Taking the inverse of this is just dividing 1 by the term inside, which is computationally cheap.
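For a single new row a (so that A_1 = a^T), this collapses to the classic Sherman-Morrison rank-one update:

P_1^{-1} = P_0^{-1} - \frac{P_0^{-1} a \, a^T P_0^{-1}}{1 + a^T P_0^{-1} a}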

Even if A1 has a few rows in it, this will always be cheaper than recomputing an Anew transpose Anew matrix and inverting it from scratch. The other terms outside the inverse are also cheap, since we have already calculated P0⁻¹ (the inverse of the original A0 transpose A0).
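As a rough numerical sanity check (my own sketch with made-up shapes, not from the original notes), we can confirm that the perturbed inverse matches a direct recomputation while only ever solving a small M × M system:

import numpy as np

rng = np.random.default_rng(0)
A0 = rng.normal(size=(1000, 5))   # original data
A1 = rng.normal(size=(3, 5))      # small new batch (M = 3)

P0_inv = np.linalg.inv(A0.T @ A0)

# S-M-W update: only a 3 x 3 system has to be solved
S = np.eye(A1.shape[0]) + A1 @ P0_inv @ A1.T
P1_inv_smw = P0_inv - P0_inv @ A1.T @ np.linalg.solve(S, A1 @ P0_inv)

# direct recomputation, for comparison only
P1_inv_direct = np.linalg.inv(A0.T @ A0 + A1.T @ A1)

print(np.allclose(P1_inv_smw, P1_inv_direct))   # should print True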

From this, we can derive an update equation that refreshes x every time a new (M × n)-sized batch of data Ak is supplied to the algorithm.

The Recursive Least Squares Algorithm

Recall our substituted Sherman-Morrison-Woodbury formula that we just defined, this time without all the specific dimension notation.

Figure 6 Georgia Tech / Justin Romberg

Remember that this is just half of the formula for x1. We still need to multiply by the right-hand-side term from the full equation in Figure 3.

So, we can complete our formula by tacking this final term on at the end, solving for the updated x vector.
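Putting the two pieces together, the complete update should read:

x_1 = \left[ P_0^{-1} - P_0^{-1} A_1^T \left( I + A_1 P_0^{-1} A_1^T \right)^{-1} A_1 P_0^{-1} \right] \left( A_0^T b_0 + A_1^T b_1 \right)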

That’s the complete formula for the new x without explicitly calculating that pesky Anew transpose Anew inverse. To generalize this to any update Ak, simply replace the 0’s with k-1 and the 1’s with k. In practice, we only really need 0 and 1: when we want to add a new batch of data, we calculate P1, set it as the new “P0”, and then solve for the next “P1” (really P2) using the same function. We never deal with adding two batches of new data at the same time (e.g. adding A2 and A3 to A simultaneously), since we could just combine those into one matrix A2.

To speed this up further, is there any way we can express x1 as some sort of update of x0? Instead of fully recalculating x1, can we do something like x1 = x0 + _____ to help create a generalized algorithm for updating x?

To express our updated function x1 as some sort of function of x0, we need to factor our matrix further. Let’s do just that.

Figure 8.
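One way to write this step (my reconstruction): since x_0 = (A_0^T A_0)^{-1} A_0^T b_0, we have A_0^T b_0 = A_0^T A_0 \, x_0, so

x_1 = P_1^{-1} \left( A_0^T A_0 \, x_0 + A_1^T b_1 \right)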

We can actually decompose this series of matrix operations further. Using our original definition of P1 (which was just A0 transpose A0 + A1 transpose A1), we can rearrange this to substitute P0 inverse.

Figure 9.
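Substituting A_0^T A_0 = P_1 - A_1^T A_1, this should become:

x_1 = P_1^{-1} \left( P_1 x_0 - A_1^T A_1 x_0 + A_1^T b_1 \right)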

Now, we can distribute the P1 inverse across the three terms in the bracket.

Figure 10.
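Distributing P_1^{-1} gives:

x_1 = x_0 - P_1^{-1} A_1^T A_1 x_0 + P_1^{-1} A_1^T b_1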

Now we’ve successfully created x1 as an update of x0. We can factor and clean up the two terms on the right, and create a new matrix K to keep things tidy.

Figure 11.
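With K = P_1^{-1} A_1^T, this tidies up to:

x_1 = x_0 + K \left( b_1 - A_1 x_0 \right)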

That’s all we need to do to calculate our new updated x1 as an update of x0. We first calculate our P for that layer, then calculate our K, then calculate our update for x. The algorithm for updating x as we continuously add data Ak to the A matrix is summarized nicely below.

Figure 12 Georgia Tech / Justin Romberg
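In compact form, the update for each new batch (A_k, b_k) is (my summary of the algorithm in Figure 12):

P_k = P_{k-1} + A_k^T A_k \quad \text{(with } P_k^{-1} \text{ obtained from } P_{k-1}^{-1} \text{ via S-M-W)}
K_k = P_k^{-1} A_k^T
x_k = x_{k-1} + K_k \left( b_k - A_k x_{k-1} \right)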

Now that we understand the algorithm for recursive least squares, we can code it in Python by creating a RecursiveLeastSquares() class.

Coding Recursive Least Squares in Python

Coding RLS in Python is not too hard after understanding the formula for it. I link to my GitHub code at the bottom of the document, so here I’ll keep to a simplified version of the code.

Instead of creating a function RecursiveLeastSquares(), I decided to make a class, since it needs to keep track of several local variables, like P, K, x, and others, which it can then update after getting more data.

Let’s define our init function, which initializes all the variables we will need when using S-M-W to update the model as new data comes in. All these initial values will be quickly replaced by newer P’s, K’s, x’s, A’s, and b’s.

import numpy as np

class RecursiveLeastSquares():
    def __init__(self, initialA, initialb):
        self.A = initialA
        self.b = initialb

        # do the A0 transpose A0 inverse calculation for the initial P
        self.P = np.linalg.inv(self.A.T @ self.A)
        self.K = None
        # do least squares on the initial A, b to get the initial x
        self.x = self.P @ self.A.T @ self.b

Now, we define the addData function, which lets us add M more data points, concatenated to the bottom of our A matrix, along with the corresponding outputs concatenated to b.

    def addData(self, newA, newb):
        # concatenate the new batch onto the stored A and b
        self.A = np.vstack([self.A, newA])
        self.b = np.concatenate([self.b, newb])

        # use S-M-W to update P (P always holds the current (A^T A)^-1)
        S = np.eye(newA.shape[0]) + newA @ self.P @ newA.T
        self.P = self.P - self.P @ newA.T @ np.linalg.solve(S, newA @ self.P)

        # gain matrix K, then update x using the new residual
        self.K = self.P @ newA.T
        self.x = self.x + self.K @ (newb - newA @ self.x)

Note that in my GitHub code I segment the self.P calculation into five separate lines, as it would be quite long and unreadable otherwise. I give a few details on what exactly each line does in the README.md on the GitHub repository, which is here.
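As a quick sanity check of the class sketched above (my own example with made-up data, not the test from the repository), the recursively updated x should match NumPy’s batch solver:

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(500, 3))
b = A @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=500)

# fit on the first 100 rows, then stream in the rest in batches of 50
rls = RecursiveLeastSquares(A[:100], b[:100])
for start in range(100, 500, 50):
    rls.addData(A[start:start + 50], b[start:start + 50])

x_batch, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(rls.x, x_batch))   # should print True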

Thanks for reading, see you soon.

Adam Dhalla is a high school student out of Vancouver, British Columbia. He is fascinated with the outdoor world, and is currently learning about emerging technologies for an environmental purpose. Visit his website at adamdhalla.com.

Follow his Instagram and his LinkedIn. For more similar content, subscribe to his newsletter here.
