Least squares is an old and classic problem, but understanding it is essential background for the Kalman filter. In this post, we revisit:

1. deterministic least-squares
2. stochastic least-mean-squares

## 1. Deterministic least-squares

Problem statement: Let $x\in \mathbb{R}^{n}, y\in \mathbb{R}^m, A\in \mathbb{R}^{m\times n}$. Find $\hat{x}$ that minimizes

$J=\|y-A\hat{x}\|^2$

Solution: I will not reproduce the derivation in detail; see ‘Linear Estimation’ for a thorough discussion. Here I only want to highlight a few points.

1. The optimal estimate of $x$ is a solution to the normal equation $A^TA\hat{x}=A^Ty$, whether or not $A$ has full rank.
2. Least-squares is usually applied to an inconsistent overdetermined system $Ax\cong y$. In that case $A$ has full column rank, $A^TA$ is invertible, and the solution is unique.
3. More importantly, the problem is not restricted to inconsistent overdetermined systems: the matrix $A$ can be arbitrary! Whether $A$ has full rank or not, every minimizer satisfies the normal equation. The only difference when $A$ is rank deficient is that there are infinitely many minimizers $\hat{x}$, but they all achieve the same minimum $J$. Another interesting fact: among the infinitely many solutions, the one with minimum norm is $\hat{x}=(A^TA)^{+}A^Ty$.
4. The normal equation can be derived in two ways: set the derivative of $J$ with respect to $\hat{x}$ to zero, or use completion of squares.
5. Why $A^TA\hat{x}=A^Ty$ is called the normal equation: it can be rewritten as $A^T(A\hat{x}-y)=0$, which says the residual $A\hat{x}-y$ is orthogonal (normal) to the range space of $A$.
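As a quick numerical sketch of the points above (the matrices and data here are made up for illustration), the normal-equation solution matches NumPy's `lstsq` in the full-rank case, and in the rank-deficient case the pseudoinverse formula still satisfies the normal equation and the orthogonality property:

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined, full-column-rank case: unique solution.
A = rng.standard_normal((6, 3))
y = rng.standard_normal(6)
x_normal = np.linalg.solve(A.T @ A, A.T @ y)      # solve A^T A x = A^T y
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(x_normal, x_lstsq)

# Rank-deficient case: infinitely many minimizers, all with the same J.
B = np.column_stack([A[:, 0], A[:, 0], A[:, 1]])  # repeated column -> rank 2
x_min_norm = np.linalg.pinv(B.T @ B) @ B.T @ y    # (B^T B)^+ B^T y
# The normal equation still holds for the minimum-norm minimizer:
assert np.allclose(B.T @ B @ x_min_norm, B.T @ y)
# The residual B x - y is orthogonal (normal) to range(B):
assert np.allclose(B.T @ (B @ x_min_norm - y), 0)
```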

## 2. Stochastic least-mean-squares

Problem statement: Let $x\in \mathbb{R}^{n}$ be a random vector and let $y\in \mathbb{R}^m$ be a vector of measurements. We seek a linear estimator $\hat{x}=Ky$ that minimizes

$E\big[(x-\hat{x})(x-\hat{x})^T\big]$

Solution: Refer to ‘Linear Estimation’, p. 80, for details. Here I want to highlight some points:

1. Despite the similar names, stochastic least-mean-squares is quite a different problem from deterministic least-squares.
2. The optimal gain $K^*$ is the solution to the normal equation $K^*R_y=R_{xy}$, where $R_y=E[yy^T]$ and $R_{xy}=E[xy^T]$.
3. Note that in seeking $\hat{x}=Ky$ we are constructing an estimator, and a linear one, since it is a linear function of $y$. In general, the estimator could also be a nonlinear function of $y$.
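To make the normal equation $K^*R_y=R_{xy}$ concrete, here is a minimal sketch under an assumed linear measurement model $y=Hx+v$ (the model, dimensions, and covariances are my own illustration, not from the post): the gain computed from the model's second-order statistics agrees with the gain estimated from samples.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed linear model y = H x + v, with x and v independent and zero mean.
n, m, N = 2, 3, 200_000
H = rng.standard_normal((m, n))
R_x = np.eye(n)           # Cov(x)
R_v = 0.5 * np.eye(m)     # Cov(v)

# Second-order statistics implied by the model:
R_y = H @ R_x @ H.T + R_v          # E[y y^T]
R_xy = R_x @ H.T                   # E[x y^T]
K_opt = R_xy @ np.linalg.inv(R_y)  # solves K R_y = R_xy

# Monte Carlo check: the sample-based gain converges to K_opt.
x = rng.standard_normal((N, n))
v = rng.multivariate_normal(np.zeros(m), R_v, size=N)
y = x @ H.T + v
K_hat = (x.T @ y / N) @ np.linalg.inv(y.T @ y / N)
assert np.allclose(K_hat, K_opt, atol=0.05)
```

The sample-based gain is exactly the empirical version of $R_{xy}R_y^{-1}$, which is why it approaches $K^*$ as the number of samples grows.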