The Simple Logic Behind the $\widehat\beta$ of the OLS

The (in)famous OLS estimator, \(\widehat\beta = (X^{T}X)^{-1}X^{T}y\). When people see this formula for the first time, the first reaction is to assume that it came from complex mathematical proofs, so it is OK if it’s not very intuitive. While it is true that a quite ingenious proof (aka Gauss-Markov) shows why the estimator is the “best”, this formula actually comes from quite a simple and intuitive logic.

No matrices, only scalars

Let’s, for the moment, forget about the matrices. Let’s say \(y = x * \beta\). We know \(x\) and \(y\) and we want to know \(\beta\). For this purpose, we multiply both sides of the equation by \(x^{-1}\). Put differently, then, \(x^{-1}*y = x^{-1}*x*\beta \implies x^{-1}*y=\beta \implies \beta = y/x\).

Special case: X as an invertible matrix

Before we move on to the more realistic, general, and common case where the feature matrix X can be non-invertible, let’s consider the case where X is invertible. The simplicity of the invertible matrices is that they, at least in the current case, can behave like scalars.

In this case, \(\mathbf{y}\) is a vector with, say, n elements, so y is a \(nx1\) matrix (a column vector with n elements) and \(\mathbf{X}\) is a \(nxn\) matrix (note that following the convention, we write matrices in bold). We still do not know \(\mathbf{\beta}\), and we know \(\mathbf{X}\) and \(\mathbf{y}\) (note that \(\mathbf{\beta}\) is also bold because for each feature, there is a different \(\beta\), so it has also become a vector). How do we find \(\mathbf{\beta}\) given \(\mathbf{y} = \mathbf{X} * \mathbf{\beta}\) ?

The first thing that would come to mind if we mistakenly thought that we are still in the real of scalars would be to divide \(\mathbf{y}\) by \(\mathbf{X}\). Unfortunately, that would not work, but we can use the matrix equivalent of the division operation; i.e. multiplying \(\mathbf{y}\) from the left by the inverse of \(\mathbf{X}\). So, we would get \(\mathbf{\beta} = \mathbf{X^{-1}} * \mathbf{y}\).

General case: X can be a non-invertible matrix

99.99% of the times, X would not be a square matrix. In fact, most of the times, X will have more observations (rows) than features/predictors (columns).1 For a more intuitive grasp, let’s say we have collected sufficiently many observations, but we could only record a small number of features. Put concretely, let’s say that, \(\mathbf{y}\) has 100 determinants. If we knew all of these 100 determinants, we could always know the value of \(\mathbf{y}\) with perfect accuracy. But, in reality, we know only 5 of them. So, our feature matrix \(\mathbf{X}\) is \(100x5\) matrix instead of \(100x100\).

In this case, we can still try to guess what \(\mathbf{y}\) could be given the known 5 determinants. So, now, the formula became \(\mathbf{y} = \mathbf{X} * \mathbf{\beta} + \varepsilon\) . The last term is the error term. It is there because we do not know the other 95 determinants. But, for the moment, let’s ignore it and continue as if we still have \(\mathbf{y} = \mathbf{X} * \mathbf{\beta}\).

Because \(\mathbf{X}\) is not a square matrix, we cannot just take its inverse. However, what we can do is produce a square and invertible matrix using only \(\mathbf{X}\). The most straightforward way to do it is to multiply it from the left with its transpose. If \(\mathbf{X}\) is a \(100x5\) matrix, then \(\mathbf{X^{T}}*\mathbf{X}\) is a \(5x5\) matrix. So, we have \(\mathbf{X^{T}}*y = \mathbf{X^{T}}*\mathbf{X}*\beta\).

Still, \(\mathbf{\beta}\) is not alone, but now, we can take the inverse of what comes before \(\mathbf{\beta}\). Let’s do that:

\[(\mathbf{X^{T}}*\mathbf{X})^{-1}\mathbf{X^{T}}*y = (\mathbf{X^{T}}*\mathbf{X})^{-1}\mathbf{X^{T}}*\mathbf{X}*\beta\]

\[ \implies (\mathbf{X^{T}}*\mathbf{X})^{-1}\mathbf{X^{T}}*y = \beta\]

Voila!! We just arrived at the OLS estimator. Then, all this weird-looking formula is doing is trying to get a matrix that is invertible in front of \(\mathbf{\beta}\) and then leave \(\mathbf{\beta}\) alone.

The beauty of the formula is that it can be directly applied in the special cases as well:

If it is invertible, then this formula will produce \(\mathbf{X^{-1}}*y = \beta\), because individual elements in the parenthesis are also invertible. So, the formula, in the special case becomes \(\mathbf{X^{-1} * X^{T^{-1}} * X^{T} *y = \beta}\) and the transposed terms cancel each other out and yield \(\mathbf{X^{-1}*y = \beta}\).

If it is a scalar, because the transpose of a scalar is itself, the formula becomes \((1/x^{2}) * x * y = y/x = \beta\).

Let’s see these in action

Now, we will have two hats. When we wear the hat of the “Gods of Olympus”, we will know the underlying data generating process, read the truth. In fact, we will decide on the truth. When we wear the hat of the “researcher”, we will not know the truth, but we will see \(y\) or \(\mathbf{y}\), and \(x\) or \(\mathbf{x}\).

The case of scalar:
#Wearing the hat of the "Gods of Olympus"
x <- rnorm(n=1)
true_beta <- 1
y <- x*true_beta

#Wearing the hat of the "researcher"

estimated_beta <- as.numeric(solve(t(x)%*%x) %*% t(x) %*% y)

all.equal(true_beta, estimated_beta)
## [1] TRUE

Our researcher, though certainly not a god, managed to obtain the truth with the OLS estimator formula even though the formula is for matrices (well, this happens thanks to R; because it understands our intention). Now, what if our researcher used R and was somewhat careless and did not pay attention to the fact that \(y\) and \(x\) are scalars? What if she just fit a regression?2

fitted_beta <- unname(lm(y~0+x)$coef)
all.equal(true_beta, fitted_beta)
## [1] TRUE

It seems like R again has our backs.

The special case where \(\mathbf{X}\) is invertible and is \(100 x 100\) matrix:
#Wearing the hat of the "Gods of Olympus"
X <- matrix(rnorm(100*100), nrow = 100)
true_beta <- matrix(seq(10,1, length.out = nrow(X)), ncol = 1)
y <- X %*% true_beta

#Wearing the hat of the "researcher"
estimated_beta <- solve(t(X)%*%X) %*% t(X) %*% y
all.equal(true_beta, estimated_beta)
## [1] TRUE

Another success. When our researcher knows all there is to know, the OLS estimator gives the correct answer in the special case. Let’s see what happens if we fit with R’s lm function:

fitted_beta <- matrix(unname(lm(y~0+X)$coef), ncol = 1)
all.equal(true_beta, fitted_beta)
## [1] TRUE

As expected, the lm function did not fail us either.

The general case where \(\mathbf{X}\) is not invertible and is \(100 x 5\) matrix:

Let’s use the same \(\mathbf{X}\) matrix, but this time, let’s say that our researcher can only observe the first 5 features.

Now, if we tried to use the formula for the special case, R throws an expected error:

observed_X <- X[,1:5]
estimated_noisy_beta <- solve(observed_X) %*% y
## Error in solve.default(observed_X): 'a' (100 x 5) must be square

Well, that was expected. We got an error, quite an informative one though. It is basically telling us to make the “observed_X” an invertible matrix. That was the whole point of the estimator.

(estimated_noisy_beta <- solve(t(observed_X) %*% observed_X) %*% t(observed_X) %*% y)
##           [,1]
## [1,] 13.373890
## [2,]  6.926159
## [3,]  3.317777
## [4,]  3.497302
## [5,]  7.713949

OK, this worked. Now let’s compare it against the true values.

true_vs_est <- matrix(c(true_beta[1:5,], estimated_noisy_beta), byrow = FALSE, ncol = 2)
colnames(true_vs_est) <- c("truth", "estimated")
##          truth estimated
## [1,] 10.000000 13.373890
## [2,]  9.909091  6.926159
## [3,]  9.818182  3.317777
## [4,]  9.727273  3.497302
## [5,]  9.636364  7.713949

This time, we did not hit the mark. Our estimates do not look terrible, they were of the same sign as the true values, but they are certainly not the same as the truth as before. Well, this is the work of the error term.

Now, the question is whether our estimates were best that could be obtained? The answer is affirmative, but the proof requires us to bring the big guns of the mathematics: Gauss and Markov.3

  1. It is also possible to have more features than observations. In these cases, the following matrix algebra practically goes out of the window. In these cases, other methods, e.g. gradient descent or min-norm solutions, can be tried.↩︎

  2. We said that our researcher is careless, not stupid. Naturally, she did not include the intercept term.↩︎

  3. Gauss-Markov theorem↩︎

Doruk Cengiz
Doruk Cengiz
Lead Data Scientist