The Simple Logic Behind the $\hat{\beta}$ of the OLS
The (in)famous OLS estimator:
$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$
No matrices, only scalars
Let’s, for the moment, forget about the matrices. Let’s say $y = x\beta$, where $y$, $x$, and $\beta$ are all scalars and $x \neq 0$. Recovering $\beta$ is then just a matter of division: $\beta = y / x$.
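As a tiny sketch of this logic in R (my own addition, with made-up numbers, not part of the original derivation):

# Scalar case: y = x * beta, so beta is recovered by plain division (x != 0).
x <- 2
true_beta <- 1.5
y <- x * true_beta
y / x   # gives back 1.5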
Special case: X as an invertible matrix
Before we move on to the more realistic and general case where the feature matrix $X$ can be non-invertible, let’s consider the case where $X$ is invertible. The nice thing about invertible matrices is that, at least in the current setting, they behave much like scalars.
In this case, $y = X\beta$, where $X$ is a square, invertible matrix.
The first thing that would come to mind, if we mistakenly thought that we were still in the realm of scalars, would be to divide both sides by $X$. Division is not defined for matrices, but multiplying both sides by the inverse $X^{-1}$ does the same job: $X^{-1} y = X^{-1} X \beta = \beta$.
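A minimal sketch of that idea in R (the small matrix below is purely illustrative and not from the original post):

# Invertible case: multiplying by solve(X) plays the role of "dividing by X".
X_small <- matrix(c(2, 0, 1, 3), nrow = 2)   # a 2 x 2 invertible matrix
beta_small <- c(1, -2)
y_small <- X_small %*% beta_small
solve(X_small) %*% y_small                   # recovers beta_small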
General case: X can be a non-invertible matrix
99.99% of the time, $X$ will not be a square matrix. In fact, most of the time, $X$ will have more observations (rows) than features/predictors (columns).1 For a more intuitive grasp, let’s say we have collected sufficiently many observations, but we could only record a small number of features. Put concretely, let’s say that $X$ has $100$ rows (observations) and only $5$ columns (recorded features).
In this case, we can still try to guess what $\beta$ is in $y = X\beta$, but we cannot do it the old way: only square matrices can have inverses.
Because $X$ is not square, $X^{-1}$ simply does not exist.
Still, $X^\top X$ is a square ($5 \times 5$) matrix, and, as long as the columns of $X$ are linearly independent, it is invertible. So nothing stops us from multiplying both sides of $y = X\beta$ first by $X^\top$ and then by $(X^\top X)^{-1}$:
$$(X^\top X)^{-1} X^\top y = (X^\top X)^{-1} (X^\top X) \beta = \beta$$
Voila!! We just arrived at the OLS estimator, $\hat{\beta} = (X^\top X)^{-1} X^\top y$. All this weird-looking formula is doing is putting an invertible matrix in front of $\beta$ so that it can be cancelled, leaving $\beta$ by itself.
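To make the “square matrix in front of $\beta$” point concrete, here is a quick dimension check (a sketch I am adding; the matrix below is just an example with the dimensions from above):

# X itself is 100 x 5, so solve(X) is impossible; t(X) %*% X is 5 x 5 and square.
tall_X <- matrix(rnorm(100 * 5), nrow = 100)
dim(tall_X)                  # 100 x 5: not square
dim(t(tall_X) %*% tall_X)    # 5 x 5: square, so it has a chance of being invertible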
The beauty of the formula is that it can be directly applied in the special cases as well:
If $X$ is invertible, then this formula will produce the same answer as before: $(X^\top X)^{-1} X^\top y = X^{-1} (X^\top)^{-1} X^\top y = X^{-1} y$.
If $X$ is a scalar $x$, because the transpose of a scalar is itself, the formula becomes $(x x)^{-1} x y = y / x$.
Let’s see these in action
Now, we will have two hats. When we wear the hat of the “Gods of Olympus”, we will know the underlying data generating process, read: the truth. In fact, we will decide on the truth. When we wear the hat of the “researcher”, we will not know the truth, but we will see the data, $y$ and $X$, and will have to estimate $\beta$ from them.
The case of a scalar $x$:
#Wearing the hat of the "Gods of Olympus"
set.seed(12345)
x <- rnorm(n=1)
true_beta <- 1
y <- x*true_beta
#Wearing the hat of the "researcher"
estimated_beta <- as.numeric(solve(t(x)%*%x) %*% t(x) %*% y)
all.equal(true_beta, estimated_beta)
## [1] TRUE
Our researcher, though certainly not a god, managed to obtain the truth with the OLS estimator formula even though the formula is for matrices (well, this works thanks to R, because it understands our intention). Now, what if our researcher was somewhat careless, did not pay attention to the fact that $x$ is a scalar rather than a matrix, and simply used R’s lm function?2
fitted_beta <- unname(lm(y~0+x)$coef)
all.equal(true_beta, fitted_beta)
## [1] TRUE
It seems like R again has our backs.
The special case where $X$ is an invertible matrix:
#Wearing the hat of the "Gods of Olympus"
X <- matrix(rnorm(100*100), nrow = 100)
true_beta <- matrix(seq(10,1, length.out = nrow(X)), ncol = 1)
y <- X %*% true_beta
#Wearing the hat of the "researcher"
estimated_beta <- solve(t(X)%*%X) %*% t(X) %*% y
all.equal(true_beta, estimated_beta)
## [1] TRUE
Another success. When our researcher knows all there is to know, the OLS estimator gives the correct answer in the special case. Let’s see what happens if we fit with R’s lm function:
fitted_beta <- matrix(unname(lm(y~0+X)$coef), ncol = 1)
all.equal(true_beta, fitted_beta)
## [1] TRUE
As expected, the lm function did not fail us either.
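As a quick cross-check of the first special case listed earlier (my addition, reusing the same X, y, and true_beta), the plain inverse of X should give the same answer as the full formula:

# X happens to be square and invertible here, so X^{-1} y alone recovers beta.
shortcut_beta <- solve(X) %*% y
all.equal(true_beta, shortcut_beta)   # expected to be TRUE (up to numerical noise)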
The general case where $X$ is a matrix and is not invertible:
Let’s use the same $X$, true_beta, and $y$ as before, but now suppose that our researcher could only record the first $5$ of the $100$ features, i.e. the first $5$ columns of $X$.
Now, if we tried to use the formula for the special case, R throws an expected error:
observed_X <- X[,1:5]
estimated_noisy_beta <- solve(observed_X) %*% y
## Error in solve.default(observed_X): 'a' (100 x 5) must be square
Well, that was expected. We got an error, quite an informative one though: it is basically telling us that we need to hand solve() a square, invertible matrix. Arranging exactly that was the whole point of the estimator, so let’s apply the full formula:
(estimated_noisy_beta <- solve(t(observed_X) %*% observed_X) %*% t(observed_X) %*% y)
## [,1]
## [1,] 13.373890
## [2,] 6.926159
## [3,] 3.317777
## [4,] 3.497302
## [5,] 7.713949
OK, this worked. Now let’s compare it against the true values.
true_vs_est <- matrix(c(true_beta[1:5,], estimated_noisy_beta), byrow = FALSE, ncol = 2)
colnames(true_vs_est) <- c("truth", "estimated")
print(true_vs_est)
## truth estimated
## [1,] 10.000000 13.373890
## [2,] 9.909091 6.926159
## [3,] 9.818182 3.317777
## [4,] 9.727273 3.497302
## [5,] 9.636364 7.713949
This time, we did not hit the mark. Our estimates do not look terrible, as they at least have the same sign as the true values, but, unlike before, they are not identical to the truth. Well, this is the work of the error term: the 95 features we could not record.
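To see explicitly that the unrecorded features are what plays the role of the error term, here is a decomposition I am adding, built from the objects already defined above: y splits exactly into an observed part and a leftover part.

# y = (part explained by the 5 observed features) + (part from the 95 unrecorded ones)
error_term <- X[, 6:100] %*% true_beta[6:100, , drop = FALSE]
all.equal(y, observed_X %*% true_beta[1:5, , drop = FALSE] + error_term)   # should be TRUE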
Now, the question is whether our estimates were the best that could be obtained. The answer is affirmative, but the proof requires us to bring in the big guns of mathematics: Gauss and Markov.3
It is also possible to have more features than observations. In these cases, the following matrix algebra practically goes out of the window, and other methods, e.g. gradient descent or minimum-norm solutions, can be tried.↩︎
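For completeness, a sketch of the minimum-norm idea mentioned in this footnote (my addition; MASS::ginv computes the Moore-Penrose pseudoinverse):

# More features (100) than observations (5): t(X) %*% X is singular, but the
# minimum-norm solution from the pseudoinverse still reproduces y exactly.
library(MASS)
set.seed(12345)
wide_X <- matrix(rnorm(5 * 100), nrow = 5)
some_beta <- rnorm(100)
y_wide <- wide_X %*% some_beta
min_norm_beta <- ginv(wide_X) %*% y_wide   # one of infinitely many exact solutions
all.equal(as.numeric(wide_X %*% min_norm_beta), as.numeric(y_wide))   # should be TRUE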
We said that our researcher is careless, not stupid. Naturally, she did not include the intercept term.↩︎