Developing intuition
Let’s start with the intuitive way. In short, whenever you plot your data, it could look something like the image above.
We know that OLS is just a “line” that best fits the data. We also know that the slope of this line is the $\beta$ in the regression $Y = \alpha + \beta X + \varepsilon$. But how is this slope, or line, calculated? The intuition is that it is the line that “minimizes” squared errors. Let’s dive into that.
Think of the “error” as the distance between the line and an actual data point. For example, when x=4, the line predicts that Y should be 3; however, one of the data points at x=4 has y=5. This means there is an “error” in the prediction of 5−3=2, or we are 2 units away from where we should be for a given x (x=4).
That’s one error for one data point, but we can do this for all data points. Therefore, for each point, we can find a measure of how big the error is.
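To make that concrete, here is a minimal sketch in Python, using a small made-up dataset and a hypothetical candidate line (the numbers are mine, chosen only to mirror the x=4 example above):

```python
# Made-up data and a hypothetical candidate line y = 1 + 0.5*x (illustrative only)
xs = [1, 2, 3, 4, 4, 5]
ys = [2.0, 1.5, 3.0, 5.0, 2.5, 3.5]

alpha, beta = 1.0, 0.5  # intercept and slope of the candidate line

# The "error" for each point is the actual Y minus the Y the line predicts
errors = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
print(errors)  # the point (4, 5) gives 5 - (1 + 0.5*4) = 2, as in the example above
```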
We want the errors to be as small as possible (in an ideal world, zero). Imagine that we are trying to figure out the best-fitting line but don’t know OLS. What we could do is draw one line we think would work well, add up all its errors to get one overall measure of how much error that line has, and then do the same for another line and compare. We would prefer the line with the smallest total error.
However, notice that some of the “errors” will be positive (the actual Y is larger than the predicted Y) and some will be negative (the actual Y is smaller than the predicted Y). Therefore, if we were to add up all the errors, we might end up with a number close to 0! (because the positive and negative errors cancel each other out). Hence, just adding up the errors is not good enough.
A solution to this is to “square” the errors. If we square positive numbers, they stay positive, and if we square negative numbers, they also become positive. Hence, once we add up all the squared errors, we have a measure that allows a better comparison across the different lines.
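Here is a small sketch of that cancellation problem, again with made-up numbers and two hypothetical candidate lines: the raw errors of a clearly bad line can still add up to roughly zero, while the squared errors show which line actually fits better.

```python
# Made-up data (illustrative only); y roughly follows y = 1 + x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]

def errors(alpha, beta):
    """Errors y - (alpha + beta*x) for a candidate line."""
    return [y - (alpha + beta * x) for x, y in zip(xs, ys)]

flat = errors(alpha=4.0, beta=0.0)    # a flat line through the middle of the data
sloped = errors(alpha=1.0, beta=1.0)  # a line that follows the upward trend

# Raw sums: positives and negatives cancel, so both lines look equally "good"
print(sum(flat), sum(sloped))                              # both are (about) 0
# Sums of squared errors: the flat line is clearly worse
print(sum(e**2 for e in flat), sum(e**2 for e in sloped))  # ~9.5 vs ~0.1
```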
This is why we say that OLS is trying to “minimize squared errors,” and it is in fact where OLS gets its name: Ordinary Least Squares.
Now, let’s write that with math. Let’s call the errors epsilon, or $\varepsilon$, and the actual calculated errors epsilon hat, or $\hat{\varepsilon}$. Since we can calculate one for every observation in our data, we would have $\hat{\varepsilon}_1, \hat{\varepsilon}_2, \dots, \hat{\varepsilon}_n$; let’s call one of those errors $\hat{\varepsilon}_i$, where $i$ can go from 1 to the last observation, $n$.
Next, we need to express the sum of all of these. This can be done with the summation sign in math, which is Sigma, or $\sum$:

$$\sum_{i=1}^{n} \hat{\varepsilon}_i$$
This represents the sum of all the errors. Now, we are missing the square part, so let’s add that:

$$\sum_{i=1}^{n} \hat{\varepsilon}_i^2$$
So, in math, we want to minimize the sum of squared errors, or

$$\min \sum_{i=1}^{n} \hat{\varepsilon}_i^2$$
But how is this possible? How does one “minimize” this? Well, let’s go back to see what $\hat{\varepsilon}_i$ is equal to.
It is just the actual Y minus the predicted Y we got from OLS ($\hat{\varepsilon}_i = Y_i - \hat{Y}_i$), and since OLS fits $Y = \alpha + \beta X + \varepsilon$, the prediction ($\hat{Y}$) is just alpha plus beta times X: $\hat{Y}_i = \alpha + \beta X_i$.
Now, Y and X are data, so we cannot change those; they are given. The only things we can change are alpha and beta. So if we want to minimize the sum of squared errors, $\sum_{i=1}^{n} (Y_i - \alpha - \beta X_i)^2$, we need to pick an $\alpha$ and a $\beta$ that minimize the squared errors. How do we find them? We use an insight from calculus: if we take the derivative with respect to beta, set it equal to 0, and solve for beta, this gives us a minimum (for convex functions, which this is).
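Before the calculus, here is a brute-force sketch of that same idea (made-up numbers again, and a hypothetical `sse` helper): try a grid of candidate alpha and beta values, compute the sum of squared errors for each, and keep the pair with the smallest one. Calculus just takes us to that minimizing pair directly, without the search.

```python
import numpy as np

# Made-up data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.0])

def sse(alpha, beta):
    """Sum of squared errors for the candidate line alpha + beta*x."""
    return np.sum((y - (alpha + beta * x)) ** 2)

# Brute force: evaluate a grid of (alpha, beta) pairs and keep the best one
alphas = np.linspace(-2, 4, 121)
betas = np.linspace(-1, 3, 81)
best_sse, best_alpha, best_beta = min(
    (sse(a, b), a, b) for a in alphas for b in betas
)
print(best_alpha, best_beta, best_sse)  # lands close to the alpha and beta OLS picks
```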
If you want to see the math for it, open this toggle; otherwise, you can skip to the result below.
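For reference, here is a condensed sketch of that derivation (the standard first-order-condition argument; the notation is mine and may not match the toggle exactly), writing $S(\alpha, \beta)$ for the sum of squared errors:

$$S(\alpha, \beta) = \sum_{i=1}^{n} \left(Y_i - \alpha - \beta X_i\right)^2$$

Setting both partial derivatives equal to zero:

$$\frac{\partial S}{\partial \alpha} = -2\sum_{i=1}^{n}\left(Y_i - \alpha - \beta X_i\right) = 0 \quad\Longrightarrow\quad \hat{\alpha} = \bar{Y} - \hat{\beta}\,\bar{X}$$

$$\frac{\partial S}{\partial \beta} = -2\sum_{i=1}^{n} X_i\left(Y_i - \alpha - \beta X_i\right) = 0$$

Substituting $\hat{\alpha}$ into the second condition and rearranging gives the result below.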
Once you do that math, you find the following:

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$
That means we can calculate the $\beta$ that minimizes the sum of squared errors by just doing the calculation above. $\bar{X}$ is a math term for the average of X, and $\bar{Y}$ for the average of Y.
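As a sketch of what that calculation looks like in code (made-up data and NumPy here; the exercise below uses STATA and a spreadsheet instead):

```python
import numpy as np

# Made-up data for illustration (the exercise below uses the downloadable dataset instead)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 1.5 * x + rng.normal(size=100)  # a "true" slope of 1.5, plus noise

# Beta from the formula: sum((x - x_bar)*(y - y_bar)) / sum((x - x_bar)^2)
beta_formula = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Beta from a standard least-squares fit, for comparison
beta_fit, alpha_fit = np.polyfit(x, y, deg=1)

print(beta_formula, beta_fit)  # the two slopes match (up to floating-point rounding)
```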
Now your job will be to do exactly that. Download this data
and run the regression in STATA to obtain the beta, then use Google Sheets or Excel to apply the formula above and corroborate that you get the same $\hat{\beta}$.