Sebastian Tello

Standard Errors in OLS and IV

This worksheet explains what standard errors are, where they come from, and why we use "robust" versions. We use one dataset and one example from start to finish so you can see how the same concept evolves across different estimation settings.

Dataset: nlsw88 (National Longitudinal Survey of Women, 1988), built into Stata.

Running example throughout:

  • Outcome Y: wage (hourly wage in dollars)
  • Main regressor X: ttl_exp (total work experience in years)
  • We add covariates and switch to IV as we go

Section 1: What Is a Standard Error?

The Core Idea

Suppose you survey 50 women and estimate the average hourly wage. You get $7.57. A different sample of 50 would give a slightly different number — maybe $7.43, or $7.71. Repeat this thousands of times and you'd have a distribution of estimates. The standard deviation of that distribution is the standard error.
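This repeated-sampling thought experiment is easy to simulate. The sketch below is pure Python with made-up population parameters (a mean of $7.50 and SD of $3.00, not the actual nlsw88 values): it draws many samples of 50, records each sample mean, and checks that the spread of those means matches the textbook formula $\sigma/\sqrt{n}$.

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical population of wages: mean $7.50, SD $3.00
# (illustrative numbers, not the actual nlsw88 values).
mu, sigma, n = 7.50, 3.00, 50

# Draw many samples of 50 and record each sample mean.
means = []
for _ in range(20_000):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.mean(sample))

# The SD of the sample means is the standard error of the mean;
# theory says it should be close to sigma / sqrt(n).
empirical_se = statistics.stdev(means)
theoretical_se = sigma / math.sqrt(n)
print(round(empirical_se, 3), round(theoretical_se, 3))
```

The two numbers agree closely: the standard error really is just the standard deviation of the estimate across hypothetical repeated samples.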

The same logic applies to regression coefficients. Our estimate $\hat{\beta}$ comes from one sample. A different sample would give a different $\hat{\beta}$. The standard error of $\hat{\beta}$ measures how much it would vary across repeated samples from the same population.

Technical definition: $SE(\hat{\beta}) = SD(\text{sampling distribution of } \hat{\beta})$

Non-technical definition: The standard error is your "margin of error" for a regression coefficient. A small SE means the data pin down the estimate precisely. A large SE means the estimate is noisy — you'd get very different numbers with a different sample. Hence, when we use the language of "an estimate is precise" or "the covariates add precision," what we are really saying is that the standard errors are smaller.

What Kind of Uncertainty Does the SE Capture?

This is the most important thing to understand about standard errors, and the thing most students get wrong.

The SE captures one specific type of uncertainty: sampling uncertainty. That is, the uncertainty that comes from using a finite random sample to estimate something about a population. If you drew a different random sample from the same population and ran the same regression, you'd get a slightly different $\hat{\beta}$. The SE tells you how much that estimate would bounce around across hypothetical repeated samples.

Here is a table that summarizes other types of uncertainty:

Type of uncertainty | What it means | Does SE capture it?
------------------- | ------------- | -------------------
Sampling uncertainty | We have one sample, not the whole population | ✓ Yes — this is exactly what SE measures
Model uncertainty | Maybe the true relationship isn't linear, or we're missing key controls | ✗ No
Causal uncertainty | Maybe $\hat{\beta}$ is a correlation, not a causal effect (OVB, reverse causality) | ✗ No
Measurement uncertainty | Maybe wages or experience are recorded with error | ✗ No
External validity | Maybe the result doesn't generalize to other populations or time periods | ✗ No

In our running example: reg wage ttl_exp gives $\hat{\beta} = 0.331$ with $SE = 0.025$. The small SE tells us: if we drew another random sample of 2,246 women from the same 1988 population and ran the same regression, our estimate would likely be close to 0.331. It does not tell us:

  • Whether 0.331 is the causal effect of experience on wages (or just a correlation driven by unobserved ability)
  • Whether the linear model is correctly specified (maybe returns to experience are quadratic)
  • Whether this finding generalizes to men, or to workers today

Why this matters: A coefficient can be extremely precisely estimated — tiny SE, highly "statistically significant," t-stat of 13 — and still be completely misleading if the model is misspecified or the coefficient doesn't identify a causal effect. Precision and validity are different things. The SE answers the question "how reliable is this estimate from this sample?" It does not answer "am I estimating the right thing?"

When a classmate says "this coefficient is significant" they mean the SE is small relative to the estimate — they are making a statement about sampling uncertainty only. All the other sources of uncertainty require separate arguments: a research design that addresses causal identification, robustness checks for model specification, and so on.

Why Standard Errors Matter

The SE appears in three places you'll use constantly:

  1. t-statistic: $t = \hat{\beta} / SE(\hat{\beta})$ — used to test whether $\beta = 0$
  2. 95% confidence interval: $\hat{\beta} \pm 1.96 \times SE(\hat{\beta})$
  3. p-value: derived from the t-statistic and degrees of freedom

If we get the SE wrong, all of these are wrong — even if $\hat{\beta}$ is correct.
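To see how the three quantities fit together, here is the arithmetic for the ttl_exp row of the Stata output above, done in Python (1.96 is the usual large-sample cutoff for a 95% interval):

```python
beta_hat = 0.3314291   # coefficient on ttl_exp from the regression output
se = 0.0254087         # its standard error

t_stat = beta_hat / se           # t-statistic for testing beta = 0
ci_low = beta_hat - 1.96 * se    # lower bound of the 95% CI
ci_high = beta_hat + 1.96 * se   # upper bound of the 95% CI
print(round(t_stat, 2), round(ci_low, 3), round(ci_high, 3))
```

This reproduces the t-statistic of 13.04 reported by Stata, and gives a confidence interval of roughly (0.282, 0.381).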

Section 2: Standard Errors in Simple Bivariate OLS

Our Starting Regression

sysuse nlsw88, clear
reg wage ttl_exp

Output:

------------------------------------------------------------------------------
        wage | Coefficient  Std. err.      t    P>|t|
-------------+----------------------------------------------------------------
     ttl_exp |   .3314291   .0254087    13.04   0.000
       _cons |   3.612492   .3393469    10.65   0.000
------------------------------------------------------------------------------

The SE on ttl_exp is 0.0254. Where does this come from?

The Formula

For simple bivariate OLS (one regressor), the standard error of $\hat{\beta}_1$ is:

$$SE(\hat{\beta}_1) = \sqrt{\frac{s^2}{SSX}}$$

Where:

  • $s^2 = \frac{SSR}{n-2}$ is the estimated variance of the residuals
  • $SSR = \sum_i \hat{\varepsilon}_i^2$ is the sum of squared residuals
  • $SSX = \sum_i (x_i - \bar{x})^2$ is the total variation in X
  • $n - 2$ is the degrees of freedom ($n$ observations minus 2 estimated parameters)

What this formula tells you: SE is small when residuals are small (good model fit), X varies a lot, and sample size is large.

Computing the SE by Hand

Let's verify that Stata's SE of 0.0254 matches the formula.

Step 1: Get SSR and degrees of freedom

reg wage ttl_exp

display "SSR (sum of squared residuals): " e(rss)
display "n:                              " e(N)
display "Denominator df (n-2):           " e(df_r)

Step 2: Compute $s^2$ (estimated error variance)

scalar s2 = e(rss) / e(df_r)
display "s^2 = " s2
display "Root MSE (= s) = " sqrt(s2)
* Root MSE is also reported in the Stata regression header

Step 3: Compute SSX (variation in X)

quietly summarize ttl_exp
scalar SSX = r(Var) * (r(N) - 1)
display "SSX = Var(ttl_exp) x (n-1) = " SSX

Step 4: Put it together

scalar my_SE = sqrt(s2 / SSX)
display "SE by hand:               " my_SE
display "SE from Stata regression: " _se[ttl_exp]
display "Difference (should be ~0): " my_SE - _se[ttl_exp]
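The same verification can be done outside Stata. The pure-Python simulation below (synthetic data with made-up parameters, loosely mimicking the wage example) draws many datasets with fresh noise, re-estimates the slope each time, and checks that the formula $\sqrt{s^2/SSX}$ matches the actual spread of the slope estimates across samples:

```python
import math
import random
import statistics

random.seed(7)

def ols_slope_and_se(x, y):
    """Bivariate OLS: return (slope, classical SE of the slope)."""
    n = len(x)
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    ssx = sum((xi - xbar) ** 2 for xi in x)
    beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ssx
    alpha = ybar - beta * xbar
    ssr = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    s2 = ssr / (n - 2)                 # estimated error variance
    return beta, math.sqrt(s2 / ssx)   # SE = sqrt(s^2 / SSX)

# Synthetic design, fixed across replications (made-up parameters).
n, true_beta = 500, 0.33
x_pop = [random.uniform(0, 20) for _ in range(n)]

# Draw many datasets with fresh noise; record slope and formula SE each time.
slopes, formula_ses = [], []
for _ in range(2000):
    y = [3.6 + true_beta * xi + random.gauss(0, 3) for xi in x_pop]
    b, se = ols_slope_and_se(x_pop, y)
    slopes.append(b)
    formula_ses.append(se)

# The formula SE should match the actual spread of slopes across samples.
print(round(statistics.stdev(slopes), 4), round(statistics.mean(formula_ses), 4))
```

The two numbers agree: the formula computed from a single sample really does estimate the across-sample standard deviation of $\hat{\beta}_1$.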

The Formula as a Story

$$SE(\hat{\beta}_1) = \sqrt{\frac{s^2}{SSX}} = \frac{\text{noise in the outcome}}{\text{variation in X}}$$

What changes | Effect on SE | Intuition
------------ | ------------ | ---------
More noise in outcome (larger $s^2$) | SE ↑ | Harder to see a clear signal
More variation in X (larger $SSX$) | SE ↓ | Easier to identify the slope
Larger sample (larger $n$) | SE ↓ | $SSX$ grows; $s^2$ estimated more precisely
No variation in X ($SSX = 0$) | SE → ∞ | Impossible to estimate slope
Question 1: The SE on ttl_exp is 0.0254. If the sample size doubled, approximately what would the new SE be?
Question 2: Suppose Root MSE = 5.5 and SD of ttl_exp = 4.61 with n = 2246. Compute SE without running a regression.

Section 3: Standard Errors With Multiple Covariates

Adding Controls

reg wage ttl_exp union collgrad

Output:

------------------------------------------------------------------------------
        wage | Coefficient  Std. err.      t    P>|t|
-------------+----------------------------------------------------------------
     ttl_exp |   .2954401   .0181754    16.25   0.000
       union |   .9926316    .194432     5.11   0.000
    collgrad |   3.125986   .1947131    16.05   0.000
       _cons |   2.762356   .2494487    11.07   0.000
------------------------------------------------------------------------------

Notice: the SE on ttl_exp fell from 0.0254 to 0.0182. Why?

The Matrix Formula

For multiple regression, the OLS estimate is $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$, and its variance-covariance matrix is:

$$\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}$$

The SE for coefficient $j$ is:

$$SE(\hat{\beta}_j) = \sqrt{s^2 \cdot [(\mathbf{X}'\mathbf{X})^{-1}]_{jj}}$$

This generalizes the bivariate formula. The $[(\mathbf{X}'\mathbf{X})^{-1}]_{jj}$ term is the analogue of $1/SSX$ — it measures how much "independent" variation remains in $X_j$ after projecting out all other covariates.

Why Did the SE on ttl_exp Fall?

When we add union and collgrad, two competing forces act:

  1. $s^2$ falls — the model fits much better (less residual variance)
  2. Effective variation in ttl_exp falls — some variation in experience is shared with union and college

In this case, force #1 dominates: union and college absorb a lot of residual wage variation, driving down $s^2$ substantially.

Non-technical: Imagine measuring height from a shadow. If you can control for the sun angle (a covariate), the shadow becomes a more reliable signal of height — the noise falls. Adding union and college does the same: it soaks up noise in wages, making the slope on experience more precisely identified.

A Key Warning: Multicollinearity

If a new covariate is highly correlated with ttl_exp, the SE on ttl_exp can explode.

set seed 42
gen ttl_exp_noisy = ttl_exp + rnormal(0, 0.5)
reg wage ttl_exp ttl_exp_noisy
* You will see enormous SEs on both experience variables
drop ttl_exp_noisy

Non-technical: If two students always study exactly as many hours as the other sleeps fewer, you can't separately estimate the effect of each. Multicollinearity is this problem in regression: when two variables move together, their separate effects become impossible to estimate precisely.
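The mechanics behind this can be sketched in pure Python with simulated (made-up) data. When $x_1$ and $x_2$ are highly correlated, the residual variation left in $x_1$ after partialling out $x_2$ shrinks by the factor $1 - r^2$, so the classical SE on $x_1$ inflates by $1/\sqrt{1-r^2}$; the squared version, $1/(1-r^2)$, is the variance inflation factor (VIF):

```python
import math
import random
import statistics

random.seed(3)

n = 10_000
# Two strongly correlated regressors: x1 = x2 + small noise (made-up data).
x2 = [random.gauss(0, 1) for _ in range(n)]
x1 = [x2i + random.gauss(0, 0.5) for x2i in x2]

def demean(v):
    m = statistics.mean(v)
    return [vi - m for vi in v]

d1, d2 = demean(x1), demean(x2)
ssx1 = sum(v ** 2 for v in d1)
ssx2 = sum(v ** 2 for v in d2)
s12 = sum(a * b for a, b in zip(d1, d2))

# Residualize x1 on x2 (the Frisch-Waugh idea): only this leftover
# variation identifies the coefficient on x1 in the multiple regression,
# so the classical SE on x1 becomes sqrt(s^2 / ssx1_resid).
g = s12 / ssx2
resid = [a - g * b for a, b in zip(d1, d2)]
ssx1_resid = sum(v ** 2 for v in resid)

r = s12 / math.sqrt(ssx1 * ssx2)   # correlation between x1 and x2
vif = 1 / (1 - r ** 2)             # variance inflation factor
print(round(ssx1 / ssx1_resid, 2), round(vif, 2))
```

The shrinkage in usable variation, $SSX_1 / SSX_{1,\text{resid}}$, equals the VIF exactly: with $r^2 \approx 0.8$ here, the variance on the $x_1$ coefficient is inflated about fivefold relative to the uncorrelated case.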

Question 3: Does adding more controls always reduce the SE on ttl_exp?

Section 4: Standard Errors in IV (2SLS)

Why IV Standard Errors Are Larger

In our example we will instrument union status (union) with region (south). When we use IV, we use only the variation in union that is predicted by the instrument south. This "exogenous variation" is a subset of all variation in union — like shrinking the effective sample size.

Less variation to work with → larger standard errors.

A useful approximation:

$$\frac{SE_{IV}}{SE_{OLS}} \approx \frac{1}{\sqrt{R^2_{\text{first stage}}}}$$

If south explains only 2% of the variation in union ($R^2 = 0.02$), then $SE_{IV} \approx \frac{1}{\sqrt{0.02}} \times SE_{OLS} \approx 7 \times SE_{OLS}$ — seven times larger!

Non-technical: Imagine estimating how fertilizer affects crop yield using "whether it rained last Tuesday" as your instrument for fertilizer used. Rain predicts some fertilizer use, but only a tiny fraction of its variation. You're trying to identify a slope using only 2% of the fertilizer variation that rain "moved." Of course the estimate is imprecise.
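The rule-of-thumb ratio is just arithmetic. A few illustrative first-stage $R^2$ values (Python):

```python
import math

# Approximate blow-up of the IV standard error relative to OLS
# for a given first-stage R^2, using the 1/sqrt(R^2) rule of thumb.
for r2 in (0.02, 0.05, 0.10, 0.50):
    print(r2, round(1 / math.sqrt(r2), 2))
```

Even a respectable first stage ($R^2 = 0.10$) roughly triples the standard error; only a very strong first stage keeps IV precision close to OLS.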

Demonstration

sysuse nlsw88, clear

* OLS (for comparison)
reg wage union age ttl_exp collgrad, robust
scalar se_ols = _se[union]
display "OLS SE on union: " se_ols

* IV
ivregress 2sls wage (union = south) age ttl_exp collgrad, robust
scalar se_iv = _se[union]
display "IV SE on union:  " se_iv
display "Ratio SE_IV / SE_OLS: " se_iv / se_ols

* First-stage R^2 and predicted ratio
reg union south age ttl_exp collgrad
display "First-stage R^2:               " e(r2)
display "Predicted ratio (1/sqrt(R^2)): " 1/sqrt(e(r2))
display "Actual ratio:                  " se_iv / se_ols

Question 4: Suppose the first-stage R² = 0.01. How much larger would the IV SE be than the OLS SE?
Question 5: "IV fixes endogeneity bias, so even with a large SE the IV estimate is more trustworthy than OLS." Is this right?

Section 5: Robust Standard Errors

The Hidden Assumption in Classical SEs

Everything in Sections 2–4 assumed homoskedasticity: the variance of the error term $\varepsilon_i$ is the same for every observation.

$$\text{Var}(\varepsilon_i \mid X_i) = \sigma^2 \quad \text{for all } i$$

For wages, this means the "unpredictable" part of wages has the same spread at all experience levels. Plausible? Probably not — higher earners likely have more variance in wages. When variance depends on X, errors are heteroskedastic: $\text{Var}(\varepsilon_i \mid X_i) = \sigma_i^2$, varying across observations.

Think of predicting how long tasks take. Simple tasks (data entry) have low variance. Complex tasks (writing a dissertation) vary enormously. If your dataset mixes both, the residual uncertainty isn't constant — it depends on task type. Heteroskedasticity is exactly this: the noise in your regression varies across observations.

What Goes Wrong With Classical SEs?

The classical formula $\sqrt{s^2/SSX}$ uses one $s^2$ for all observations. If variance actually differs, $s^2$ is just an average — and using an average ignores where the real uncertainty lives:

  • In high-variance regions, classical SEs are too small (overconfident)
  • In low-variance regions, classical SEs are too large (underconfident)
  • t-tests and confidence intervals are unreliable

Note: the point estimates $\hat{\beta}$ are unaffected by heteroskedasticity — OLS is still unbiased. Only inference is broken.

Robust (Huber-White) Standard Errors

Instead of assuming every observation has variance $\sigma^2$, use each observation's own squared residual $\hat{\varepsilon}_i^2$ to estimate its own variance. The sandwich estimator is:

$$\widehat{\text{Var}}_{\text{robust}}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}'\mathbf{X})^{-1} \left[\sum_i \hat{\varepsilon}_i^2 \mathbf{x}_i \mathbf{x}_i'\right] (\mathbf{X}'\mathbf{X})^{-1}$$

$(\mathbf{X}'\mathbf{X})^{-1}$ is the "bread" and $\sum_i \hat{\varepsilon}_i^2 \mathbf{x}_i \mathbf{x}_i'$ is the "meat."

Technical property: Consistent under heteroskedasticity of any unknown form.

Non-technical: Classical SEs assume every data point is equally "noisy." Robust SEs say: "Let each data point tell us how much it personally contributes to uncertainty." Observations with large residuals contribute more; observations with small residuals contribute less. It's a more honest accounting of where uncertainty comes from.
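In the bivariate case the sandwich collapses to a one-line formula for the slope: $\widehat{\text{Var}}_{\text{robust}}(\hat{\beta}_1) = \sum_i \hat{\varepsilon}_i^2 (x_i - \bar{x})^2 / SSX^2$. The pure-Python sketch below (simulated data with made-up parameters, where the noise SD grows with x) computes classical and robust SEs side by side:

```python
import math
import random
import statistics

random.seed(11)

n = 5000
x = [random.uniform(0, 20) for _ in range(n)]
# Heteroskedastic errors: noise SD grows with x (made-up design).
y = [3.6 + 0.33 * xi + random.gauss(0, 0.05 * xi ** 2) for xi in x]

# Bivariate OLS fit.
xbar, ybar = statistics.mean(x), statistics.mean(y)
dx = [xi - xbar for xi in x]
ssx = sum(d ** 2 for d in dx)
beta = sum(d * (yi - ybar) for d, yi in zip(dx, y)) / ssx
alpha = ybar - beta * xbar
e = [yi - alpha - beta * xi for xi, yi in zip(x, y)]

# Classical SE: one pooled s^2 for every observation.
s2 = sum(ei ** 2 for ei in e) / (n - 2)
se_classical = math.sqrt(s2 / ssx)

# Robust (HC0) SE for the slope: each observation keeps its own e_i^2.
se_robust = math.sqrt(sum((ei * d) ** 2 for ei, d in zip(e, dx)) / ssx ** 2)

print(round(se_classical, 4), round(se_robust, 4))
```

With variance rising in x, the high-x observations carry both large residuals and large leverage, so the robust SE comes out noticeably larger than the classical one, while the slope estimate itself is unchanged.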

Key takeaway: Robust SEs do not change $\hat{\beta}$ at all. They only change the SE — and thus the t-statistic, p-value, and confidence intervals.

In Stata: Just Add , robust

sysuse nlsw88, clear

* Classical standard errors
reg wage ttl_exp

* Robust standard errors
reg wage ttl_exp, robust

You will see:

  • Coefficients are identical — robust SEs don't change estimates
  • SEs differ — sometimes slightly, sometimes substantially

* With multiple covariates
reg wage ttl_exp union collgrad
reg wage ttl_exp union collgrad, robust

* IV with robust SEs
ivregress 2sls wage (union = south) age ttl_exp collgrad, robust

Detecting Heteroskedasticity

reg wage ttl_exp

* Breusch-Pagan formal test
* H0: errors are homoskedastic
* Reject H0 -> use robust SEs
estat hettest

* Visual: residual-vs-fitted plot
* If spread grows with fitted values: heteroskedasticity
rvfplot, yline(0)
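For intuition, here is a simplified pure-Python version of the Breusch–Pagan idea on simulated data (an illustrative sketch, not exactly what estat hettest computes): regress the squared residuals on x, and under homoskedasticity the statistic $n \times R^2$ from that auxiliary regression is approximately $\chi^2$ with 1 degree of freedom.

```python
import math
import random
import statistics

random.seed(5)

def bivariate_r2(x, y):
    """R^2 from a bivariate OLS of y on x."""
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    ssx = sum((xi - xbar) ** 2 for xi in x)
    beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ssx
    sst = sum((yi - ybar) ** 2 for yi in y)
    return (beta ** 2 * ssx) / sst   # explained SS over total SS

n = 2000
x = [random.uniform(0, 20) for _ in range(n)]
y = [3.6 + 0.33 * xi + random.gauss(0, 0.3 * xi) for xi in x]  # heteroskedastic

# Fit OLS and collect squared residuals.
xbar, ybar = statistics.mean(x), statistics.mean(y)
ssx = sum((xi - xbar) ** 2 for xi in x)
beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ssx
alpha = ybar - beta * xbar
e2 = [(yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y)]

# LM statistic: n * R^2 from regressing e^2 on x.
lm = n * bivariate_r2(x, e2)
p_value = math.erfc(math.sqrt(lm / 2))  # chi-square(1) upper-tail probability
print(round(lm, 1), p_value < 0.05)
```

Because the simulated noise SD grows with x, the test statistic is enormous and the null of homoskedasticity is decisively rejected, which is the cue to report robust SEs.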

When Does It Matter Most?

Robust SEs differ most from classical SEs when:

  1. Skewed outcomes (wages, income, firm revenues) — high earners create large residuals
  2. Binary or count outcomes — variance is mechanically tied to the mean
  3. Cross-sectional data with heterogeneous units (countries, individuals, firms)

For wages — our example — the distribution is right-skewed and variance almost certainly increases with experience. Always use robust SEs with wage data.

Question 6: After reg wage ttl_exp and reg wage ttl_exp, robust — are the coefficients different? Are the SEs different? What should you conclude?
Question 7: "Classical SEs are efficient when errors are homoskedastic. Robust SEs are slightly larger. So always use classical SEs and switch to robust only when you detect heteroskedasticity." Evaluate this.

Summary Table

Setting | SE Formula (sketch) | Key Properties
------- | ------------------- | --------------
Bivariate OLS | $\sqrt{s^2/SSX}$ | Simple; assumes homoskedasticity; SE ↓ as $n$ ↑ or variation in X ↑
Multiple OLS | $\sqrt{s^2 \cdot [(\mathbf{X}'\mathbf{X})^{-1}]_{jj}}$ | SE can rise or fall with more controls; watch for collinearity
IV (2SLS) | Larger than OLS | Always $\geq$ OLS SE; grows as first-stage F falls
Classical SE | Based on common $s^2$ | Valid only under homoskedasticity
Robust SE | Sandwich with $\hat{\varepsilon}_i^2$ | Valid under hetero- and homoskedasticity; use by default

The bottom line: The point estimate $\hat{\beta}$ is the same regardless of whether you use classical or robust SEs, OLS or IV. What changes is how precisely you can pin it down, and whether your inference is valid.
