This worksheet explains what standard errors are, where they come from, and why we use "robust" versions. We use one dataset and one example from start to finish so you can see how the same concept evolves across different estimation settings.
Dataset: nlsw88 (National Longitudinal Survey of Women, 1988), built into Stata.
Running example throughout:
- Outcome Y: wage (hourly wage in dollars)
- Main regressor X: ttl_exp (total work experience in years)
- We add covariates and switch to IV as we go
Section 1: What Is a Standard Error?
The Core Idea
Suppose you survey 50 women and estimate the average hourly wage. You get $7.57. A different sample of 50 would give a slightly different number — maybe $7.43, or $7.71. Repeat this thousands of times and you'd have a distribution of estimates. The standard deviation of that distribution is the standard error.
The same logic applies to regression coefficients. Our estimate $\hat{\beta}$ comes from one sample. A different sample would give a different $\hat{\beta}$. The standard error of $\hat{\beta}$ measures how much it would vary across repeated samples from the same population.
Technical definition: The standard error of $\hat{\beta}$ is the standard deviation of its sampling distribution, $SE(\hat{\beta}) = \sqrt{\widehat{Var}(\hat{\beta})}$.
Non-technical definition: The standard error is your "margin of error" for a regression coefficient. A small SE means the data pin down the estimate precisely. A large SE means the estimate is noisy — you'd get very different numbers with a different sample. Hence, when we use language like "an estimate is precise" or "the covariates add precision," what we are really saying is that the standard errors are smaller.
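The repeated-sampling thought experiment above can be approximated directly in Stata with a bootstrap. This is a sketch, not part of the original example: the seed, the 1,000 replications, and the sample size of 50 are illustrative choices. The reported bootstrap standard error estimates the SE of the sample mean.

```stata
sysuse nlsw88, clear
* Approximate the sampling distribution of the mean wage:
* repeatedly draw samples of 50 (with replacement) and record each mean.
set seed 12345
bootstrap mean_wage = r(mean), reps(1000) size(50): summarize wage
```

The "Std. err." column in the bootstrap output is the standard deviation of those 1,000 sample means — exactly the quantity the definition above describes.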
What Kind of Uncertainty Does the SE Capture?
This is the most important thing to understand about standard errors, and the thing most students get wrong.
The SE captures one specific type of uncertainty: sampling uncertainty. That is, the uncertainty that comes from using a finite random sample to estimate something about a population. If you drew a different random sample from the same population and ran the same regression, you'd get a slightly different $\hat{\beta}$. The SE tells you how much that estimate would bounce around across hypothetical repeated samples.
Here is a table that summarizes other types of uncertainty:

| Type of uncertainty | What it means | Does SE capture it? |
|---|---|---|
| Sampling uncertainty | We have one sample, not the whole population | ✓ Yes — this is exactly what SE measures |
| Model uncertainty | Maybe the true relationship isn't linear, or we're missing key controls | ✗ No |
| Causal uncertainty | Maybe $\hat{\beta}$ is a correlation, not a causal effect (OVB, reverse causality) | ✗ No |
| Measurement uncertainty | Maybe wages or experience are recorded with error | ✗ No |
| External validity | Maybe the result doesn't generalize to other populations or time periods | ✗ No |
In our running example: reg wage ttl_exp gives $\hat{\beta} = 0.331$ with $SE = 0.025$. The small SE tells us: if we drew another random sample of 2,246 women from the same 1988 population and ran the same regression, our estimate would likely be close to 0.331. It does not tell us:
- Whether 0.331 is the causal effect of experience on wages (or just a correlation driven by unobserved ability)
- Whether the linear model is correctly specified (maybe returns to experience are quadratic)
- Whether this finding generalizes to men, or to workers today
Why this matters: A coefficient can be extremely precisely estimated — tiny SE, highly "statistically significant," t-stat of 13 — and still be completely misleading if the model is misspecified or the coefficient doesn't identify a causal effect. Precision and validity are different things. The SE answers the question "how reliable is this estimate from this sample?" It does not answer "am I estimating the right thing?"
When a classmate says "this coefficient is significant" they mean the SE is small relative to the estimate — they are making a statement about sampling uncertainty only. All the other sources of uncertainty require separate arguments: a research design that addresses causal identification, robustness checks for model specification, and so on.
Why Standard Errors Matter
The SE appears in three places you'll use constantly:
- t-statistic: $t = \hat{\beta} / SE(\hat{\beta})$ — used to test whether $\beta = 0$
- 95% confidence interval: $\hat{\beta} \pm 1.96 \times SE(\hat{\beta})$
- p-value: derived from the t-statistic and degrees of freedom
If we get the SE wrong, all of these are wrong — even if $\hat{\beta}$ is correct.
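All three quantities can be computed by hand from Stata's stored results after the regression — a sketch using the built-ins `_b`, `_se`, `invttail`, and `ttail`:

```stata
sysuse nlsw88, clear
reg wage ttl_exp
* t-statistic for H0: beta = 0
display "t = " _b[ttl_exp] / _se[ttl_exp]
* 95% confidence interval using the t critical value
scalar tcrit = invttail(e(df_r), 0.025)
display "95% CI: [" _b[ttl_exp] - tcrit*_se[ttl_exp] ", " ///
    _b[ttl_exp] + tcrit*_se[ttl_exp] "]"
* two-sided p-value
display "p = " 2*ttail(e(df_r), abs(_b[ttl_exp]/_se[ttl_exp]))
```

These should reproduce the t, P>|t|, and confidence interval columns of the regression table.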
Section 2: Standard Errors in Simple Bivariate OLS
Our Starting Regression
sysuse nlsw88, clear
reg wage ttl_exp

Output:
------------------------------------------------------------------------------
wage | Coefficient Std. err. t P>|t|
-------------+----------------------------------------------------------------
ttl_exp | .3314291 .0254087 13.04 0.000
_cons | 3.612492 .3393469 10.65 0.000
------------------------------------------------------------------------------

The SE on ttl_exp is 0.0254. Where does this come from?
The Formula
For simple bivariate OLS (one regressor), the standard error of $\hat{\beta}_1$ is:

$$SE(\hat{\beta}_1) = \sqrt{\frac{s^2}{SS_X}}$$

Where:
- $s^2 = SSR / (n-2)$ is the estimated variance of the residuals
- $SSR = \sum_i \hat{u}_i^2$ is the sum of squared residuals
- $SS_X = \sum_i (x_i - \bar{x})^2$ is the total variation in X
- $n-2$ is the degrees of freedom (n observations minus 2 estimated parameters)
What this formula tells you: SE is small when residuals are small (good model fit), X varies a lot, and sample size is large.
Computing the SE by Hand
Let's verify that Stata's SE of 0.0254 matches the formula.
Step 1: Get SSR and degrees of freedom
reg wage ttl_exp
display "SSR (sum of squared residuals): " e(rss)
display "n: " e(N)
display "Denominator df (n-2): " e(df_r)

Step 2: Compute $s^2$ (estimated error variance)
scalar s2 = e(rss) / e(df_r)
display "s^2 = " s2
display "Root MSE (= s) = " sqrt(s2)
* Root MSE is also reported in the Stata regression header

Step 3: Compute SSX (variation in X)
quietly summarize ttl_exp
scalar SSX = r(Var) * (r(N) - 1)
display "SSX = Var(ttl_exp) x (n-1) = " SSX

Step 4: Put it together
scalar my_SE = sqrt(s2 / SSX)
display "SE by hand: " my_SE
display "SE from Stata regression: " _se[ttl_exp]
display "Difference (should be ~0): " my_SE - _se[ttl_exp]

The Formula as a Story
| What changes | Effect on SE | Intuition |
|---|---|---|
| More noise in outcome (larger $s^2$) | SE ↑ | Harder to see a clear signal |
| More variation in X (larger $SS_X$) | SE ↓ | Easier to identify the slope |
| Larger sample (larger $n$) | SE ↓ | $SS_X$ grows; $s^2$ estimated more precisely |
| No variation in X ($SS_X = 0$) | SE → ∞ | Impossible to estimate slope |
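The sample-size row of the table is easy to see directly. A sketch — the subsample of the first 200 observations is an arbitrary illustrative choice, not a random draw:

```stata
sysuse nlsw88, clear
* Same regression on a small subsample vs. the full sample
quietly reg wage ttl_exp in 1/200
display "SE with n = 200:     " _se[ttl_exp]
quietly reg wage ttl_exp
display "SE with full sample: " _se[ttl_exp]
```

The full-sample SE should be noticeably smaller, because $SS_X$ accumulates with every additional observation.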
Section 3: Standard Errors With Multiple Covariates
Adding Controls
reg wage ttl_exp union collgrad

Notice: the SE on ttl_exp fell from 0.0254 to 0.0182. Why?
The Matrix Formula
For multiple regression, the OLS estimate is $\hat{\beta} = (X'X)^{-1}X'Y$, and its variance-covariance matrix is:

$$\widehat{Var}(\hat{\beta}) = s^2 (X'X)^{-1}$$

The SE for coefficient $j$ is:

$$SE(\hat{\beta}_j) = \sqrt{s^2 \left[(X'X)^{-1}\right]_{jj}}$$

This generalizes the bivariate formula. The term $[(X'X)^{-1}]_{jj}$ is the analogue of $1/SS_X$ — it measures how much "independent" variation remains in $x_j$ after projecting out all other covariates.
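Stata stores the matrix $s^2(X'X)^{-1}$ as e(V) after every regression, so you can confirm that the square root of its diagonal reproduces the reported SEs. A quick sketch:

```stata
sysuse nlsw88, clear
reg wage ttl_exp union collgrad
* e(V) holds s^2 (X'X)^-1; its diagonal holds the squared SEs.
* ttl_exp is the first coefficient, so its variance is V[1,1].
matrix V = e(V)
display "SE on ttl_exp from e(V): " sqrt(V[1,1])
display "SE reported by Stata:    " _se[ttl_exp]
```

The two displayed numbers should be identical.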
Why Did the SE on ttl_exp Fall?
When we add union and collgrad, two competing forces act:
- $s^2$ falls — the model fits much better (less residual variance)
- Effective variation in ttl_exp falls — some variation in experience is shared with union and college

In this case, force #1 dominates: union and college absorb a lot of residual wage variation, driving down $s^2$ substantially.
Non-technical: Imagine measuring height from a shadow. If you can control for the sun angle (a covariate), the shadow becomes a more reliable signal of height — the noise falls. Adding union and college does the same: it soaks up noise in wages, making the slope on experience more precisely identified.
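Force #1 is visible in the Root MSE (Stata's estimate of $s$), which falls once the controls are added. A sketch:

```stata
sysuse nlsw88, clear
quietly reg wage ttl_exp
display "Root MSE, bivariate:     " e(rmse)
quietly reg wage ttl_exp union collgrad
display "Root MSE, with controls: " e(rmse)
```

The drop in Root MSE is the drop in residual noise that makes the slope on experience more precise.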
A Key Warning: Multicollinearity
If a new covariate is highly correlated with ttl_exp, the SE on ttl_exp can explode.
set seed 42
gen ttl_exp_noisy = ttl_exp + rnormal(0, 0.5)
reg wage ttl_exp ttl_exp_noisy
* You will see enormous SEs on both experience variables
drop ttl_exp_noisy

Non-technical: If one student always studies exactly one hour more for every hour less she sleeps, you can't separately estimate the effect of studying from the effect of sleep. Multicollinearity is this problem in regression: when two variables move together, their separate effects become impossible to estimate precisely.
Section 4: Standard Errors in IV (2SLS)
Why IV Standard Errors Are Larger
When we use IV, we use only the variation in union that is predicted by the instrument south. This "exogenous variation" is a subset of all variation in union — like shrinking the effective sample size.
Less variation to work with → larger standard errors.
A useful approximation:

$$SE_{IV} \approx \frac{SE_{OLS}}{\sqrt{R^2_{\text{first stage}}}}$$

If south explains only 2% of variation in union ($R^2 = 0.02$), then $SE_{IV} \approx SE_{OLS} / \sqrt{0.02} \approx 7 \times SE_{OLS}$ — seven times larger!
Non-technical: Imagine estimating how fertilizer affects crop yield using "whether it rained last Tuesday" as your instrument for fertilizer used. Rain predicts some fertilizer use, but only a tiny fraction of its variation. You're trying to identify a slope using only 2% of the fertilizer variation that rain "moved." Of course the estimate is imprecise.
Demonstration
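One way to carry out this demonstration — a sketch, treating union as the endogenous regressor and south as the instrument, as in the text above:

```stata
sysuse nlsw88, clear
* First stage: how much variation in union does south explain?
reg union south
display "First-stage R^2: " e(r2)
* OLS vs. 2SLS: compare the SE on union across the two tables
reg wage union ttl_exp collgrad
ivregress 2sls wage (union = south) ttl_exp collgrad
* First-stage diagnostics (F statistic)
estat firststage
```

Compare the SE on union in the OLS table to the one in the 2SLS table, and relate the ratio to the first-stage $R^2$ via the approximation above.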
Section 5: Robust Standard Errors
The Hidden Assumption in Classical SEs
Everything in Sections 2–4 assumed homoskedasticity: the variance of the error term is the same for every observation, $Var(u_i \mid X_i) = \sigma^2$ for all $i$.
For wages, this means the "unpredictable" part of wages has the same spread at all experience levels. Plausible? Probably not — higher earners likely have more variance in wages. When variance depends on X, errors are heteroskedastic: $Var(u_i \mid X_i) = \sigma_i^2$, varying across observations.
Think of predicting how long tasks take. Simple tasks (data entry) have low variance. Complex tasks (writing a dissertation) vary enormously. If your dataset mixes both, the residual uncertainty isn't constant — it depends on task type. Heteroskedasticity is exactly this: the noise in your regression varies across observations.
What Goes Wrong With Classical SEs?
The classical formula uses one $s^2$ for all observations. If variance actually differs, $s^2$ is just an average — and using an average ignores where the real uncertainty lives:
- In high-variance regions, classical SEs are too small (overconfident)
- In low-variance regions, classical SEs are too large (underconfident)
- t-tests and confidence intervals are unreliable
Note: the point estimates are unaffected by heteroskedasticity — OLS is still unbiased. Only inference is broken.
Robust (Huber-White) Standard Errors
Instead of assuming every observation has variance $\sigma^2$, use each observation's own squared residual $\hat{u}_i^2$ to estimate its own variance. The sandwich estimator is:

$$\widehat{Var}_{robust}(\hat{\beta}) = (X'X)^{-1} \left( \sum_i \hat{u}_i^2 \, x_i x_i' \right) (X'X)^{-1}$$

$(X'X)^{-1}$ is the "bread" and $\sum_i \hat{u}_i^2 \, x_i x_i'$ is the "meat."
Technical property: Consistent under heteroskedasticity of any unknown form.
Non-technical: Classical SEs assume every data point is equally "noisy." Robust SEs say: "Let each data point tell us how much it personally contributes to uncertainty." Observations with large residuals contribute more; observations with small residuals contribute less. It's a more honest accounting of where uncertainty comes from.
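For the bivariate regression, the sandwich formula for the slope collapses to $\widehat{Var}(\hat{\beta}_1) = \sum_i (x_i - \bar{x})^2 \hat{u}_i^2 / SS_X^2$, which Stata scales by $n/(n-k)$ as a small-sample correction. A hand computation — a sketch that should match `reg wage ttl_exp, robust`:

```stata
sysuse nlsw88, clear
quietly reg wage ttl_exp
predict double uhat, resid
* Build the "meat": squared X-deviations times squared residuals
quietly summarize ttl_exp
gen double xd2 = (ttl_exp - r(mean))^2
gen double meat_i = xd2 * uhat^2
quietly summarize meat_i
scalar meat = r(sum)
quietly summarize xd2
scalar SSX = r(sum)
* HC1: multiply by n/(n-k), as Stata's , robust does
scalar rob_SE = sqrt((e(N)/e(df_r)) * meat / SSX^2)
display "Robust SE by hand: " rob_SE
quietly reg wage ttl_exp, robust
display "Robust SE (Stata): " _se[ttl_exp]
```

Note how each observation enters through its own $\hat{u}_i^2$ — this is the "honest accounting" described above.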
Key takeaway: Robust SEs do not change $\hat{\beta}$ at all. They only change the SE — and thus the t-statistic, p-value, and confidence intervals.
In Stata: Just Add , robust
sysuse nlsw88, clear
* Classical standard errors
reg wage ttl_exp
* Robust standard errors
reg wage ttl_exp, robust

You will see:
- Coefficients are identical — robust SEs don't change estimates
- SEs differ — sometimes slightly, sometimes substantially
* With multiple covariates
reg wage ttl_exp union collgrad
reg wage ttl_exp union collgrad, robust
* IV with robust SEs
ivregress 2sls wage (union = south) age ttl_exp collgrad, robust

Detecting Heteroskedasticity
reg wage ttl_exp
* Breusch-Pagan formal test
* H0: errors are homoskedastic
* Reject H0 -> use robust SEs
estat hettest
* Visual: residual-vs-fitted plot
* If spread grows with fitted values: heteroskedasticity
rvfplot, yline(0)

When Does It Matter Most?
Robust SEs differ most from classical SEs when:
- Skewed outcomes (wages, income, firm revenues) — high earners create large residuals
- Binary or count outcomes — variance is mechanically tied to the mean
- Cross-sectional data with heterogeneous units (countries, individuals, firms)
For wages — our example — the distribution is right-skewed and variance almost certainly increases with experience. Always use robust SEs with wage data.
Summary Table
| Setting | SE Formula (sketch) | Key Properties |
|---|---|---|
| Bivariate OLS | $\sqrt{s^2 / SS_X}$ | Simple; assumes homoskedasticity; SE ↓ as $n$ ↑ or variation in X ↑ |
| Multiple OLS | $\sqrt{s^2 [(X'X)^{-1}]_{jj}}$ | SE can rise or fall with more controls; watch for collinearity |
| IV (2SLS) | $\approx SE_{OLS} / \sqrt{R^2_{\text{first stage}}}$ | Almost always larger than the OLS SE; grows as first-stage fit falls |
| Classical SE | Based on common $s^2$ | Valid only under homoskedasticity |
| Robust SE | Sandwich with $\hat{u}_i^2$ | Valid under hetero- and homoskedasticity; use by default |
The bottom line: Switching between classical and robust SEs never changes the point estimate — it changes only how precisely you can pin it down, and whether your inference is valid. (Switching from OLS to IV, by contrast, generally changes the estimate itself, since IV uses only the instrument-driven variation in X.)