Sebastian Tello

Connection between OLS and Differences in Averages

In cases where the primary variable of interest is binary, the regression of any outcome variable on this binary variable is just a difference in means, sometimes called the naive difference. But don’t just take my word for it; let’s derive it.

Let’s say we have the following regression:

$Y = \alpha + \delta D + \varepsilon$, where:

  • $Y$ is the dependent variable,
  • $D$ is a binary (indicator) variable,
  • $\alpha$ is the intercept,
  • $\delta$ is the coefficient of interest, and
  • $\varepsilon$ is the error term.

Step 1: Taking Expectations

Since $D$ is binary, we compute the expected value of $Y$ conditional on $D$:

  1. When $D = 0$:

     $E[Y \mid D = 0] = \alpha + \delta(0) + E[\varepsilon \mid D = 0] = \alpha$,

     assuming $E[\varepsilon \mid D = 0] = 0$.

  2. When $D = 1$:

     $E[Y \mid D = 1] = \alpha + \delta(1) + E[\varepsilon \mid D = 1] = \alpha + \delta$,

     assuming $E[\varepsilon \mid D = 1] = 0$.

     These assumptions are essentially a version of the zero conditional mean (mean independence) condition, which is what makes OLS unbiased.
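To make Step 1 concrete, here is a quick simulation (a minimal sketch in NumPy; the parameter values $\alpha = 2$ and $\delta = 1.5$ are hypothetical). When the conditional-mean-zero assumption holds by construction, the group averages recover $\alpha$ and $\alpha + \delta$:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, delta = 2.0, 1.5        # hypothetical "true" parameters
n = 100_000

D = rng.integers(0, 2, size=n)        # binary indicator
eps = rng.normal(0.0, 1.0, size=n)    # error with E[eps | D] = 0 by construction
Y = alpha + delta * D + eps

# The conditional sample means should approximate alpha and alpha + delta
print(Y[D == 0].mean())   # close to 2.0
print(Y[D == 1].mean())   # close to 3.5
```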

Step 2: Difference in Means Interpretation

The coefficient $\delta$ represents the difference in the expected values of $Y$ for the two groups:

$E[Y \mid D=1] - E[Y \mid D=0] = (\alpha + \delta) - \alpha = \delta.$

Thus, $\delta$ captures the difference in the mean outcome between the treatment group ($D=1$) and the control group ($D=0$).
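This is not only a statement about population expectations: in any finite sample, OLS of $Y$ on an intercept and a binary $D$ reproduces the difference in group averages exactly. A small check (a sketch using NumPy's least-squares solver on arbitrary made-up data):

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.integers(0, 2, size=500).astype(float)   # arbitrary binary regressor
Y = rng.normal(size=500)                          # arbitrary outcome data

# OLS of Y on [1, D] via least squares
X = np.column_stack([np.ones_like(D), D])
alpha_hat, delta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]

naive_diff = Y[D == 1].mean() - Y[D == 0].mean()
print(np.isclose(delta_hat, naive_diff))   # True: identical up to floating point
```

The fitted intercept likewise equals the $D=0$ group mean, mirroring $\alpha = E[Y \mid D=0]$ from Step 1.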

Now notice that, without these assumptions, the subtraction above would be

$E[Y \mid D=1] - E[Y \mid D=0] = (\alpha + \delta + E[\varepsilon \mid D=1]) - (\alpha + E[\varepsilon \mid D=0]) = E[\varepsilon \mid D=1] - E[\varepsilon \mid D=0] + \delta$

If we don’t assume mean independence, then the difference in means gives us the causal effect $\delta$ plus a bias term, $E[\varepsilon \mid D=1] - E[\varepsilon \mid D=0]$; this is just selection bias written in a different way. The same decomposition holds for the regression coefficient.
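The decomposition above can also be checked numerically. In this sketch (an assumed setup: an unobserved confounder $U$ drives both treatment and the error, and the true effect is $\delta = 1$), the naive difference equals $\delta$ plus the bias term exactly, and so overstates the causal effect:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
delta = 1.0                                      # hypothetical true causal effect

U = rng.normal(size=n)                           # unobserved confounder
D = (U + rng.normal(size=n) > 0).astype(float)   # treatment depends on U
eps = U + rng.normal(size=n)                     # so E[eps | D=1] != E[eps | D=0]
Y = delta * D + eps                              # alpha = 0 for simplicity

naive = Y[D == 1].mean() - Y[D == 0].mean()
bias = eps[D == 1].mean() - eps[D == 0].mean()
print(naive)           # delta + bias, not delta
print(delta + bias)    # matches the decomposition term by term
```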