In cases where the primary variable of interest is binary, the coefficient from a regression of any outcome variable on this binary variable is just a difference in means, i.e., the naive difference. But don't just take my word for it; let's derive it.
Let’s say we have the following regression:
$$Y_i = \alpha + \delta D_i + \varepsilon_i,$$ where:
- $Y_i$ is the dependent variable,
- $D_i$ is a binary (indicator) variable,
- $\alpha$ is the intercept,
- $\delta$ is the coefficient of interest, and
- $\varepsilon_i$ is the error term.
Step 1: Taking Expectations
Since $D$ is binary, we compute the expected value of $Y$ conditional on $D$:
- When $D = 1$: $E[Y \mid D = 1] = \alpha + \delta + E[\varepsilon \mid D = 1]$
- When $D = 0$: $E[Y \mid D = 0] = \alpha + E[\varepsilon \mid D = 0]$
Assuming $E[\varepsilon \mid D = 1] = 0$, the first line simplifies to $E[Y \mid D = 1] = \alpha + \delta$.
Assuming $E[\varepsilon \mid D = 0] = 0$, the second line simplifies to $E[Y \mid D = 0] = \alpha$.
These assumptions (mean independence of the error term from $D$) are essentially a version of unbiasedness.
Step 2: Difference in Means Interpretation
The coefficient $\delta$ represents the difference in the expected values of $Y$ for the two groups:

$$E[Y \mid D = 1] - E[Y \mid D = 0] = (\alpha + \delta) - \alpha = \delta.$$

Thus, $\delta$ captures the difference in the mean outcome between the treatment group and the control group.
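To check this numerically, here is a minimal simulation sketch (Python with numpy and statsmodels, assumed tools rather than anything from the derivation itself). It draws a binary $D$, generates $Y$ with a true effect of 2, and compares the OLS slope on $D$ to the raw difference in group means.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000

# Binary treatment D and an outcome with true effect delta = 2.0
D = rng.integers(0, 2, size=n)
alpha, delta = 1.0, 2.0
eps = rng.normal(0, 1, size=n)   # error unrelated to D, so E[eps | D] = 0 holds
Y = alpha + delta * D + eps

# Naive difference in means between the D = 1 and D = 0 groups
diff_in_means = Y[D == 1].mean() - Y[D == 0].mean()

# OLS regression of Y on D (with an intercept)
ols = sm.OLS(Y, sm.add_constant(D)).fit()

print(diff_in_means)   # close to 2.0
print(ols.params[1])   # the same number as the difference in means
```

In the sample, the two numbers agree to floating-point precision, not just approximately, because the OLS slope on a single binary regressor is algebraically the difference in group means.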
Now notice that without the assumptions, the subtraction above would be

$$E[Y \mid D = 1] - E[Y \mid D = 0] = \delta + \big(E[\varepsilon \mid D = 1] - E[\varepsilon \mid D = 0]\big).$$

If we don't assume unbiasedness, then the difference in means gives us the causal effect plus bias; the extra term $E[\varepsilon \mid D = 1] - E[\varepsilon \mid D = 0]$ is just that bias written in a different way. This would also be true for regression.
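For completeness, here is the same kind of sketch for the biased case, under an assumed selection story where an unobserved variable pushes people into treatment and also raises the outcome, so $E[\varepsilon \mid D = 1] \neq E[\varepsilon \mid D = 0]$. Both the naive difference in means and the regression coefficient recover the causal effect plus the bias term, exactly as in the expression above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100_000

# Unobserved "ability" both pushes people into treatment and raises Y,
# so E[eps | D = 1] != E[eps | D = 0] and the unbiasedness assumption fails.
ability = rng.normal(0, 1, size=n)
D = (ability + rng.normal(0, 1, size=n) > 0).astype(int)
alpha, delta = 1.0, 2.0
eps = ability + rng.normal(0, 1, size=n)
Y = alpha + delta * D + eps

bias = eps[D == 1].mean() - eps[D == 0].mean()   # E[eps|D=1] - E[eps|D=0]
diff_in_means = Y[D == 1].mean() - Y[D == 0].mean()
ols_slope = sm.OLS(Y, sm.add_constant(D)).fit().params[1]

print(delta + bias)      # causal effect plus bias
print(diff_in_means)     # equals delta + bias
print(ols_slope)         # regression returns the same biased number
```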