Purpose
The objective of this homework is for you to practice concepts learned in class and apply them to a real-world scenario. The concepts we will practice in this homework relate to difference-in-differences.
Guidelines
- Work will be independent.
- Submit your answers to gradescope (within Canvas).
- We encourage you to use the boxes. PDFs, JPGs, and PNGs are preferable to Word documents or CSVs. Recall that you can always save something as a PDF. You can also take a screenshot of anything: on Windows, use the Snipping Tool or Windows+Shift+S; on a Mac, use Command+Shift+4.
- Submit your do-file to gradescope (within Canvas).
- You will get points for correct answers. Points will be deducted if an answer contains unnecessary information or mixes incorrect statements with correct ones. In short, we are trying to incentivize you to use the fewest characters while maximizing the accuracy of your responses.
- Your responses should be professionally formatted and written.
- The due date is Monday May 1st, at 11:59 pm EDT.
- Data can be found at https://www.dropbox.com/s/fosdbw9jbupbdbv/micro.dta?dl=0
Preamble
In this assignment you’ll study the Earned Income Tax Credit (EITC), one of the largest anti-poverty programs in the United States. The EITC provides more than $67 billion to more than 27 million Americans each year. The data for this exercise come from the Current Population Survey (CPS), which you can access online. The data provided are a cleaned version of the CPS, a real dataset from a government survey. You will use these data and a difference-in-differences design to understand the impact of the EITC on women’s labor supply. Before you start, get acquainted with the basic structure of the EITC: What is the EITC? Who is eligible? How much support can beneficiaries receive? (The Tax Policy Center and the Center on Budget and Policy Priorities offer two good primers you can read. Jeffrey Liebman provides a more thorough history.)
Getting acquainted with the data
The data for this homework come from the Current Population Survey. The University of Minnesota maintains an excellent website for downloading CPS data. We are using the Annual Social and Economic Supplement (ASEC) for this analysis. You’ll often see the ASEC referred to as the “March CPS” since it’s collected each year in March. The data include all years from 1980 through 2015. We’ll use incwage to measure annual earnings. Once you have the data up and running, perform the following data cleaning steps:
- Since we need to compare apples to apples, convert nominal earnings to 2015 real dollars using the Consumer Price Index (CPI). The CPI is included in the data, so use what you know from economics to turn nominal income into real income. In the end, your average real income variable should look like:
- Restrict your sample to single women between the ages of 25 and 45 (inclusive). These women are likely the primary earners in their households. You should exclude women who were married when surveyed but include those who were never married, separated, divorced, or widowed.
- Since the EITC targets low-income workers, we’ll focus on women who never attended college. Restrict your sample to women whose educational attainment was a high school diploma, GED, or less.
- We’ll use incwage to measure annual earnings. We’ll refer to women as employed if they had positive earnings in a given year. Create a binary variable that identifies employed respondents.
- Because EITC benefits vary with family size, we’ll explore how female employment varied with number of children. We’ll focus on three groups: women with zero kids, those with one kid, and those with two or more kids. Create binary variables that identify these groups.
- As always, you should investigate the data for missing values and outliers before proceeding. You needn’t report any specific summary statistics from that investigation for this assignment, however. Once the sample restrictions are applied, do not drop any further observations during this step; just learn about outliers and missing values.
- Here are summary statistics for some (not all) of the variables. Your data after cleaning should produce the same values.
*2024
sum year serial asecwth cpsid statefip pernum cpsidp age cpi employed
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
year | 100,314 1998.244 10.10372 1980 2014
serial | 100,314 41936.16 26323.77 2 99983
asecwth | 100,314 1594.002 1023.859 0 19512.07
cpsid | 100,314 1.49e+13 8.67e+12 0 2.02e+13
statefip | 100,314 27.20708 15.3888 1 56
-------------+---------------------------------------------------------
pernum | 100,314 1.866459 1.350947 1 21
cpsidp | 100,314 1.49e+13 8.67e+12 0 2.02e+13
age | 100,314 33.0159 5.977183 25 45
cpi | 100,314 .7014234 .1910864 .3476793 .9987342
employed | 100,314 .6674642 .4711242 0 1
. sum year serial hwtsupp cpsid statefip pernum cpsidp wtsupp age cpi employed
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
year | 162,763 1997.625 10.38741 1980 2015
serial | 162,763 42198.61 25727.67 1 99984
hwtsupp | 162,763 1578.355 1001.286 0 27403.12
cpsid | 122,330 1.45e+13 8.93e+12 0 2.02e+13
statefip | 162,763 27.61705 15.50662 1 56
-------------+---------------------------------------------------------
pernum | 162,763 1.743584 1.265402 1 21
cpsidp | 122,330 1.45e+13 8.93e+12 0 2.02e+13
wtsupp | 162,763 1585.084 1005.075 0 27403.12
age | 162,763 34.32917 6.134879 25 45
cpi | 162,763 .6891565 .1954211 .3476793 1
-------------+---------------------------------------------------------
employed | 162,763 .6988382 .4587643 0 1
use micro, clear
// sample restrictions
keep if inrange(year, 1980, 2015)
keep if (sex == "Female":sex_lbl)
keep if inrange(age, 25, 45)
keep if inrange(marst,"Separated":marst_lbl,"Never married/single":marst_lbl)
keep if (educ < "1 year of college":educ_lbl)
// define employment
assert !missing(incwage)
gen employed = (incwage > 0)
// create child groups
replace nchild = 2 if (nchild > 2)
save micro2, replace
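The cleaning steps above also ask for real earnings and for binary family-size indicators, which the do-file so far does not create. A minimal sketch follows, assuming the supplied cpi variable is scaled so that 2015 equals 1 (its maximum in the summary output is 1). The names rincwage, kids0, kids1, and kids2plus are illustrative, not part of the original data.

```stata
use micro2, clear

// Assumption: cpi is scaled so that 2015 = 1, so dividing nominal
// earnings by cpi expresses them in 2015 dollars.
gen rincwage = incwage / cpi
label variable rincwage "Wage income (2015 dollars)"

// Binary indicators for the three family-size groups.
gen byte kids0     = (nchild == 0)
gen byte kids1     = (nchild == 1)
gen byte kids2plus = (nchild >= 2)

// Sanity check: every respondent falls in exactly one group.
assert kids0 + kids1 + kids2plus == 1
```

If the CPI turns out to be scaled differently, divide by cpi and multiply by its 2015 value instead.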
Graphing the Big Picture
We’ll start by graphing some big picture trends in employment and earnings for this sample.
- Create a graph that shows how female employment rates varied with family size from 1980 to 2015. The y-axis should be the employment rate (ranging from 0 to 1 or 0 to 100, depending on how you coded the variable) and the x-axis should be years. Your graph should have three lines, one for each group: zero kids, one kid, and two or more kids. You may find Stata’s collapse command helpful in dividing your dataset into groups. Be sure to weight your data by the sample weight ([pw=wtsupp]) when calculating the employment rate. You can use weights with the collapse command. Add a vertical line at the year 1993. I invite you to use the scheme plotplainblind, but it is not necessary. In gradescope put (a) your code that replicates the figure and (b) the figure. Title your axes!
- Create a similar graph (as above) that shows how women’s average earnings varied with family size from 1980 to 2015. Income should range from $0 to $25,000.
- Researchers have used difference-in-differences (DD) to measure the effects of the 1993 EITC reforms on female employment and earnings. The DD strategy compares women with two or more children (the treated group) to childless women, who were arguably less affected by the EITC expansion. Use your graphs to assess whether DD is an appropriate strategy for this analysis, think of: What assumption has to hold for DD to capture the causal effect of the EITC reforms on employment? Do your graphs support that assumption?
// get annual averages by child groups
collapse incwage employed [pw = wtsupp], by(nchild year)
reshape wide incwage employed, i(year) j(nchild)
// plot employment by child group
line emp* year, ///
title("Employment Rates for Single Women with No College Education", color(black) size(medium)) ///
scheme(plotplainblind) ylabel(0(.1)1, grid gstyle(major)) ///
xlabel(1980(2)2015, angle(45)) ///
xmtick(1980(1)2015, tstyle(minor_nolabel)) ///
xline(1993, lpattern(dash) lcolor(black)) ///
legend(label(1 "no kids") label(2 "one kid") label(3 "two or more kids") ring(0))
graph export employment.png, replace
line inc* year, ///
title("Earnings for Single Women with No College Education", color(black) size(medium)) ///
scheme(plotplainblind) ///
ylabel(0(5000)25000, grid gstyle(major)) ///
ymtick(0(2500)25000, grid tstyle(minor_nolabel) gstyle(minor)) ///
xlabel(1980(2)2015, angle(45)) ///
xmtick(1980(1)2015, tstyle(minor_nolabel)) ///
xline(1993, lpattern(dash) lcolor(black)) ///
legend(label(1 "no kids") label(2 "one kid") label(3 "two or more kids") ring(0))
graph export earnings.png, replace
Doing Difference-in-Differences
We’ll now use DD regressions to estimate the causal effects of EITC expansions on female employment. For this part of the analysis, we’ll make three extra sample restrictions to make our analysis more clean and straightforward:
- We’ll only compare childless women to those with two or more children. You can drop women with one child from your sample entirely.
- Since the Great Recession also greatly affected employment, we’ll exclude it from our analysis and focus on the years before it. Drop observations from 2008 onward.
- We’ll also exclude the years 1994-1999 from our analysis. EITC reforms were phased in slowly during the 1990s, so these interim years were not fully “treated” by the expansion. We’ll therefore define the post-reform period as the years 2000-2007, when the expansion was in full effect. Your analysis sample should include only the years 1980-1993 and 2000-2007. (DD can be adapted for treatments that are phased in slowly over time. Though those tools are beyond the scope of this assignment, you can read more about them in Hoynes and Patel (2016).)
- The simplest DD specification has four regressors:
- Run the DD regression in (1) using the employment dummy as your outcome variable. Interpret each of the parameter estimates in words that everyone can understand; don’t use jargon from class. Which parameter captures the causal effect of the EITC expansion on mothers’ employment?
- Replace the post-period dummy in regression (1) with a full set of year fixed effects, i.e. a separate dummy variable for each year in the sample:
- Women with children tend to be older than childless women, and older women are also more likely to work. Age is therefore a potential source of bias in this analysis. Add the age of the respondent as a control variable in your regression:
- Female employment rates also vary widely across states. Add state fixed effects to your model in step 3. Answer the following questions: How do these controls alter the estimated effect of EITC expansion on employment? Do these results strengthen or weaken your confidence in the DD strategy?
- Create a table that shows the main results from the regressions above. Each column should be a different regression, and the rows should have the main coefficients (not the fixed effects) with their standard errors and *s to indicate their statistical significance. Make sure to add footnotes to explain the *s, and any other important notes. (Hint: use the esttab command.) Present the code and the table.
- Repeat steps 1-4 using annual earnings as the outcome variable instead of employment.
- Present your results from these 8 regressions (4 for earnings, 4 for employment) in a single (professional looking) table. You only need to include the main DD estimate from each regression. No need to report the coefficients from all of the other year effects, control variables, etc. in your table. If you want to do it all within Stata and esttab, check out appendmodels and the examples of how to use it online.
- Briefly interpret your results from the earnings regressions. What is the main result from the simple DD specification? How does that estimate change as you add the controls in steps 2-4?
- Present the results of the event study for the outcome “employment” in a graph. This means making a plot where year is on the x-axis and the year-specific interaction coefficients from the regression are on the y-axis, so the y-axis covers the possible range of the coefficients. Make sure you also plot their confidence intervals. Because we have dropped the years 1994-1999, you will notice a “dead” space in the graph. You can fix this by subtracting 6 from the year variable for years from 2000 onward, which produces a continuous graph. Hint: check out the commands regsave or coefplot.
- How are the coefficients from the event study related to the graph of employment that you made above?
- Given the results from the graph, what is your assessment of parallel trends?
- Write a tweet-thread of your main findings so that everyone can understand. Lots of things you can talk about: context, findings, size, length of effects, to who, implication, etc. A tweet itself is about 250 characters, so a max of 8 tweets. Feel free to include figures or pics!
use micro2, clear
// sample restrictions
drop if (nchild == 1)
drop if inrange(year, 1994, 1999) | (year >= 2008)
*drop if (year >= 2008)
// DD regressors
gen post = (year > 1993)
gen treat = (nchild == 2)
gen postTreat = post*treat
In regression (1), the treatment dummy identifies women who had two or more children in a given year, and the post dummy identifies years from the post-reform period. You are about to run a total of eight regressions, and at the end we ask you to make two tables, so we recommend reading the whole set of instructions first and then writing your code accordingly. Create all the variables that are needed, and make sure you add the weighting option to all your regressions: reg y x [pw=wtsupp]
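The equation for regression (1) is not reproduced in this text. A standard two-group, two-period form that matches the description and the variables built in the do-file (treat, post, postTreat) might look like the following. The symbols are assumptions chosen for illustration and may differ from the notation used in class.

```latex
% Sketch of regression (1); notation is assumed, not taken from the original handout.
% Y_{it}: outcome (employment dummy or earnings) for woman i observed in year t
% Treat_{it} = 1 if she has two or more children; Post_t = 1 for post-reform years
\begin{equation*}
Y_{it} = \beta_0 + \beta_1 \,\mathrm{Treat}_{it} + \beta_2 \,\mathrm{Post}_{t}
       + \beta_3 \,(\mathrm{Treat}_{it} \times \mathrm{Post}_{t}) + \varepsilon_{it}
\end{equation*}
```

Counting the constant, this gives four regressors. In this notation the interaction term carries the DD estimate, which corresponds to postTreat in the code.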
For the year fixed-effects model: How does this model differ conceptually from the regression in (1)? That is, what do the year effects capture that the single post-period dummy did not? Does this change alter your estimated effect of the EITC expansion on employment?
For the model with the age control: Interpret your estimate in words (talk about statistical and economic significance). In addition, discuss the following: does controlling for age alter the estimated effect of the EITC expansion on employment? Are you more (or less) confident about giving our estimate a causal interpretation?
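The equations for the richer specifications are likewise missing from this text. One way to write them, consistent with the regressions in the do-file (i.year, age, i.statefip), is sketched below. All symbols are assumptions: Y is the outcome, Treat marks women with two or more children, Post marks post-reform years, and s indexes the state of residence.

```latex
% Sketches of the specifications with year effects, age, and state effects.
% Notation is assumed for illustration.
\begin{align*}
\text{Year FE:}  \quad Y_{it}  &= \beta_1 \,\mathrm{Treat}_{it}
    + \beta_3 \,(\mathrm{Treat}_{it} \times \mathrm{Post}_{t}) + \lambda_t + \varepsilon_{it} \\
\text{+ Age:}    \quad Y_{it}  &= \beta_1 \,\mathrm{Treat}_{it}
    + \beta_3 \,(\mathrm{Treat}_{it} \times \mathrm{Post}_{t}) + \beta_4 \,\mathrm{Age}_{it}
    + \lambda_t + \varepsilon_{it} \\
\text{+ State FE:} \quad Y_{ist} &= \beta_1 \,\mathrm{Treat}_{it}
    + \beta_3 \,(\mathrm{Treat}_{it} \times \mathrm{Post}_{t}) + \beta_4 \,\mathrm{Age}_{it}
    + \lambda_t + \mu_s + \varepsilon_{ist}
\end{align*}
```

The year effects absorb the post-period dummy, which is why it no longer appears separately once they are included.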
// run DD regressions
estimates clear
local as = 0
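// addstats is a user-written helper that is not defined in this file.
// Models are stored in order: est1-est5 (employed), est6-est10 (incwage).
foreach y in employed incwage {
    // (1) baseline DD
    eststo: reg `y' post treat postTreat [pw = wtsupp]
    local as = `as' + 1
    addstats `as' postTreat
    // (2) year fixed effects replace the post dummy
    eststo: reg `y' i.year treat postTreat [pw = wtsupp]
    local as = `as' + 1
    addstats `as' postTreat
    // (3) add age
    eststo: reg `y' i.year treat postTreat age [pw = wtsupp]
    local as = `as' + 1
    addstats `as' postTreat
    // (4) add state fixed effects
    eststo: reg `y' i.year treat postTreat age i.statefip [pw = wtsupp]
    local as = `as' + 1
    addstats `as' postTreat
    // (5) same as (4) but without the treat main effect
    eststo: reg `y' i.year postTreat age i.statefip [pw = wtsupp]
    local as = `as' + 1
    addstats `as' postTreat
}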
// display results Employed
esttab est1 est2 est3 est4 est5, keep(postTreat post treat _cons) ///
se label title(Effects of EITC on Employment) ///
mtitle(Baseline "+Year FE" "+Age" "+ State FE") ///
coeflabels(postTreat "Post x Treat") ///
stats(N r2 premean postmean perceffect2,label("N" "R-Squared" "Pre Mean" "Post Mean" "Perc. Effect") fmt (%9.0gc %7.2f ))
esttab est1 est2 est3 est4, keep(postTreat) ///
se label title(Effects of EITC on Employment) ///
mtitle(Baseline "+Year FE" "+Age" "+ State FE") ///
coeflabels(postTreat "Post x Treat") ///
stats(N r2 premean postmean perceffect2,label("N" "R-Squared" "Pre Mean" "Post Mean" "Perc. Effect") fmt (%9.0gc %7.2f ))
* Earnings
esttab est6 est7 est8 est9, keep(postTreat) ///
se label title(Effects of EITC on Earnings) ///
mtitle(Baseline "+Year FE" "+Age" "+ State FE") ///
coeflabels(postTreat "Post x Treat") ///
stats(N r2 premean postmean perceffect2,label("N" "R-Squared" "Pre Mean" "Post Mean" "Perc. Effect") fmt (%9.0gc %7.2f %9.0gc %9.0gc %7.2f))
7. Creating an event study. Graphing an event study is important for assessing the parallel trends assumption. For the employment outcome, run the model below, which includes year fixed effects, state fixed effects, age, the treatment group variable (two or more kids), and the interaction of the treatment group variable with every year dummy. You will notice that instead of having one DD coefficient, we have several, one for each year. This specification is the event study. Pick the year 1992 as the base.
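The event-study equation itself is missing from this text. A form consistent with the description and with the regression in the do-file (i.treat#i.year i.year i.treat age i.statefip, base year 1992) could be written as follows. The notation is assumed for illustration only.

```latex
% Sketch of the event-study specification; notation is assumed.
% The sum runs over every sample year k except the base year 1992.
\begin{equation*}
Y_{ist} = \gamma \,\mathrm{Treat}_{it}
        + \sum_{k \neq 1992} \beta_k \,\bigl(\mathrm{Treat}_{it} \times \mathbf{1}[t = k]\bigr)
        + \delta \,\mathrm{Age}_{it} + \lambda_t + \mu_s + \varepsilon_{ist}
\end{equation*}
```

Each coefficient in the sum measures the gap between mothers and childless women in year k relative to the same gap in 1992, so pre-1993 coefficients close to zero are what parallel trends would predict.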
* Event study
use micro2, clear
// sample restrictions
drop if (nchild == 1)
drop if inrange(year, 1994, 1999) | (year >= 2008)
*drop if (year >= 2008)
// DD regressors
gen post = (year > 1993)
gen treat = (nchild == 2)
gen postTreat = post*treat
fvset base 1992 year
reg employed i.treat#i.year i.year i.treat age i.statefip [pw = wtsupp]
* coefplot, keep(1.treat#*.year) vertical recast(connected) scheme(plotplainblind) rename(1.treat#1980.year "1980")
regsave using hw6_es.dta, replace addlabel(y,employed) ci
fvset base 1992 year
reg incwage i.treat##i.year age i.statefip [pw = wtsupp]
regsave using hw6_es.dta, append addlabel(y,incwage) ci
use hw6_es.dta, clear
gen year=.
forvalues i=1980/2007 {
replace year=`i' if var=="1.treat#`i'.year"
}
replace year=1992 if var=="1.treat#1992b.year"
replace year=1992 if var=="1o.treat#1992b.year"
replace year=year-6 if year>=2000
cap label drop year
label define year ///
1994 "2000" ///
1995 "2001" ///
1996 "2002" ///
1997 "2003" ///
1998 "2004" ///
1999 "2005" ///
2000 "2006" ///
2001 "2007"
label values year year
twoway ///
(connected coef year if y=="employed", sort) ///
(line ci_lower year if y=="employed", sort lpattern(dot)) ///
(line ci_upper year if y=="employed", sort lpattern(dot)) if coef<=0.4, ///
scheme(plotplainblind) legend(off) xline(1993) xlabel(1980(1)2001, valuelabel angle(45)) xtitle("") ylabel(-0.4(0.1)0.4) ///
title(Event study: Effect of EITC on Women's employment)
graph export "eventstudy_employment.png", replace
use micro2, clear
// sample restrictions
drop if (nchild == 1)
drop if inrange(year, 1994, 1999) | (year >= 2008)
*drop if (year >= 2008)
// DD regressors
gen post = (year > 1993)
gen treat = (nchild == 2)
gen postTreat = post*treat
fvset base 1992 year
reg employed i.treat#i.year i.year i.treat age i.statefip [pw = wtsupp]
coefplot, keep(1.treat#*.year) vertical recast(connected) scheme(plotplainblind) xlabel(,angle(45)) ylabel(,angle(horizontal)) title(Event study: Effect of EITC on Women's employment)
* or
coefplot, keep(1.treat#*.year) vertical recast(connected) scheme(plotplainblind) xlabel(,angle(45)) ylabel(,angle(horizontal)) title(Event study: Effect of EITC on Women's employment) rename(1.treat#1980.year="1980" 1.treat#1981.year="1981" 1.treat#1982.year="1982" 1.treat#1983.year="1983" 1.treat#1984.year="1984" 1.treat#1985.year="1985" 1.treat#1986.year="1986" 1.treat#1987.year="1987" 1.treat#1988.year="1988" 1.treat#1989.year="1989" 1.treat#1990.year="1990" 1.treat#1991.year="1991" 1.treat#1993.year="1993" 1.treat#2000.year="2000" 1.treat#2001.year="2001" 1.treat#2002.year="2002" 1.treat#2003.year="2003" 1.treat#2004.year="2004" 1.treat#2005.year="2005" 1.treat#2006.year="2006" 1.treat#2007.year="2007")
Extensions
- Given the clean data, see if you can replicate the same sample with the same variables but downloading it from CPS.
- If you need help processing the information from the graphs, ask yourself the following questions: what was the employment rate/earnings for childless women in 1990? In 2000? What was the employment rate/earnings for women with two or more kids in 1990? How did it change by 2000? Express your answer as both a difference in (approximate) percentage points and as a percentage change. Notice how the big shift in mothers’ employment started after 1993, when Congress reformed the EITC in the Omnibus Budget Reconciliation
- This analysis makes a pretty strong case that the EITC expansion achieved one of its major goals: increasing labor force participation among single mothers. When we analyze policies, we want to be mindful of their benefits and costs (in fact, you can even take a class on this next year!). For the purpose of this class, we are focusing on methods, but we don’t like to abstract ourselves from the interesting policy discussion around these topics, so let’s think about costs and benefits: Is increasing employment necessarily good for low-income, female-headed households? What other outcomes might you want to study (besides earnings and employment) in assessing the costs and benefits of the expansion? Limit your answer to 150 words.
- Now that you’ve done this homework you may find these two (short) readings a bit more entertaining: Safety Programs in the U.S, The Success of the Earned Income Tax Credit or this report. Finally, another important policy question is whether the program “pays for itself.” Fortunately, there are people moving the needle in this area; check the work by Bastian and Jones (2021). Notice the publication date: these are timely questions.
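For the graph-reading questions above, which ask for both a percentage-point difference and a percentage change, the two calculations can be written out as below. The symbols and the numbers in the example are purely illustrative and are not taken from the data.

```latex
% e_{1990}, e_{2000}: employment rates (as fractions) for one group in each year.
\begin{align*}
\text{Percentage-point change} &= 100 \times (e_{2000} - e_{1990}) \\
\text{Percentage change}       &= 100 \times \frac{e_{2000} - e_{1990}}{e_{1990}}
\end{align*}
% Illustrative example: a rise from 0.50 to 0.60 is a 10 percentage-point
% change and a 20 percent change.
```

Computing the percentage-point change for each group and subtracting the childless group's change from the mothers' change gives a rough, by-hand version of the DD comparison.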