7️⃣

Homework DD

Purpose

The objective of this homework is to practice concepts learned in class and apply them to a real-world scenario. The concepts we will practice relate to difference-in-differences.

This homework has the most “real” feeling of what data analysis is like: thinking about a policy, downloading data, cleaning it, analyzing it, and drawing conclusions! I hope you feel accomplished at the end of this homework because of all the progress you’ve made! Think of what your data-analysis abilities were before starting the program and what they are now; I hope you see your own progress!

Guidelines

Your responses should be professionally formatted and written. You can submit your answers as a Word, PDF, or Google Docs file.
Submit your do-file and answers on this submission form
You will get points for correct answers. Points will be deducted if an answer contains unnecessary information or mixes incorrect statements with correct ones. In short, we are trying to incentivize you to use the fewest characters while maximizing the accuracy of your responses.
The due date is June 30th, 2025

Preamble

In this assignment you’ll study the Earned Income Tax Credit (EITC), one of the largest anti-poverty programs in the United States. The EITC provides more than $67 billion to more than 27 million Americans each year. The data for this exercise come from the Current Population Survey (CPS), which you can access online. This is a real dataset from a government survey. You will use these data and a difference-in-differences design to understand the impact of the EITC on women’s labor supply. Get acquainted with the basic structure of the EITC: What is the EITC? Who’s eligible? How much support can beneficiaries receive? (The Tax Policy Center and the Center on Budget and Policy Priorities offer two good primers you can read. Jeffrey Liebman provides a more thorough history.)

Getting acquainted with the data

For this homework, we will use the CPS Annual Social and Economic Supplement (ASEC). You’ll often see the ASEC referred to as the “March CPS” since it’s collected each year in March. Download the March CPS data for the years 1980 through 2015. The variables to include are the ones below. The data can be found here as micro_2.dta

Once you’ve downloaded and opened the data, check that the number of observations is the same.

UPDATES:

Please note that CPI is a variable I added from the Excel file I’ve attached below; it does not come with IPUMS CPS. In short, you don’t need to select it when choosing variables to download. The data should contain the following variables:

The first step in cleaning the data is to shift the years: replace year = year - 1. This is because people surveyed in 2000 are reporting outcomes that happened in 1999.
Since we need to compare apples to apples, convert nominal earnings (incwage) to 2015 real dollars using the Consumer Price Index (CPI). The variable you need is already in the data. We have not discussed in RMDA how to turn nominal values into real values, but I hear that’s something you’ve learned in other classes ;), so let’s put it into practice! In the end, your average real income variable should look like one of these screenshots.

Restrict your sample to single women between the ages of 25 and 45 (inclusive). These women are likely the primary earners in their households. You should exclude women who were married when surveyed but include those who were never married, separated, divorced, or widowed.
Since the EITC targets low-income workers, we’ll focus on women who never attended college. Restrict your sample to women whose educational attainment was a high school diploma, GED, or less.
We’ll use incwage to measure annual earnings. We’ll refer to women as employed if they had positive earnings in a given year. Create a binary variable that identifies employed respondents.
Because EITC benefits vary with family size, we’ll explore how female employment varied with number of children. We’ll focus on three groups: women with zero kids, those with one kid, and those with two or more kids. Create binary variables that identify each of these groups.
As always, you should investigate the data for missing values and outliers before proceeding. You needn’t report any specific summary statistics from that investigation for this assignment, however. After the sample selections, do not drop any data in this process; just learn about outliers and missing values.
Here are summary statistics for some (not all) of the variables. Your data after cleaning should have the same summary statistics.

 sum year serial asecwth cpsid statefip  pernum cpsidp  age cpi employed
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        year |    158,640    1997.174    10.13177       1980       2014
      serial |    158,640    42045.73    25683.39          1      99984
     asecwth |    158,640    1577.066    1001.918          0   27403.12
       cpsid |    158,640    1.54e+13    8.39e+12          0   2.02e+13
    statefip |    158,640    27.63933     15.4911          1         56
-------------+---------------------------------------------------------
      pernum |    158,640    1.738225    1.260835          1         21
      cpsidp |    158,640    1.54e+13    8.39e+12          0   2.02e+13
         age |    158,640    34.32857    6.134668         25         45
         cpi |    158,640    .6810778    .1913255   .3476793   .9987342
    employed |    158,640    .6999685    .4582728          0          1
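The cleaning steps above can be sketched in Stata roughly as follows. This is only an illustration under stated assumptions, not the required solution: it assumes micro_2.dta sits in the working directory, that cpi is indexed so that 2015 equals 1, and that the number of children is stored in a variable called nchild (a name not confirmed by this handout). The restrictions on sex, marital status, and education are left out because they depend on the coding of variables not shown here.

```stata
* Sketch of the cleaning steps (assumes micro_2.dta is in the working directory)
use "micro_2.dta", clear

* Outcomes reported in survey year t refer to year t-1
replace year = year - 1

* Real earnings in 2015 dollars (assumes cpi is indexed so that 2015 = 1)
generate real_incwage = incwage / cpi

* Age restriction, inclusive on both ends
keep if inrange(age, 25, 45)

* Employed = positive earnings; missing earnings stay missing
generate byte employed = incwage > 0 if !missing(incwage)

* Family-size indicators (nchild is an assumed variable name)
generate byte kids0     = nchild == 0 if !missing(nchild)
generate byte kids1     = nchild == 1 if !missing(nchild)
generate byte kids2plus = nchild >= 2 if !missing(nchild)

* Compare against the summary statistics reported in the handout
summarize year age cpi employed
```

The `if !missing()` guards matter because Stata treats missing values as larger than any number, so `nchild >= 2` alone would flag missing values as true.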

Alright, now that you are done downloading and cleaning the data, let’s start doing some analysis! Note that for everything above, you are welcome to talk openly with your peers, the TAs, or the professor. In short, we are not grading you on getting to this point; we are assuming that, with these directions, you can get here.

Graphing the Big Picture

We’ll start by graphing some big picture trends in employment and earnings for this sample.

Create a graph that shows how female employment rates varied with family size from 1980-2015. The y-axis should be the employment rate (ranging from 0 to 1 or 0 to 100, depending on how you coded the variable) and the x-axis should be years. Your graph should have three lines, one for each of these groups: zero kids, one kid, and two or more kids. You may find Stata’s collapse command helpful in dividing your dataset into groups. Be sure to weight your data by the sample weight ([pw=asecwth]) when calculating the employment rate. You can use weights with the collapse command. Add a vertical line at the year 1993. I invite you to use the scheme plotplainblind, but it is not necessary. In Gradescope, put (a) the code that replicates the figure and (b) the figure. Title your axes!
Create a similar graph (as above) that shows how women’s average earnings varied with family size from 1980-2015. The income axis should range from $0 to $25,000.
Researchers have used difference-in-differences (DD) to measure the effects of the 1993 EITC reforms on female employment and earnings. The DD strategy compares women with two or more children (the treated group) to childless women, who were arguably less affected by the EITC expansion. Use your graphs to assess whether DD is an appropriate strategy for this analysis. Think about: What assumption has to hold for DD to capture the causal effect of the EITC reforms on employment? Do your graphs support that assumption?
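One possible way to build the employment graph described above is sketched below. It assumes the employed dummy already exists, that the number of children is stored in a variable named nchild (an assumption), and it introduces a hypothetical helper variable famsize to define the three groups.

```stata
* Sketch: weighted employment rate by year and family size
preserve

* famsize is a hypothetical helper: 0 = no kids, 1 = one kid, 2 = two or more
* (min() ignores missing values, hence the explicit guard)
generate byte famsize = min(nchild, 2) if !missing(nchild)

* Weighted group means; collapse accepts pweights for (mean)
collapse (mean) employed [pw=asecwth], by(year famsize)

twoway (line employed year if famsize == 0, sort) ///
       (line employed year if famsize == 1, sort) ///
       (line employed year if famsize == 2, sort), ///
       xline(1993) ///
       ytitle("Employment rate") xtitle("Year") ///
       legend(order(1 "Zero kids" 2 "One kid" 3 "Two or more kids"))

restore
```

Wrapping the block in `preserve` and `restore` keeps the individual-level data in memory for the later regressions. Swapping employed for a real-earnings variable gives the second graph.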

Doing Difference-in-Difference

We’ll now use DD regressions to estimate the causal effects of EITC expansions on female employment. For this part of the analysis, we’ll make three extra sample restrictions to make our analysis cleaner and more straightforward:

We’ll only compare childless women to those with two or more children. You can drop women with one child from your sample entirely.
Since the Great Recession also greatly affected employment, we’ll focus on the years before it. You can drop observations from 2008 onward.
We’ll also exclude the years 1994-1999 from our analysis. EITC reforms were phased in slowly during the 1990s, so these interim years were not fully “treated” by the expansion. We’ll therefore define the post-reform period as the years 2000-2007, when the expansion was in full effect. Your analysis sample should include only the years 1980-1993 and 2000-2007. (DD can be adapted for treatments that are phased in slowly over time. Though those tools are beyond the scope of this assignment, you can read more about them in Hoynes and Patel (2016).)
The simplest DD specification has four parameters:

$$Y_{it}=\alpha +\beta\, TwoKids_{it}+\gamma\, Post_{t}+\delta\, (TwoKids_{it}\times Post_t)+\epsilon_{it} \tag{1}$$

where $TwoKids$ identifies women $i$ who had two or more children in year $t$, and $Post$ identifies years from the post-reform period. You are about to run a total of eight regressions, and at the end we ask you to make two tables, so we recommend reading the whole set of instructions first and then writing your code accordingly. Create all the variables that are needed and make sure you add the weighting option to all your regressions: reg y x [pw=asecwth]
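One way to organize the code for the employment regressions described in the steps that follow is sketched here. It is a sketch under assumptions, not the expected answer: nchild is an assumed variable name, twokids, post, and twokids_post are illustrative names, and eststo and esttab come from the user-written estout package (installable with `ssc install estout`).

```stata
* Extra sample restrictions for the DD analysis
drop if nchild == 1                                   // one-child women
keep if inrange(year, 1980, 1993) | inrange(year, 2000, 2007)

* Treatment-group, post-period, and interaction indicators
generate byte twokids      = nchild >= 2 if !missing(nchild)
generate byte post         = year >= 2000
generate byte twokids_post = twokids * post

eststo clear
* (1) simple DD
eststo m1: regress employed twokids post twokids_post [pw=asecwth]
* (2) year fixed effects replace the single post dummy
eststo m2: regress employed twokids twokids_post i.year [pw=asecwth]
* (3) add age as a control
eststo m3: regress employed twokids twokids_post age i.year [pw=asecwth]
* (4) add state fixed effects
eststo m4: regress employed twokids twokids_post age i.year i.statefip [pw=asecwth]

* One column per regression; fixed effects are left out of the table
esttab m1 m2 m3 m4, se keep(twokids post twokids_post age) ///
    star(* 0.10 ** 0.05 *** 0.01)
```

Storing each model with eststo means the same pattern can be repeated with earnings as the outcome and the stored results combined afterward.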

1. Run the DD regression in (1) using the employment dummy as your outcome variable. Interpret each of the parameter estimates ($\alpha$, $\beta$, $\gamma$, and $\delta$) in words that everyone can understand; don’t use jargon from class. Which parameter captures the causal effect of the EITC expansion on mothers’ employment?
2. Replace the $Post$ dummy in regression (1) with a full set of year fixed effects – i.e. a separate dummy variable for each year in the sample:

$$Y_{it}=\alpha +\beta\, TwoKids_{it}+\vec{\gamma_t}+\delta\, (TwoKids_{it}\times Post_t)+\epsilon_{it} \tag{2}$$

How does this model differ conceptually from the regression in (1)? That is, what do the year effects $\vec{\gamma_t}$ capture that the single $Post$ dummy did not? Does this change alter your estimated effect of EITC expansion on employment?

3. Women with children tend to be older than childless women, and older women are also more likely to work. Age is therefore a potential source of bias in this analysis. Add the age of the respondent as a control variable in your regression:

$$Y_{it}=\alpha +\beta\, TwoKids_{it}+\vec{\gamma_t}+\delta\, (TwoKids_{it}\times Post_t)+\phi\, Age_{it}+ \epsilon_{it} \tag{3}$$

Interpret your estimate of $\phi$ in words (talk about statistical and economic significance). In addition, discuss the following: does controlling for age alter the estimated effect of EITC expansion on employment? Are you more (or less) confident about giving our estimate a causal interpretation?

4. Female employment rates also vary widely across states. Add state fixed effects to your model in step 3. Answer the following questions: How do these controls alter the estimated effect of EITC expansion on employment? Do these results strengthen or weaken your confidence in the DD strategy?
5. Create a table that shows the main results from the regressions above. Each column should be a different regression, and the rows should have the main coefficients (not the fixed effects) with their standard errors and asterisks (*) to indicate their statistical significance. Make sure to add footnotes to explain the asterisks and any other important notes. (Hint: use the esttab command.) Present the code and the table.
6. Repeat steps 1-4 using annual earnings as the outcome variable instead of employment.

Present your results from these 8 regressions (4 for earnings, 4 for employment) in a single (professional-looking) table. You only need to include the main DD estimate from each regression. No need to report the coefficients from all of the other year effects, control variables, etc. in your table. If you want to do it all within Stata and esttab, check out appendmodels and the examples of how to use it online. Again, you can just show me an Excel file or a table in Word if you don’t want to use Stata.
Briefly interpret your results from the earnings regressions. What is the main result from the simple DD specification? How does that estimate change as you add the controls in steps 2-4?

7. Creating an event study. Graphing an event study is important for assessing the parallel trends assumption. For the employment outcome, run the model below, which includes year fixed effects, state fixed effects, age, the treatment group variable (two kids), and the interaction of the treatment group variable with every single year dummy. You will notice that instead of one $\delta$, we have several: a $\delta_t$ for every year. This specification is the event study. Pick the year 1992 as the base year.

$$Y_{it}=\alpha +\beta\, TwoKids_{it}+\vec{\gamma_t}+\vec{\delta_t}\, (TwoKids_{it}\times \vec{\gamma_t})+\vec{S_s}+\phi\, Age_{it}+ \epsilon_{it} \tag{4}$$

Present the results of the event study for the outcome “employment” in a graph. This means making a plot where year is on the x-axis and the $\delta_t$ from the regression are on the y-axis, so the y-axis covers the range of the coefficients. Make sure you also plot their confidence intervals. You will notice that because we have dropped the years 1994-1999, there is a “dead” space in the graph. You can fix this by subtracting 6 from the year variable for years 2000 onward. This will produce a continuous graph. Hint: Check out the commands regsave or coefplot.
How are the $\delta_t$ coefficients related to the employment graph that you made above?
Given the results from the graph, what is your assessment of parallel trends?
Write a tweet thread of your main findings so that everyone can understand them. There are lots of things you can talk about: context, findings, size, length of effects, for whom, implications, etc. A tweet is about 280 characters (I think), so use a maximum of 8 tweets. Feel free to include figures or pics!
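For the event-study specification in (4), a minimal Stata sketch could look like the following. It assumes a treatment dummy named twokids (an illustrative name) and uses coefplot, a user-written command (installable with `ssc install coefplot`) that the hint above mentions.

```stata
* Event study: interact the treatment dummy with every year, base year = 1992
* ib1992.year makes 1992 the omitted category; ## adds main effects and interactions
regress employed i.twokids##ib1992.year age i.statefip [pw=asecwth]

* Plot only the interaction coefficients (the delta_t) with 95% confidence intervals
coefplot, keep(1.twokids#*.year) vertical yline(0) ///
    ytitle("Estimated effect relative to 1992") xtitle("Year")
```

Because coefplot spaces coefficients evenly in the order they appear, the gap left by the dropped years may not show up the way it would in a plot against the numeric year variable; the year-recoding suggestion above is aimed at that second approach.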