You can use these questions as a guide to practice for the exam. We have about 35 questions here, plus all the questions from your quizzes, homework, and worksheets. If you need more than that, you also have questions in your books and questions from the discussion section.

A couple of guidelines:

- You can use these questions either as an assessment or as an evaluation.
- If you plan to use these questions as an “assessment,” I recommend that you not study beforehand: take the questions, and then go back and explore the topics in which you feel weaker.
- If you plan to use these questions as "evaluative," I recommend also timing yourself. Since the exam is a time-constrained exercise, it is also good to practice questions under a time constraint.
- We note that some answers are meant to be didactic (teaching moments) rather than answers that get straight to the point.
- Some questions will say, "Show your work," but in the answers, we show only numbers. On the exam you would want to show the process; the number is there so you can check that you used the proper method.
- It is best not to try to memorize every type of question, as this will train you to answer a particular question rather than "any question."

### Job Training Program

Job-training programs are common, which makes evaluating them an important policy question. The following example follows Scott Cunningham's Mixtape book. The National Supported Work Demonstration (NSW) job-training program was operated by the Manpower Demonstration Research Corp (MDRC) in the mid-1970s. The NSW was a temporary employment program designed to help disadvantaged workers lacking basic job skills move into the labor market by giving them work experience and counseling in a sheltered environment. It was also unique in randomly assigning qualified applicants to training positions. The treatment group received all the benefits of the NSW program. The controls were left to fend for themselves. The program admitted women receiving Aid to Families with Dependent Children (AFDC), individuals recovering from addiction, released individuals who had been found guilty of an offense, and men and women who had not completed high school. The MDRC collected earnings and demographic information from the treatment and the control group at baseline and every nine months after that. MDRC also conducted up to four post-baseline interviews. There were different sample sizes from study to study, which can be confusing.

- Write the earnings comparison between people in the program vs. those not in the program in a regression format. Explain how you would code each variable.
- Write out the same comparison in conditional expectation language.
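
One way to write the two answers above, as a sketch: here $Earnings_i$ and $Treat_i$ are illustrative names, with $Treat_i = 1$ if person $i$ was assigned to the NSW program and $0$ otherwise, and $Earnings_i$ measured as post-program earnings in dollars.

$$
Earnings_i = \alpha + \beta\, Treat_i + \varepsilon_i
$$

$$
\beta = E[Earnings_i \mid Treat_i = 1] - E[Earnings_i \mid Treat_i = 0], \qquad \alpha = E[Earnings_i \mid Treat_i = 0]
$$

With a single binary regressor, the OLS slope is exactly the difference in group means, so the two formulations describe the same comparison.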

The good news for MDRC and the treatment group was that the treatment benefited the workers. Treatment group participants’ actual post-treatment earnings in 1978 exceeded the earnings of the control group by approximately $1,600, as estimated by Dehejia and Wahba (2002). In that paper, the authors wanted to explore whether the results from an experimental setting (an RCT) could be replicated using covariates. They used non-experimental data from the Current Population Survey (CPS) and the Panel Study of Income Dynamics (PSID), two publicly available surveys, to construct a comparison group similar to the people in the RCT's treated group. The adjusted models are the models that include covariates. The list of covariates varies by sample.

- Practice reading the results from the table and interpreting each coefficient.
- Practice knowing which estimates are statistically significant and which are not.
- What are the results of the naive comparison?
- How does the naive comparison change when adding controls? What does this tell you about the signs of bias when adding controls?
- The results from the multivariate regressions are different from those with an RCT. What could be explaining this difference?
- Starting with the model with covariates that don't come from the RCT, write down one regression that will allow us to explore how the effect differs between females and males.
- The main effect in the model with covariates and using the PSID is $731. If we ran the same regression but only using males, the program's effect would be $800, and the program's effect when only using females would be $200. We also know that males' average earnings before the program are $600, and females earn 85 cents on the male dollar. Use these numbers to provide numerical values for the coefficients in the regression you proposed. Then, tell us if this program reduces or increases the earnings gap, and if so, by how much.
- For this exercise, let's use the results from the second column. Let's assume the RCT was run using a representative sample of Virginia. The earnings for the control group were $2,000. If we were to use these results for the rest of the U.S., what would be the average salary of people in the program if the average salary for the target group of the rest of the U.S. is around $1,300?

$\beta_1 = 800$ and $\beta_3 = -600$; $\alpha_0 = 600$ and $\beta_2 = -90$. In the second part, without the program, we know that females make 85% of what males make. After the program, males make $1,400 while females make $710. This means that females now make about 50% of what males make, widening the gap by about 35 percentage points. In relative terms, the female-to-male earnings ratio falls by about 41% (35/85).
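
The numbers in this answer can be traced step by step. The sketch below assumes the interaction specification that the coefficient labels suggest, with $Treat_i$ and $Female_i$ as illustrative dummy names (the regression you propose may name or order the terms differently).

$$
Earnings_i = \alpha_0 + \beta_1 Treat_i + \beta_2 Female_i + \beta_3 (Treat_i \times Female_i) + \varepsilon_i
$$

$$
\begin{aligned}
\alpha_0 &= 600 && \text{(untreated males)} \\
\beta_2 &= 0.85 \times 600 - 600 = -90 && \text{(untreated females earn 510)} \\
\beta_1 &= 800 && \text{(effect for males)} \\
\beta_3 &= 200 - 800 = -600 && \text{(female effect minus male effect)} \\
\text{Treated males} &= 600 + 800 = 1400 \\
\text{Treated females} &= 600 - 90 + 800 - 600 = 710 \\
\text{Ratio after} &= 710 / 1400 \approx 0.51
\end{aligned}
$$

The female-to-male ratio moves from 0.85 to roughly 0.5, which is the change of about 35 percentage points; $0.35 / 0.85 \approx 0.41$ expresses the same change in relative terms.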

### CEOs and Salary

In some economic frameworks, wages represent productivity. CEOs of medium and big businesses have their wages tied to the profits in a given year. We want to explore this relationship. Suppose you estimate a simple regression of CEO salary (in $1,000s) on firm profits (in $1,000,000s) and obtain the following result:

- The average profits in this sample are $200 million. Given this information, can you determine the average annual salary for a CEO in this sample? Show your work.
- Given the estimated model, what do you predict would be the salary for a CEO whose company breaks even? How much is the salary expected to increase for every additional million in profits? What would happen if the company lost a million dollars? Show your work.
- Does this simple regression necessarily capture the causal relationship between a firm's profits and its CEO's salary? Explain your answer.
- Suppose the "true model" includes firm size. Since we have omitted company size in the model above, how is the coefficient on profits likely to be biased? Explain your answer.
- Unfortunately, a lot of the CEO salary data is missing, and we are wondering how the missing data will bias our estimates. Formulate a situation in which this missing data would imply that our current regression understates the relationship between profits and salary (i.e., our estimates are more negative because we don't include the missing data).
- Continuing the missing-data example, imagine there is no pattern in the missing data, or we have no reason to believe there is one. Will this still bias our results?
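
For the omitted-variable question, the standard omitted variable bias formula gives a way to sign the bias. This is a sketch with illustrative symbols ($Size_i$ for firm size, $\delta_1$ for the slope from regressing size on profits); none of them come from the estimated model above.

$$
\begin{aligned}
\text{True model:}\quad & Salary_i = \beta_0 + \beta_1 Profits_i + \beta_2 Size_i + u_i \\
\text{Auxiliary regression:}\quad & Size_i = \delta_0 + \delta_1 Profits_i + v_i \\
\text{Short regression slope:}\quad & E[\tilde{\beta}_1] = \beta_1 + \beta_2 \delta_1
\end{aligned}
$$

If larger firms pay their CEOs more ($\beta_2 > 0$) and earn higher profits ($\delta_1 > 0$), the product $\beta_2 \delta_1$ is positive and the coefficient on profits is biased upward. If exactly one of the two signs were negative, the bias would be downward.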

### An example from an APP

Last year, a Batten student had an Applied Policy Project (APP) about malnutrition in Guatemala. Her client has told her that malnutrition is prevalent in poor rural communities and indigenous communities. The student thinks that there may be discrimination against indigenous communities from the government, but her client thinks she is correlating poverty with being indigenous. She is given data that has information at the “county” level. It contains information on whether or not a given county has received government aid, the percentage of people living in poverty in that county, and whether that county is considered an Indigenous community or not (which takes the value of 1 if at least 60% of the population comes from an indigenous background and 0 if not). To disentangle the effect of both and test her hypothesis, the student first runs this regression:

- Practice interpreting each coefficient.
- Using the estimates above, what is the *marginal effect* of being indigenous for a county with a 10% poverty level?
- Using the results from this model, she estimates the predicted probability for two counties with the same poverty level (10%), one county considered indigenous and the other not. She finds that the model predicts that indigenous counties have a *higher likelihood* of receiving government aid than non-indigenous counties. Does this finding contradict her hypothesis that indigenous communities are discriminated against?
- She now realizes that there are non-linear returns to poverty, and she estimates the following equation: $Pr(Receiving\ Government\ Aid)=\gamma_{0}+\gamma_{1}Poverty\ Level_{c}+\gamma_{2}Poverty\ Level_{c}^{2}+\gamma_{3}Indigenous\ Community_{c}+\epsilon_{c}$

1) A *one percentage point* increase in the poverty level is associated with a 0.75 percentage point increase in the likelihood of receiving aid. Another interpretation: 0.75 percentage points is poverty's marginal effect on the likelihood that a non-indigenous community receives aid. The confusing part here is the 0.75 interpretation. Shouldn't it be 75 percentage points? Because Y is a binary variable, the coefficient is read as a percentage point change. Notice, however, what a unit change in the poverty level is: the variable goes from 0 to 1 continuously, not as a binary switch, so going from 0 to 1 means going from 0% poverty to 100% poverty. Therefore, a one percentage point change is not a change from 0 to 1 but a change from 0 to 0.01, which gives $0.01 \times 0.75 = 0.0075$. To express this in percentage points, we multiply by 100, so $0.0075 \times 100 = 0.75$, and we end up with the interpretation above. This exercise highlights the importance of thinking clearly about what a one-unit change in X is and what a one-unit change in Y is.

2) Indigenous communities with zero poverty are 35 percentage points more likely to receive aid than non-indigenous communities.

3) Compared to non-indigenous communities, indigenous communities are 1.2 percentage points less likely to receive aid per percentage point increase in poverty.
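
Assuming the first regression is the interaction model implied by interpretations 1) to 3), with poverty measured from 0 to 1 and coefficients of 0.75 on poverty, 0.35 on the indigenous dummy, and -1.2 on their interaction (the symbols $\beta_2$ and $\beta_3$ are illustrative labels), the marginal effect of being indigenous at a 10% poverty level can be worked out as follows.

$$
\begin{aligned}
\frac{\Delta Pr(Aid)}{\Delta Indigenous} &= \beta_2 + \beta_3 \times Poverty \\
&= 0.35 + (-1.2)(0.10) \\
&= 0.23
\end{aligned}
$$

Under these assumptions, an indigenous county with 10% poverty is 23 percentage points more likely to receive aid than a non-indigenous county with the same poverty level. The same logic gives poverty's slope for indigenous counties as $0.75 - 1.2 = -0.45$, so the indigenous advantage shrinks as poverty rises and disappears at $Poverty = 0.35 / 1.2 \approx 0.29$.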

```
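* Load the example auto dataset that ships with Stata
clear
sysuse auto
set seed 12345
* scale_1 is uniform on (0,1); scale_100 is the same variable on a 0-100 scale
generate scale_1 = runiform()
summarize scale_1
generate scale_100 = scale_1*100
* The slope on scale_100 equals the slope on scale_1 divided by 100
regress price scale_1
regress price scale_100
* Notice how a one-unit change in scale_1 is really a 100-unit change in scale_100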
```
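
The rescaling in the code above can also be seen algebraically. As a sketch, let $a$ and $b$ be the intercept and slope from the regression on `scale_1` (labels used only here):

$$
price = a + b \cdot scale\_1 = a + b \cdot \frac{scale\_100}{100} = a + \frac{b}{100} \cdot scale\_100
$$

So the coefficient reported by the second regression should be the first one divided by 100, while the fitted values, $R^2$, and t-statistics are unchanged.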

Where $\gamma_0=3.99$, $\gamma_1=0.12$, $\gamma_2=-0.01$, and $\gamma_3=-0.3$. At what poverty level do the predicted probabilities reach a maximum? Explain how the effects of poverty on the likelihood of receiving aid change across poverty levels. For this problem, consider the poverty level ranging from 0 to 100 instead of 0-1. Another way of asking the same question is to “provide a full interpretation of the relationship between the poverty level and government aid.” Notice that the second version of the question is more general and does not hint at precisely what you should do. Still, hopefully, you recognize that seeing a squared term should mean a more careful interpretation.

Since $\gamma_1>0$ and $\gamma_2<0$, the likelihood of receiving government aid first rises with the poverty level and then falls. Up to a poverty level of 6%, each additional percentage point of poverty increases the likelihood of receiving aid; once the poverty level passes 6%, each additional percentage point decreases it.
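
The turning point comes from setting the derivative of the quadratic to zero. A worked version, with poverty measured from 0 to 100 as the question asks:

$$
\begin{aligned}
\frac{\partial Pr}{\partial Poverty} &= \gamma_1 + 2\gamma_2\, Poverty = 0.12 - 0.02\, Poverty \\
Poverty^{*} &= -\frac{\gamma_1}{2\gamma_2} = \frac{0.12}{0.02} = 6 \\
\text{At } Poverty = 0:&\quad 0.12 \\
\text{At } Poverty = 6:&\quad 0.12 - 0.02(6) = 0 \\
\text{At } Poverty = 20:&\quad 0.12 - 0.02(20) = -0.28
\end{aligned}
$$

The marginal effect is positive below 6, zero at 6, and increasingly negative above it.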

### Three times the charm

A researcher is interested in the effect of having a third child on a woman’s wages (where the data set contains women with at least two children). She wants to estimate the following model:

Where wages are log hourly wages, $thirdkid$ is a dummy=1 if the woman has a third child, and the education and experience variables are defined in years.

- The researcher decides to use “$SameSex$” as an instrument for “$thirdkid$,” where “$SameSex$” is a dummy=1 if the first two children are of the same sex and is equal to zero if they are of the opposite sex. First, why might the researcher want to use an instrument for “$thirdkid$”?
- Do you think the variable “$SameSex$” meets the requirements for an instrument? Be sure to address each of the requirements for instrumental variables.
- Does the instrument affect X (first stage)? This is plausible. We have seen evidence that having two same-sex kids makes certain people more likely to have a third kid. More importantly, this is testable.
- Is the instrument randomly assigned? Conditional on having two kids, whether a family has two boys or two girls versus one of each could be considered random. One could argue that in certain places families could “choose” this based on sex preferences before birth (i.e., making birthing decisions based on gender) and that families with more resources could act on these preferences at a higher rate than families with fewer resources.
- Can the instrument affect Y through another mechanism that is not X? It is hard to come up with another path through which the sex mix could affect wages other than the number of children. One argument could be that the gender composition of the siblings directly affects educational investment because of the parents' gender preferences.
- Monotonicity: Does the instrument only push people in one direction? This is credible; it's hard to think that having two same-sex children makes you less likely to have a third child (relative to a non-same-sex pair). It's possible but less likely.
- Write down the equation the researcher will estimate as the first stage using 2SLS.
- Write down the equation the researcher will estimate as the second step. Which parameter tells you the effect of a third child on wages?
- Write down the equation to estimate if they were to use the reduced form.
- Who are the never-takers in this example?
- Imagine you find a table with the following results.
- Which of these estimates is the IV estimate? What is the interpretation?
- What are columns 1, 2, and 3 representing, respectively?
- What is the value of the coefficient on same-sex in the third column? Provide an interpretation.

Note that it is important to have the controls.

Where $\widehat{ThirdKid}$ comes from the predicted values of the first stage (which is important to note). $\beta_1$ would recover the parameter of interest. Note that it is important to include the controls and to specify what $\widehat{ThirdKid}$ is; it is not enough to say it is just a “hat.”
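
Putting the three pieces together, here is a sketch of the equations for this problem. It assumes the controls are exactly education and experience, as in the variable definitions above; $\pi$, $\theta$, and the error terms are illustrative symbols.

$$
\begin{aligned}
\text{First stage:}\quad & ThirdKid_i = \pi_0 + \pi_1 SameSex_i + \pi_2 Educ_i + \pi_3 Exper_i + v_i \\
\text{Second stage:}\quad & \ln(Wage_i) = \beta_0 + \beta_1 \widehat{ThirdKid}_i + \beta_2 Educ_i + \beta_3 Exper_i + \varepsilon_i \\
\text{Reduced form:}\quad & \ln(Wage_i) = \theta_0 + \theta_1 SameSex_i + \theta_2 Educ_i + \theta_3 Exper_i + \eta_i
\end{aligned}
$$

With one instrument and one endogenous variable, the 2SLS estimate equals the ratio of the reduced form to the first stage, $\hat{\beta}_1 = \hat{\theta}_1 / \hat{\pi}_1$.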

### Get out the vote

To better understand the effects of “Get-out-the-vote” messages on voter turnout, Gerber and Green (2005) conducted an RCT involving approximately 30,000 individuals in New Haven, CT, in 1998. One of the treatments was randomly assigned in-person visits in which a volunteer visited the person's home and encouraged him or her to vote. Table 3 reflects the findings from the RCT. Before answering the questions, think about what the instrument (Z), the main explanatory variable (D), and the main outcome (Y) would be.

- What is the estimate of the first stage (Effect of Z on D)? Show your calculation.
- What is the estimate of the reduced form (Effect of Z on Y)? Show your calculation.
- What is the IV estimate of the effect of *in-person contact* on voting?
- Provide an interpretation of the IV estimate.
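
One way to set up the calculation, as a sketch: take $Z$ as random assignment to an in-person visit, $D$ as actually being contacted in person, and $Y$ as voting. The numbers come from Table 3; only the formulas are shown here.

$$
\begin{aligned}
\text{First stage:}\quad & E[D \mid Z=1] - E[D \mid Z=0] \\
\text{Reduced form:}\quad & E[Y \mid Z=1] - E[Y \mid Z=0] \\
\text{IV (Wald) estimate:}\quad & \frac{E[Y \mid Z=1] - E[Y \mid Z=0]}{E[D \mid Z=1] - E[D \mid Z=0]}
\end{aligned}
$$

The ratio rescales the effect of being assigned a visit by the share of people whose contact status was changed by the assignment.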

### Linearities

A hypothesis exists that people with high BMI (Body Mass Index), and potentially those with low BMI as well, face discrimination. Hence, it is important to understand the relationship between BMI and wages. However, it’s unclear if the relationship is causal: if we see lower wages for people with high BMI, is it because having a high BMI is correlated with characteristics that would lead to lower wages, or because they are discriminated against for having a high BMI? This graph tries to answer that question by separating the effects of BMI on wages for jobs requiring high and low levels of social interaction. The idea is that if discrimination doesn’t contribute to the decreases in wages, then we should not see any difference between these two lines, but if it does, then we would see a difference. The following graph presents the relationship between BMI and wages across high- and low-social jobs. Wages are measured as Ln(wages), which we have not seen yet, so for this exercise, you can treat them as just hourly wages.

- Write out one regression that would allow you to draw the graph for the line of low-social jobs.
- Write out one regression that would allow you to draw the graph for the line of high-social jobs.
- What would be the sign of the coefficient on the square term in both regressions proposed?
- Write out one regression that would allow you to draw both lines from the graph.
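
A sketch of specifications that would trace out curved lines like those described, using illustrative names ($HighSocial_i = 1$ for jobs requiring high social interaction). The signs of the squared terms depend on the shape of the graph and are not asserted here.

$$
\begin{aligned}
\text{Each job type separately:}\quad & Wage_i = \alpha_0 + \alpha_1 BMI_i + \alpha_2 BMI_i^2 + \varepsilon_i \\
\text{Both lines at once:}\quad & Wage_i = \beta_0 + \beta_1 BMI_i + \beta_2 BMI_i^2 + \beta_3 HighSocial_i \\
& \quad + \beta_4 (HighSocial_i \times BMI_i) + \beta_5 (HighSocial_i \times BMI_i^2) + \varepsilon_i
\end{aligned}
$$

The first equation would be estimated on the low-social sample and on the high-social sample in turn. In the pooled version, the low-social line uses $\beta_0$, $\beta_1$, and $\beta_2$, and the high-social line uses $\beta_0+\beta_3$, $\beta_1+\beta_4$, and $\beta_2+\beta_5$.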

### Practice multiple choice in quizzes

Recall that you can re-take the quizzes for more practice or work on examples from the worksheets.