🎒

Practice Problems for Exam 2 (2026)

The following are practice questions for the upcoming exam; we’ve written a hefty 75 or so. Keep in mind that some of the answers are didactic (written to teach) rather than exactly what we would want in an exam answer. Beyond the questions here, you can always retake the quizzes or go over the worksheets and turn them into questions; the worksheet has several questions to help you think about regression in different ways. If you need more, go over the homeworks, the exercises we did in lecture, and the ones on the slides. You can also create new exercises by doing things “backward” or in different directions. Finally, you can find even more exercises at the end of each chapter of the book Real Stats. And if that’s not quite enough, the internet is your oyster!

A warm-up is a great way to start, so you can redo the practice problems for midterm 1 and then do midterm 1 again!
Recall that these exercises do not cover every concept we’ve seen in class, or every type of question we’ve seen or will see.

Some broad tips on using these tools

Doing a bunch of questions can be useful, but it won’t help much if you are not training your brain to think carefully. A better way to use tools like this is to first write answers to all the questions without looking at the right answers. Second, discuss your answers with a peer; finding someone who disagrees with your answer is particularly helpful. Discussing how to approach a question (without looking at the right answer) is a useful exercise that helps things “click.” Finally, have someone grade you and give you a score without telling you what you got right or wrong. Then redo the exercise and have them grade it again until you get 100%. In short, the ideal case is one where you never look at the right answer.

Notice that when doing any work assignment, especially in a non-school setting, one doesn’t know the “right answer.” You only know how you think you would approach it, and that’s exactly what we are after. So overall, the key is to see the “right answer” as the last thing you do.

When you have a question that has multiple-choice options (like in the quizzes), go through each option and think about why it’s right or wrong or what you could change it to to make it right.
“Trivial mistakes.” Sometimes we look at an answer, realize we made a mistake, and categorize it as a “silly mistake.” Sometimes this makes sense, but we must be careful about labeling something a “silly mistake” and then doing nothing about it. If the airline industry were comfortable with “silly mistakes,” we’d be in a pickle. Pilots instead rely on checklists to push the likelihood of a silly mistake as close to zero as possible. What does that mean for RMDA? Say your silly mistake is “wrong units”; then add a step to your process, maybe at the end: “check what units the final answer should be in.” The takeaway from this tip should be: “How could I change my process to guarantee that this doesn’t happen again?”
Change the scenarios: You can create more questions out of these questions. For example, change some numbers and re-do problems. Maybe change the Y to other units; how do the results change? Ask yourself: What other questions could we ask given this setting? etc. The practice of making your brain think about other potential questions is the “studying” itself.
You can use these questions either as an “assessment” or as “evaluative” practice. Note that some answers are meant to be didactic (teaching moments) rather than answers that get straight to the point. Some questions will say “Show your work,” but the answers show only numbers. On the exam you would want to show the process; the numbers here are for checking that you are using the right method.

If you plan to use these questions as an “assessment,” I recommend you not study, take these questions, and then go back to studying those topics in which you feel weaker.
If you plan to use these questions as “evaluative,” I recommend timing yourself. Since the exam is a time-constrained exercise, it’s also good to practice questions with a time constraint.

Practice questions

Let’s make sure your regression interpretation is not rusty. Let’s work through these questions (some that you may have seen before) and try to get 100% before moving to the more conceptual questions.

For the following questions, refer to the following equation and its respective graph

Y_{i}=\alpha_0+\alpha_1Treatment_{i}+\epsilon_{i}

What’s the value of $\hat\alpha_0 \ and\ \hat\alpha_1$ ?

‣

Answer

✅

\hat\alpha_0=0\ \ and\ \ \hat\alpha_1=2

What’s the value of $\hat\alpha_0 \ and\ \hat\alpha_1$ ?

‣

Answer

✅

\hat\alpha_0=4 \ and\ \hat\alpha_1=-10

When do countries tax wealth? Taxes are a big deal. They affect how people allocate their time, how much money the government has, etc. Inheritance taxes are a particularly interesting tax policy because of the clear potential for conflict between rich and poor. Scheve and Stasavage (2012) investigated the sources of inheritance taxes by looking at tax policy and other characteristics of 19 countries for which data are available from 1816 to 2000. The data are measured every five years. Specifically, the researchers looked at the relationship between inheritance taxes and who was allowed to vote. To assess whether expanded suffrage led to increases or decreases in inheritance taxes, we can begin with the following model:

Inheritance\ tax_{it}=\beta_0+\beta_1Expanded\ Suffrage_{i,t-1}+\epsilon_{it}

The dependent variable is the top inheritance tax rate, which is measured as a percent (0-100), and the independent variable is a dummy variable for whether all men were eligible to vote in at least half of the previous years.

What does $\beta_0$ represent?

‣

Answer

What does $\beta_1$ represent?

‣

Answer

What’s the average difference in inheritance tax between expanded and not expanded suffrage?

‣

Answer

✅

\beta_1

Let’s say you get:

Inheritance\ tax_{it}=4.75+19.33Expanded\ Suffrage_{i,t-1}

What’s the average inheritance tax for countries without expanded suffrage? Recall that inheritance tax is measured in percent, from 0 to 100.

‣

Answer

What’s the average inheritance tax for countries with expanded suffrage?

‣

Answer

What’s the average difference in inheritance tax between expanded and not expanded suffrage?

‣

Answer

3. For the following questions refer to the following table. The outcome is inheritance tax rate.

What’s the marginal effect of having universal suffrage on inheritance tax rate in column (d)?

‣

Answer

Write column (d) as an equation with numbers.

‣

Answer

✅

InheritanceTax=-521.63+0.69UniversalSuffrage+0.28Year+11.76War+2.19Europe+18.71Asia+7.84NorthAmerica + \epsilon

In an effort to better understand the effects of “Get-out-the-vote” messages on voter turnout, Gerber and Green (2005) conducted an RCT involving approximately 30,000 individuals in New Haven, CT, in 1998. One of the treatments was randomly assigned in person visits in which a volunteer visited the person's home and encouraged him or her to vote. Table 3 reflects the findings from the RCT.

What’s the marginal effect of being assigned to in-person contact on voting?

‣

Answer

Use the figure below to answer: What’s the sign of $\beta_2$ in the following equation? $Y=\beta_0+\beta_1X+\beta_2X^2$

‣

Answer

✅

It’s negative. The curve has a local maximum, so the second derivative must be negative: the slope (the first derivative) is decreasing as X increases. To see this, pick a value of X (say X=25) and look at the slope there; in this case it is positive. Now pick a higher value (say X=75) and look at the slope there; it is negative. The slope went from positive to negative, so the first derivative is decreasing, which means the second derivative is negative. Since the second derivative is $2\beta_2$, it has the same sign as $\beta_2$, so $\beta_2$ is negative.
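The slope argument can also be written as a short derivation. This is a sketch of the calculus behind the answer; $X^{*}$ (the turning point) is notation introduced here for illustration and does not appear in the question.

```latex
\begin{aligned}
Y &= \beta_0 + \beta_1 X + \beta_2 X^2 \\
\frac{dY}{dX} &= \beta_1 + 2\beta_2 X \\
\frac{d^2Y}{dX^2} &= 2\beta_2 \\
\frac{dY}{dX} = 0 \;&\Rightarrow\; X^{*} = -\frac{\beta_1}{2\beta_2}
\end{aligned}
```

A local maximum at $X^{*}$ requires $2\beta_2<0$, which is the same condition as $\beta_2<0$.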

Energy efficiency promises a double whammy of benefits: if we reduce the amount of energy we use, we can both save the world and save money. What’s not to love? In this exercise we’ll dig into how to explore this relationship. The technological innovation is a programmable thermostat, a device that allows the user to preset temperatures at energy-efficient levels. Another important variable is HDD, “heating degree-days,” which measures how cold it was in the month (it is the number of degrees by which a day’s average temperature is below 65 degrees Fahrenheit). Usually the relationship between HDD and energy use (measured in therms) is positive: the colder it gets, the more energy people use for heating. We have data on houses that use the thermostat and houses that don’t. The results from an OLS analysis are below; the outcome variable for all of these regressions is “Therms.” A therm costs $1.59, and the thermostat costs $60.

Using Model (a), What’s the main conclusion?

‣

Answer

Does your main conclusion change when accounting for HDD?

‣

Answer

Using results from model (b), how much money are houses that use the thermostat saving? According to this model, is the thermostat worth it?

‣

Answer

Does it make sense that the programmable thermostat should save $30 in the middle of the summer? This indicates that the cost-savings depend on the weather outside. It makes more sense to think about the effects of the thermostat with respect to temperature outside. Therefore we focus on model (c). What’s the interpretation of the number -0.48?

‣

Answer

Using Model (c), what is the effect of the thermostat when HDD is 500?

‣

Answer

✅

The effect of the thermostat is

\hat\beta_1+\hat\beta_3\times500=-0.48-0.062\times500=-31.48

This means that the thermostat helps reduce use by 31.48 therms, lowering the bill by about $50.05 (31.48 therms at $1.59 per therm).
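The same arithmetic can be repeated for any HDD value. The sketch below uses only the numbers from the answer above (-0.48, -0.062, and $1.59 per therm); the function names and the extra HDD values in the loop are illustrative choices, not part of the original problem.

```python
# Coefficients from the Model (c) answer above.
BETA_THERMOSTAT = -0.48     # coefficient on the thermostat dummy
BETA_INTERACTION = -0.062   # coefficient on thermostat x HDD
PRICE_PER_THERM = 1.59      # dollars per therm, from the problem statement


def thermostat_effect(hdd):
    """Change in therms associated with the thermostat at a given HDD."""
    return BETA_THERMOSTAT + BETA_INTERACTION * hdd


def bill_change(hdd):
    """Dollar change in the bill implied by the change in therms."""
    return thermostat_effect(hdd) * PRICE_PER_THERM


# HDD = 500 is the case in the question; 0 and 250 are hypothetical extras.
for hdd in (0, 250, 500):
    print(hdd, round(thermostat_effect(hdd), 2), round(bill_change(hdd), 2))
```

Because the effect depends on HDD, the savings shrink toward the intercept term as the weather gets warmer.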

Using Model (c), what’s the average therm use for houses that don’t have a thermostat in particularly hot months?

‣

Answer

Enough warm-up let’s get it: Research Design Questions

Get out the vote

To better understand the effects of “Get-out-the-vote” messages on voter turnout, Gerber and Green (2005) conducted an RCT involving approximately 30,000 individuals in New Haven, CT, in 1998. One of the treatments was randomly assigned in-person visits in which a volunteer visited the person's home and encouraged him or her to vote. Table 3 reflects the findings from the RCT. Before answering the questions, think about what the instrument (Z), the main explanatory variable (D), and the main outcome (Y) would be.

What is the estimate of the first stage (Effect of Z on D)? Show your calculation

‣

Answer

What is the estimate of the reduced form (Effect of Z on Y)? Show your calculation

‣

Answer

What is the IV estimate of the effect of in-person contact on voting?

‣

Answer

✅

In order to obtain the IV estimate, we need $\frac{ITT}{FS}$. In this case, the ITT is 0.02 and the first stage is 0.25, so the IV estimate is $\frac{0.02}{0.25}=0.08$.
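The ratio can be checked with a few lines of code. This is a minimal sketch of the Wald (ITT over first stage) calculation using the two numbers from the answer above; the function name `wald_iv` is made up for illustration.

```python
def wald_iv(itt, first_stage):
    """IV (Wald) estimate: reduced form (ITT) divided by the first stage."""
    if first_stage == 0:
        # With a zero first stage the instrument does not move take-up,
        # so the ratio is undefined.
        raise ValueError("first stage is zero; the IV estimate is undefined")
    return itt / first_stage


iv_estimate = wald_iv(itt=0.02, first_stage=0.25)
print(round(iv_estimate, 2))
```

A small first stage in the denominator scales up the ITT, which is why the IV estimate is larger than the ITT whenever take-up is incomplete.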

Provide an interpretation of the IV estimate.

‣

Answer

Three times the charm

A researcher is interested in the effect of having a third child on a woman’s wages (where the data set contains women with at least two children). She wants to estimate the following model:

log(wage)=\beta_{0}+\beta_{1}ThirdKid+\beta_{2}Educ+\beta_{3}Exper+\beta_{4}Exper^{2}+\epsilon

Where wages are log hourly wages, $thirdkid$ is a dummy=1 if the woman has a third child, and the education and experience variables are defined in years.

The researcher decides to use “ $SameSex$ ” as an instrument for “ $thirdkid$ ,” where “ $SameSex$ ” is a dummy=1 if the first two children are of the same sex and is equal to zero if they are of opposite sexes. First, why might the researcher want to use an instrument for “ $thirdkid$ ”?

‣

Answer

Do you think the variable “ $SameSex$ ” meets the requirements for an instrument? Be sure to address each of the requirements for instrumental variables.

‣

Answer

Write down the equation the researcher will estimate as the first stage using 2SLS.

‣

Answer

✅

ThirdKid=\alpha_{0}+\alpha_{1}SameSex+\alpha_{2}Educ+\alpha_{3}Exper+\alpha_{4}Exper^{2}

Note that it is important to include the controls.

Write down the equation the researcher will estimate as the second step. Which parameter tells you the effect of a third child on wages?

‣

Answer

log(wage)=\beta_{0}+\beta_{1}\widehat{ThirdKid}+\beta_{2}Educ+\beta_{3}Exper+\beta_{4}Exper^{2}+\epsilon

Where $\widehat{ThirdKid}$ is the predicted value from the first stage (which is important to note). $\beta_1$ recovers the parameter of interest. Note that it is important to include the controls and to state what $\widehat{ThirdKid}$ is; it is not enough to say it is just a “hat.”

Write down the equation to estimate if they were to use the reduced form.

‣

Answer

✅

log(wage)=\beta_{0}+\beta_{1}SameSex+\beta_{2}Educ+\beta_{3}Exper+\beta_{4}Exper^{2}+\epsilon

Who are the never-takers in this example?

‣

Answer

Imagine you find a table with the following results.

Which of these estimates is the IV estimate? What is the interpretation?

‣

Answer

What are columns 1,2 and 3 representing, respectively?

‣

Answer

What is the value of the coefficient on same-sex in the third column? Provide an interpretation

‣

Answer

NICU Babies

Suppose infants with birthweights below 1500 grams are classified as “very low birthweight” and are therefore automatically eligible for a stay in the neonatal intensive care unit (NICU) under most insurance plans.

Explain intuitively how you would use this fact to estimate the effect of NICU visits on infant health outcomes.

‣

Answer

We want to know the effect of being sent to the NICU on 1-year infant mortality. Do you think this is sharp or fuzzy regression discontinuity? What is the “running variable”?

‣

Answer

What do we need to assume about the 1500-gram cutoff to get credible identification of the effect of NICU stays?

‣

Answer

How would you explore the discontinuity in a regression? Write the equation.

‣

Answer

✅

There are a couple of possible answers:

Y_{i}=\alpha+\delta D_{i}+\gamma(Birthweight-1500)+\epsilon_{i}

Y_{i}=\alpha+\delta D_{i}+\gamma(Birthweight-1500)+\beta_{1}(D_{i}\times(Birthweight-1500))+\epsilon_{i}

Where $Y_i$ is the outcome of interest and D is a binary variable taking the value of 1 if the birthweight is below 1500 grams. However, because this is a fuzzy RD, you would have to estimate a 2SLS or a reduced form if you wanted to explore the effect of the NICU on 1-year mortality.
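To make the coding of the regressors concrete, here is a minimal sketch of how D, the centered running variable, and their interaction could be built for one infant. The function name and the example birthweights are hypothetical. The problem statement says eligibility applies "below 1500 grams," so the sketch treats exactly 1500 as not eligible; that boundary choice is an assumption.

```python
CUTOFF = 1500  # grams, from the problem statement


def rd_regressors(birthweight):
    """Return (D, centered birthweight, D x centered birthweight)."""
    # Assumption: D = 1 strictly below the cutoff, following the
    # "below 1500 grams" wording of the problem statement.
    d = 1 if birthweight < CUTOFF else 0
    centered = birthweight - CUTOFF   # distance from the cutoff, in grams
    return d, centered, d * centered


# Hypothetical birthweights on each side of the cutoff.
for bw in (1400, 1490, 1600):
    print(bw, rd_regressors(bw))
```

Centering at the cutoff is what lets the coefficient on D be read as the jump at exactly 1500 grams, and the interaction term is what allows the slope to differ on each side.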

Mention a particular robustness check you would suggest performing.

‣

Answer

Let’s say we are concerned that hospital staff are coding some babies that are above the cutoff (e.g., 1550, 1600) as under the cutoff (e.g., 1450, 1490). How would this bias the coefficients? Let’s say we do the analysis and ignore these potential sources of bias. Would the effect we estimate be an upper or lower bound?

‣

Answer

Hope Scholarship

Students who graduate from Georgia high schools with GPAs of 3.0 or higher are eligible for the state’s HOPE scholarship. HOPE scholarships provide tuition support for students to enroll at public or private colleges in Georgia. The program aims to increase college enrollment overall and encourage strong students to stay in their home state.

Describe how you would evaluate the effects of HOPE eligibility on enrollment at Georgia colleges using a regression discontinuity (RD) strategy. Specify the treatment group, the control group, and any assumptions required for this strategy to capture the causal effect of being eligible for HOPE on college-going. Specify what type of relationship the running variable has concerning the outcome.

‣

Answer

Create a hypothesis of how the HOPE scholarship affects enrollment at Georgia colleges. What about how it would impact “college enrollment” overall?

‣

Answer

Draw some graphs that are consistent with your story and that would represent the RDs. Be sure to label any important features of your picture (axes, legend, etc.) How closely your picture matches your story is more important than which story you believe to be true.

‣

Answer

Write out a regression that is consistent with this story. Write down how you would code each variable. Practice seeing how these coefficients map to your graph.

‣

Answer

✅

GACollegeAttendance= \alpha_0+ \beta D_i+ \gamma(GPA-3.00)+ \epsilon

where $\beta=0.04$ (approximately, given the way it was drawn). D is a binary variable equal to 1 if the student's GPA is 3.0 or higher and 0 if it is below 3.0, and GPA-3.00 is the distance between the student's GPA and the threshold. We do not include an interaction between GPA-3.00 and D because we drew the slopes as very similar.

Do you think the HOPE scholarship is well suited for an RD study? Would you offer any caveats about using the HOPE eligibility threshold for an RD analysis?

‣

Answer

Imagine you run an RD regression with the student's age as the outcome variable. You find a jump around the threshold. Would this finding make you more or less confident about your results?

‣

Answer

Someone is concerned that you haven’t controlled for the students' race in the proposed model, so your estimate is biased. What conditions would need to be true for a student’s race to be an issue of concern?

‣

Answer

Texas accountability program

In 1996, Texas adopted a new school accountability program to help with student performance, while the states bordering Texas did not adopt such a program. With standardized test scores (Score) in 1995 and 1997 for a large sample of 4th graders in Texas and the bordering states, we could run the regression:

Score_{st}=\beta_{0}+\beta_{1}Texas_{s}+\beta_{2}(Texas_{s}\times D97_{t})+\beta_{3}D97_{t}+\epsilon_{st}

Here, Texas=1 if the observation was drawn from a Texas school (=0 otherwise), and D97=1 if the observation was from 1997 (=0 otherwise).

Interpret what each coefficient is capturing.

‣

Answer

✅

$\beta_0$ represents the average score in 1995 in the states bordering Texas.

$\beta_1$ represents the difference in scores between Texas and the bordering states in 1995 (before the reform).

$\beta_2$ is the difference in scores between Texas and the bordering states after the reform minus the difference in scores between Texas and the bordering states before the reform. This coefficient gives us the DD estimate.

$\beta_3$ represents the change in scores in the bordering states between 1995 and 1997.

Note: notice that we are not using general language like “pre-period” or “control group.”

Someone would like to know what would have happened in Texas had we not implemented the policy. How would you obtain this from the regression?

‣

Answer

✅

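This is obtained by $\beta_0+\beta_1+\beta_3$: the Texas average in 1995 plus the change observed in the bordering states between 1995 and 1997.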
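One way to see where this expression comes from is to write out the Texas averages implied by the regression. The block below is a sketch under the usual DD assumption that, without the policy, Texas would have followed the same change as the bordering states; the superscript label "no policy" is notation introduced here.

```latex
\begin{aligned}
\text{Texas, 1995:}\quad & \beta_0+\beta_1 \\
\text{Texas, 1997 (observed):}\quad & \beta_0+\beta_1+\beta_2+\beta_3 \\
\text{Texas, 1997}^{\text{no policy}}:\quad & (\beta_0+\beta_1)+\beta_3 \\
\text{Observed} - \text{no policy}:\quad & \beta_2
\end{aligned}
```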

Which of the following provides the best estimate of the causal effect of the policy? $\beta_{0},\ \beta_{1},\ \beta_{2}\ or\ \beta_{3}$ ?

‣

Answer

✅

$\beta_{2}$ provides the causal effect of the policy, provided the assumptions of DD are met.

d. Imagine that $\beta_{0}=10,\beta_{1}=2, \beta_{2}=4,\beta_{3}=0$ . Fill in the following table:

	Texas	Border States
1995
1997

‣

Answer

e. What would have happened to the treatment group had they not received treatment?

‣

Answer

f. Someone is concerned that in the model above, you have not controlled for the difference in “culture” between TX and other bordering states. What would have to be true for this concern to be valid?

‣

Answer

g. Someone is concerned that you have not included year FE in the model. Explain whether this is a concern.

‣

Answer

Health Coverage

A state implemented a reform (at year 0) to increase health coverage. You are in charge of estimating the effect of this policy. Use the following graphs to answer the following questions:

‣

Figure A

‣

Figure B

‣

Figure C

You have decided to use a DD strategy to analyze the effects of this policy. Which panel provides the most appropriate control group to implement a DD strategy? Explain.

‣

Answer

Using the graph you picked in 4a, calculate the effect of the policy using a DD strategy. Show your work in a table:

	Before	After	Diff
Treatment
Control
Diff

‣

Answer

Imagine we had run the following regression: $Y=\alpha+\beta_{1}Treat+\beta_{2}Post+\beta_{3}Treat\times Post$ . Indicate the value of each coefficient.

‣

Answer

✅

\alpha=0.78,\beta_{1}=-0.18,\beta_{2}=0.03,\ \beta_{3}=0.11
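These coefficients follow mechanically from the four cell means in the DD table of this document (treatment 0.60 and 0.74, control 0.78 and 0.81). The sketch below shows the mapping; the dictionary layout and function name are illustrative choices.

```python
# Cell means from the DD table in this document.
means = {
    ("treatment", "before"): 0.60,
    ("treatment", "after"): 0.74,
    ("control", "before"): 0.78,
    ("control", "after"): 0.81,
}


def dd_coefficients(m):
    """Map the four cell means to (alpha, beta1, beta2, beta3)."""
    alpha = m[("control", "before")]                # control group, before
    beta1 = m[("treatment", "before")] - alpha      # pre-period gap
    beta2 = m[("control", "after")] - alpha         # change in control group
    treat_change = m[("treatment", "after")] - m[("treatment", "before")]
    beta3 = treat_change - beta2                    # difference-in-differences
    return alpha, beta1, beta2, beta3


print([round(c, 2) for c in dd_coefficients(means)])
```

Reading the coefficients this way also makes it easy to check a regression output against a hand-built 2x2 table.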

A bill in VA

For the following example, explain how to create a difference-in-difference design to estimate the effects of the policy:

The state of VA passed a bill in February 2015 to increase funding for mental health in schools. The government plans to use this money to increase the counselor-per-student ratio in each school. The bill was passed in February 2015 and enacted that Fall. You have a school-by-year dataset for all the southern states (using the census definition). This dataset includes the average students' mental health outcomes and other school characteristics. Write down the model you would use and explain what each (set of) variables means and how they would be coded (not the STATA code). Try to have a standard model and a generalized model.

‣

Answer

✅

There are several possible models:

$Mental\ Health\ Outcome_{csy}=\alpha_{0}+\beta_{1}VA_{s}+\beta_{2}Post2015_{y}+\beta_{3}Post2015_{y}\times VA_{s}+\epsilon_{csy}$

$Mental\ Health\ Outcome_{csy}=\alpha_{0}+\beta_{1}VA_{s}+\beta_{2}Post2015_{y}+\beta_{3}Post2015_{y}\times VA_{s}+\boldsymbol{\gamma}SchoolControls_{cs}+\epsilon_{csy}$

$Mental\ Health\ Outcome_{csy}=\alpha_{0}+\beta_{DD}(VA_{s}\times Post2015_{y})+\boldsymbol{\gamma}SchoolFE_{c}+\boldsymbol{\lambda}YearFE_{y}+\epsilon_{csy}$

VA is a variable that takes the value of 1 if the school is in VA and 0 otherwise. It captures the difference in MH outcomes between VA and non-VA schools before the bill. Post2015 is a dummy variable that takes the value of 1 if the year is post-2015 and 0 otherwise. It captures the change in MH outcomes in non-VA schools between the periods before and after 2015. The interaction of the two equals 1 only for VA schools in the post period, and its coefficient is the DD estimate. In the second specification we can include school-level controls, since these vary by school. These capture potential confounders that would affect both the outcome and the timing and adoption of the bill. In the third specification, we include school fixed effects, which means a dummy for each school in the sample (minus one), and year fixed effects, which means a dummy for each year in the sample (except one). In this specification we can’t include time-invariant school-level controls because they would be absorbed by the school FE. The school FE capture the time-invariant characteristics of each school. These are things like the student-teacher ratio, to the extent it does not change over time. The year FE capture shocks that are common to all schools in a given year, for example the release of a new social app or of a new TV show like “13 Reasons Why.”
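As an illustration of the coding described above, the sketch below builds the three DD regressors for a single school-year observation. It is not a full estimation. The function name, the state abbreviations, and the choice of 2015 as the first treated year are assumptions made for the example; the right cutoff depends on how the school year is recorded in the dataset.

```python
def code_dd_row(state, year, first_treated_year=2015):
    """Return the DD regressors for one school-year observation."""
    va = 1 if state == "VA" else 0                  # treated-state dummy
    # Assumption: observations from first_treated_year onward count as "post".
    post = 1 if year >= first_treated_year else 0
    return {"VA": va, "Post2015": post, "VA_x_Post2015": va * post}


# Hypothetical observations: a VA school and a TN school, before and after.
for state, year in (("VA", 2014), ("VA", 2016), ("TN", 2014), ("TN", 2016)):
    print(state, year, code_dd_row(state, year))
```

Only the VA observation in the post period has the interaction equal to 1, which is the cell the DD coefficient identifies.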

You have the following dataset representing the share of teens reporting anxiety in school. Using these data, create an event study figure with 2014 as your base year.

	Southern States	Virginia
2011	14	9
2012	15	9
2013	16	7
2014	17	6
2015	18	6
2016	19	8
2017	20	9
2018	21	10

‣

Answer

Use the event study to evaluate the assumptions of the research design. Is this supporting the assumptions or not?

‣

Answer

Discuss how these events could affect the causal interpretation of the generalized model from the example above. For each event, under what conditions would these events be a concern, and what conditions would it not be a concern?

A pandemic occurred in 2020.

‣

Answer

The rollout of a social media app (say Instagram) in 2015.

‣

Answer

The long-term closure of schools in Tennessee in 2013 due to unprecedented snow.

‣

Answer

Increases in teacher pay of about 10% in all southern states but VA in 2017

‣

Answer

Increases in teacher pay of about 10% in all southern states but VA in 2011

‣

Answer

True or False

A dataset is a city-year panel. In a DD design, we include city-fixed effects and year-fixed-effects. We cannot include a variable such as “area in squared miles” for each city.

True
False, explain

‣

Answer

In a regression that includes school-fixed effects, these capture all time-invariant characteristics across each child.

True
False, explain

‣

Answer

A new policy in Costa Rica has expanded the number of cafeterias in some public schools. To understand the effects of cafeterias on children's nutrition, we should control for enrollment in the schools to account for new students coming because of the new cafeterias.

True
False, explain

‣

Answer

If you were interested in the effect of attending a “selective college” (selective meaning something like an Ivy League school, so the main explanatory variable is being in a selective college or not) on lifetime earnings, you should not include college fixed effects.

True
False, explain

‣

Answer

A clear rule decides congressional elections: whoever gets the most votes in November wins. Because virtually every congressional race in the United States is between two parties, whoever gets more than 50% of the vote wins. We can use this fact to determine whether elected members of Congress (Republican or Democrat) follow the average ideology of their constituents (i.e., the median voter theory) or deviate in ideology because of the party they belong to. In other words, do elected members of Congress follow the people or their party? Some argue that Republicans and Democrats are very distinct; others say that members of Congress have a strong incentive to respond to the median voter in the district, regardless of party. We define the median voter as the voter who represents the median ideology of the population. We can assess how much party matters by looking at the ideology of members of Congress in the 112th Congress (which covers the years 2011 and 2012). Ideology measures the “conservatism”/“liberalism” of the members of Congress. This measure was developed by Carroll et al. (2009, 2014) and ranges from -0.779 to 1.293; higher values indicate more conservative voting in Congress. Share of vote Republican is the share of the vote received by the Republican congressional candidate in each district, and it ranges from 0 to 1.

a. If the elected member of Congress is to the left of the threshold, what party do they belong to?

Democrat
Republican

‣

Answer

b. If the elected member of Congress is to the right of the threshold, what party do they belong to?

Democrat
Republican

‣

Answer

c. What is the running variable?

Percentage of people that go vote
A dummy variable for whether the Republican candidate won
Share of votes for Republican candidate
The conservatism of the candidate

‣

Answer

d. What is the outcome variable?

A variable measuring distance from 50% share of republican voting
Ideology voting of the elected member of congress
Being Republican
Being Democrat

‣

Answer

e. Given the research question from the prompt, what is the treatment?

Being conservative
Being less conservative
Being from a particular party (R or D) and elected member of Congress

‣

Answer

f. Is this a sharp or fuzzy discontinuity?

Sharp
Fuzzy

‣

Answer

g. The results from this RD provide empirical evidence that strengthens which theory:

Congress members respond to the political party.
Congress members respond to the median voter.

‣

Answer

	Before	After	Diff
Treatment	0.60	0.74	0.14
Control	0.78	0.81	0.03
Diff	-0.18	-0.07	0.11