Purpose
The objective of this homework is for you to practice concepts learned in class and apply them to data. The research design we will practice here is “mean comparison” with and without covariates. The tool you will use is OLS, and we will also practice things like interactions and non-linearities. In addition, this HW is born out of a paper, so we’ll start diving into “reading & understanding papers.” Finally, you will practice speaking in both technical and “colloquial” ways and use our framework to understand what they are trying to say. It’s a jam-packed homework of fun and excitement. Clean your desk, get a bottle of water, pick your favorite beverage, turn on “do not disturb,” set a timer for 45 min (then take breaks), put on some work tunes, and dive into the fun of learning.
Guidelines
- You can work by yourself or with groups of up to three.
- Submit your group answers to Gradescope (within Canvas). One submission per group, please.
- You will get points for correct answers. You will lose points if an answer contains unnecessary information or mixes incorrect statements with accurate ones. In short, we are trying to incentivize students to use the fewest characters while maximizing the accuracy of responses. This will get stricter over time.
- Your responses should be professionally formatted and written.
- The due date is Wednesday, February 21st, at 8 am EST.
- You can round all of your answers to two decimal places (0.0x).
Preamble
As we have said in class, papers can entail evaluating a policy or helping us understand a social concept. This paper is about understanding a social concept, namely what drives people to change their perspectives on women’s rights: Does parenting daughters influence parental political behavior? This is the question we will try to answer with data. You can imagine why this question is interesting and what its repercussions are for policy. Sociologists have long argued that parenting daughters increases feminist sympathies. We could debate all day why it could or couldn’t and create many theories; that’s all fun, but in practice, what happens? Can we evaluate this empirically? We will assess this claim using data from a paper by Ebonya Washington that explores this question.
Let’s get things ready before we start digging into the data:
You can also obtain it from this link. If you want to practice the skill of obtaining data from its original source, here is the link to where you can find all the data. See if you can download basic.dta from there. This could be useful practice for future homework assignments.
Practice the above skill before reading the following text: You can find a detailed description of each variable in the original paper. The main variable in this analysis is AAUW, a score created by the American Association of University Women (AAUW). For each Congress, AAUW selects pieces of legislation in the areas of education, equality, and reproductive rights. The AAUW keeps track of how each legislator voted on these pieces of legislation and whether their vote aligned with the AAUW’s position. The legislator’s score is equal to the proportion of these votes made in agreement with the AAUW.
Getting to work
- Report summary statistics of the following variables (or their proxies) in the dataset: political party, age, race (only white or non-white), gender, AAUW score, the number of children, and the number of daughters. Compare the mean of each of these variables between legislators who have girls and legislators who do not. Your table should include the means for both groups, the sample size of each group, a column for the difference between means, and a column with the p-value testing whether the difference is statistically different from zero. (Note: there are many ways to construct this table. Try to use as much STATA as possible to construct it, but it is fine if some of the process happens outside STATA.) We recommend up to two decimal places for means or differences and up to four decimal places for p-values. You don’t need a column showing “total values” (i.e., the full sample).
- To disentangle the causal effect of parenting daughters on feminist sympathies, a peer suggested comparing the difference in AAUW scores before a congressperson had a girl and after they had a girl and averaging the difference. Let’s say your peer makes the above-mentioned comparison and concludes that having a girl increases one’s AAUW score, on average. Assuming that the causal effect of having a girl on AAUW scores is 0, what’s one reason that could explain your peer’s finding? (Recall that you have the ability to elevate your answer given the “signs” implied here.)
- Another peer looked into the data and told you the following statement: “I compared the share of legislators that report having a girl across different levels of the score and noticed that as the score increases, the likelihood of reporting having a girl does not increase; therefore, having a girl doesn’t really have an effect on the voting preferences of the legislators. I did this by comparing increases of 10 percentage points in AAUW score and observing changes in the share of people having a girl.”
- [Bonus] Write in conditional expectation language the exercise that this peer performed. Use variable names if it is easier.
- Another peer suggests comparing the AAUW score between legislators with girls and legislators with no girls.
- Write down this comparison in conditional expectation language. Use variable names if it is easier.
- Compute this difference in STATA. (Hint: use the sum and display commands.)
- Write down the regression that represents the main causal question of this exercise.
- Use STATA to run the regression to perform the comparison that your peer suggested. Put output in the box.
- Read the main finding from the regression result in a technical way. Make sure you are aware of units. Express the relationship as a percentage of the average score.
- Using a maximum of two sentences and non-technical language, answer the following question: What’s the conclusion from the empirical exercises above? (i.e., do not write out the interpretation of the regression coefficients)
- Assuming that the causal effect of having a girl on AAUW is positive, what’s one reason that could explain the findings in 4e?
- We will continue with the exercise above and compare AAUW scores between legislators who have girls and legislators who don’t.
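- The simple mean comparison may not provide the causal effect because of potential confounders. Therefore, we will add some covariates to mitigate bias and see how our result changes. Run the following regressions and report the results in a formatted table within STATA/R (Hint: use the esttab command; there is a worksheet on it):
(1) $aauw_i = \beta_0 + \beta_1 anygirls_i + \varepsilon_i$
(2) $aauw_i = \beta_0 + \beta_1 anygirls_i + \beta_2 totchi_i + \varepsilon_i$
(3) $aauw_i = \beta_0 + \beta_1 anygirls_i + \beta_2 totchi_i + \beta_3 female_i + \beta_4 repub_i + \varepsilon_i$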
- Could controlling for Republican be considered an issue here? Explain your answer.
- You should have found a difference in the coefficient on anygirls between equations (1) and (2). In technical language, or language we use in class, how do you explain the change from the result in equation (1) to equation (2)?
- Now explain the difference in non-technical language, or in a way everyone can understand. Write the number of characters at the end of your answer. It should be fewer than 1,000 for the answer to be counted as correct.
- Would your qualitative conclusion change given the results of equation (2) to equation (3)? Which control variable is particularly important and why? (Hint: feel free to run another regression to help support your claim)
- Consider the third specification (with three controls in addition to anygirls). Conditional on the number of children and the other variables, do you think anygirls is plausibly exogenous? What identifying assumption is necessary for the coefficient on anygirls to be interpreted as a causal estimate? What evidence does Washington give to support this assumption?
- Run the following regressions and show them in a nicely formatted table in STATA. Given the results from these regressions:
- Using results from equation (1), what is the full relationship between age and scores? Show your process.
- Using results from equation (2), what is the marginal effect of being white for non-females on scores? Show your process.
- Using results from equation (2), what is the marginal effect of being female for a white person on scores? Show your process.
- Using results from equation (3), what is the predicted score for a male who has a daughter and has the average number of children that male legislators have? (We’ll assume for the purposes of this question that non-female = male.) Show your process.
- Using results from equation (3), how does the effect of having a girl change by the gender of the legislator? Show your process and formulate a technical and a non-technical answer.
use "$hw/homework 1/basic.dta", clear
* Create the table of summary statistics with group means. The table ..., command() syntax requires STATA 17 or later
local myresults "NoGirls= r(mu_1) N1=r(N_1) AnyGirl = r(mu_2) N2=r(N_2) Diff = (r(mu_2)-r(mu_1)) pvalue = r(p)"
table (command) (result), ///
command(`myresults' : ttest repub, by(anygirls)) ///
command(`myresults' : ttest age, by(anygirls)) ///
command(`myresults' : ttest white, by(anygirls)) ///
command(`myresults' : ttest female, by(anygirls)) ///
command(`myresults' : ttest aauw, by(anygirls)) ///
command(`myresults' : ttest totchi, by(anygirls)) ///
command(`myresults' : ttest ngirls, by(anygirls)) ///
nformat(%6.2f NoGirls AnyGirl Diff) ///
nformat(%6.3f pvalue)
* Changing Labels
collect label levels command 1 "Share Republican" 2 "Age" 3 "Share White" 4 "Share Female" 5 "AAUW Score" 6 "Total Number of Children" 7 "Number of Girls", modify
* Changing Style of Cells
collect style cell result[NoGirls AnyGirl Diff], nformat(%8.2f)
collect style cell result[pvalue], nformat(%6.4f)
collect style cell border_block, border(right, pattern(nil))
collect preview
-----------------------------------------------------------------------
                           NoGirls    N1  AnyGirl    N2    Diff  pvalue
-----------------------------------------------------------------------
Share Republican              0.47   495     0.54  1241    0.06  0.0190
Age                          50.67   495    53.84  1241    3.17  0.0000
Share White                   0.85   495     0.87  1241    0.02  0.3763
Share Female                  0.12   495     0.12  1241   -0.00  0.9657
AAUW Score                   51.72   494    46.93  1241   -4.79  0.0337
Total Number of Children      1.10   495     2.98  1241    1.89  0.0000
Number of Girls               0.00   495     1.73  1241    1.73  0.0000
-----------------------------------------------------------------------
. summ aauw if anygirls==1
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
aauw | 1,241 46.93473 42.27097 0 100
. summ aauw if anygirls==0
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
aauw | 494 51.72267 42.53554 0 100
. display 46.93473 - 51.72267
-4.78794
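In conditional expectation language, the comparison computed above can be written as follows. The notation is one reasonable choice using the dataset's variable names, not necessarily the exact notation used in class.

```latex
\begin{align*}
\widehat{\Delta}
  &= \widehat{E}[\,aauw \mid anygirls = 1\,] - \widehat{E}[\,aauw \mid anygirls = 0\,] \\
  &= 46.93 - 51.72 \\
  &\approx -4.79
\end{align*}
```

The two sample means are the ones reported by the summ commands above, so the result matches the display output of -4.78794 up to rounding.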
reg aauw anygirls
Source | SS df MS Number of obs = 1,735
-------------+---------------------------------- F(1, 1733) = 4.52
Model | 8100.22373 1 8100.22373 Prob > F = 0.0337
Residual | 3107646.72 1,733 1793.21796 R-squared = 0.0026
-------------+---------------------------------- Adj R-squared = 0.0020
Total | 3115746.94 1,734 1796.85522 Root MSE = 42.346
------------------------------------------------------------------------------
aauw | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
anygirls | -4.787942 2.25277 -2.13 0.034 -9.206377 -.3695074
_cons | 51.72267 1.905255 27.15 0.000 47.98583 55.45951
------------------------------------------------------------------------------
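As a sketch of how the output above maps to an equation and to a percentage of the average score: the regression and its fitted values are read directly from the table, while the overall mean is an assumption-laden step, computed here as the sample-size-weighted average of the two group means reported earlier.

```latex
\begin{align*}
aauw_i &= \beta_0 + \beta_1\, anygirls_i + \varepsilon_i,
  \qquad \hat{\beta}_0 = 51.72,\quad \hat{\beta}_1 = -4.79 \\[4pt]
\overline{aauw} &= \frac{494 \times 51.72 + 1241 \times 46.93}{1735} \approx 48.30 \\[4pt]
\frac{\hat{\beta}_1}{\overline{aauw}} &= \frac{-4.79}{48.30} \approx -0.099
  \quad \text{(about } -9.9\% \text{ of the average score)}
\end{align*}
```

If the no-girls mean (the constant, 51.72) is used as the base instead, the same coefficient corresponds to roughly -9.3%, so it is worth stating which average is meant.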
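* Clear any stored estimates, then store the three specifications
estimates clear
eststo: reg aauw anygirls
eststo: reg aauw anygirls totchi
eststo: reg aauw anygirls totchi female repub
* Report coefficients with standard errors in parentheses
esttab, se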
------------------------------------------------------------
                      (1)             (2)             (3)
                     aauw            aauw            aauw
------------------------------------------------------------
anygirls           -4.788*          7.404**         3.508**
                  (2.253)         (2.604)         (1.207)

totchi                             -6.430***       -2.010***
                                  (0.731)         (0.343)

female                                              12.05***
                                                  (1.421)

repub                                              -72.91***
                                                  (0.947)

_cons               51.72***        58.71***        86.95***
                  (1.905)         (2.027)         (1.044)
------------------------------------------------------------
N                    1735            1735            1735
------------------------------------------------------------
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001
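One way to see why the anygirls coefficient flips sign between columns (1) and (2) is the omitted variable bias identity. The sketch below uses the column (2) estimates and takes the auxiliary slope (the difference in totchi between the two groups, 1.89) from the summary statistics table; the symbol names are illustrative assumptions.

```latex
\begin{align*}
\hat{\beta}_{anygirls}^{\,(1)}
  &= \hat{\beta}_{anygirls}^{\,(2)} + \hat{\beta}_{totchi}^{\,(2)} \times \hat{\delta},
  \qquad \hat{\delta} = \text{slope from regressing } totchi \text{ on } anygirls \\
  &\approx 7.404 + (-6.430)(1.89) \\
  &\approx -4.75
\end{align*}
```

This is close to the reported -4.788; the small gap is consistent with rounding of the 1.89 and with the summary table using one more observation than the regressions. Legislators with girls have more children on average, and more children is associated with lower scores, so omitting totchi pushes the short-regression coefficient downward.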
------------------------------------------------------------
(1) (2) (3)
anygirls anygirls anygirls
------------------------------------------------------------
totchi 0.148*** 0.150*** 0.150***
(0.00572) (0.00579) (0.00579)
female 0.0247 0.0187
(0.0281) (0.0284)
repub -0.0283 -0.0264
(0.0187) (0.0189)
_cons 0.349*** 0.363*** 0.360***
(0.0172) (0.0183) (0.0190)
------------------------------------------------------------
N 1736 1736 1736
------------------------------------------------------------
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001
Now we set the first derivative with respect to age equal to 0 to find the min or max:
Now we plug in those values from the regression:
Now we need to find out whether this is a min or a max; for that, we take the second derivative:
Since the second derivative is negative, given the negative coefficient on age squared, this indicates that we are dealing with a max.
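The steps above can be sketched symbolically as follows. The coefficient labels are assumptions for illustration (the fitted values are not shown in this section), with equation (1) taken to include age and age squared.

```latex
\begin{align*}
aauw_i &= \beta_0 + \beta_{1}\, age_i + \beta_{2}\, age_i^{2} + \dots + \varepsilon_i \\[4pt]
\frac{\partial\, aauw}{\partial\, age} &= \beta_{1} + 2\beta_{2}\, age = 0
  \quad\Longrightarrow\quad age^{*} = -\frac{\beta_{1}}{2\beta_{2}} \\[4pt]
\frac{\partial^{2} aauw}{\partial\, age^{2}} &= 2\beta_{2} < 0
  \quad\Longrightarrow\quad age^{*} \text{ is a maximum}
\end{align*}
```

Plugging the estimated coefficients into the expression for the turning point gives the age of roughly 67 that the text refers to.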
Now, it could be the case that no legislator is 67 or older, so let's check the range of age in our data, because otherwise the min or max may never be reached.
sum age
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
age | 1,740 52.92759 9.504845 26 87
From here we see that the effect of having a girl varies by whether we include the interaction coefficient (on anygirls × female) or not; that coefficient equals 5.451. Now we have the information for the whole answer: the effect of having a girl varies by the gender of the legislator. It is larger for female legislators than for non-female legislators; in fact, it is 5.451 points higher for female legislators than for male legislators. In colloquial terms, having a daughter has a larger effect for female legislators than for male legislators.
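A sketch of the algebra behind this answer, assuming equation (3) has the form below (the exact specification and coefficient labels are assumptions; only the interaction estimate of 5.451 comes from the text):

```latex
\begin{align*}
aauw_i &= \beta_0 + \beta_1\, anygirls_i + \beta_2\, female_i
          + \beta_3\,(anygirls_i \times female_i) + \beta_4\, totchi_i + \varepsilon_i \\[4pt]
\text{Effect of a girl, non-female:} &\quad \beta_1 \\
\text{Effect of a girl, female:} &\quad \beta_1 + \beta_3 \\
\text{Difference by gender:} &\quad (\beta_1 + \beta_3) - \beta_1 = \beta_3,
  \qquad \hat{\beta}_3 = 5.451
\end{align*}
```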
Extensions
This section is not graded and you don’t have to submit it, but it will help you push your thinking further. Think of the “extensions” as questions that we could have asked in this homework but decided not to grade. Therefore, you should know how to answer these questions, or at least treat them as practice questions.
- Let’s say you want to test even more hypotheses with these data and model:
- If you wanted to explore not just the margin of having a girl but the number of girls, what regressions would you run?
- If you wanted to explore whether the effect on female legislators differs from the effect on male legislators, what regressions would you run? Why doesn’t just controlling for “female” count as exploring the difference in effect between men and women?
We could just change the main explanatory variable to “ngirls” which represents the number of girls. We would then interpret the coefficient on ngirls as the effect of having one more girl.
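A minimal sketch of that regression in STATA. The control set mirrors the third specification used earlier, which is an assumption; any of the earlier specifications could be adapted the same way.

```stata
* Sketch: swap the indicator anygirls for the count ngirls.
* Controls (totchi female repub) are assumed to match specification (3).
use "$hw/homework 1/basic.dta", clear
regress aauw ngirls totchi female repub
```

Because totchi is held fixed, the coefficient on ngirls compares legislators with the same number of children, one of whom has one more daughter (and therefore one fewer son).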
--------------------------------------------
(Female Sample) (Male Sample)
aauw aauw
--------------------------------------------
anygirls 9.058** 2.639*
(3.150) (1.299)
totchi -0.808 -2.147***
(0.977) (0.365)
repub -71.26*** -73.18***
(2.526) (1.016)
_cons 91.78*** 88.06***
(2.309) (1.116)
--------------------------------------------
N 213 1522
--------------------------------------------
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001
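A table like the one above could be produced with a split-sample approach along these lines. This is a sketch: the stored-estimate names female_s and male_s are hypothetical, and the pooled regression at the end is an alternative suggested here, not necessarily the one used for the table.

```stata
use "$hw/homework 1/basic.dta", clear
estimates clear

* Same specification, estimated separately by gender of the legislator
eststo female_s: regress aauw anygirls totchi repub if female == 1
eststo male_s:   regress aauw anygirls totchi repub if female == 0
esttab female_s male_s, se mtitles("Female Sample" "Male Sample")

* Pooled alternative: interact every regressor with female.
* The coefficient on the female x anygirls interaction then equals the
* difference between the two split-sample anygirls coefficients.
regress aauw i.female##(i.anygirls c.totchi i.repub)
```

Simply adding female as a control only shifts the intercept for women; it forces the anygirls slope to be the same for both genders. Letting the slope differ requires either separate samples or an interaction term. From the table above, the implied difference in slopes is 9.058 - 2.639 = 6.419.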
- Using the residual regression approach, obtain the coefficient on anygirls from equation (3).
- Obtain an approximate coefficient from equation using the “by hand” approach as seen in the worksheet.
- Checking the relationship between variables and our Y is important. Install the command binscatter2. This command produces graphs in which you can observe the relationship between two variables, and it works great for visualizing non-linear relationships. Try the following commands.
- Write out a single regression equation that represents the last graph produced.
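For the residual regression approach, a minimal sketch is shown below. It assumes the coefficient of interest is the one on anygirls in the specification with totchi, female, and repub; the variable name girls_res is hypothetical.

```stata
use "$hw/homework 1/basic.dta", clear

* Restrict to the estimation sample of the full regression
regress aauw anygirls totchi female repub
keep if e(sample)

* Step 1: strip out the part of anygirls explained by the controls
regress anygirls totchi female repub
predict double girls_res, residuals

* Step 2: regress the outcome on what is left of anygirls
regress aauw girls_res
```

The coefficient on girls_res in the last regression should match the anygirls coefficient from the full regression. The standard errors will differ somewhat because the second step does not account for the controls in its degrees of freedom or residual variance.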
Using the “by-hand” method, the weighted mean is 8.61706968, or about 8.62, which is near 7.04. We used Excel for this.
0 | . | 56.41333 | 225 | - |
1 | 57.35714 | 68.78082 | 185 | -11.42368 |
2 | 55.34515 | 44.37143 | 563 | 10.97372 |
3 | 44.45198 | 29.46667 | 399 | 14.98531 |
4 | 37.47716 | 26.71429 | 204 | 10.76287 |
5 | 42.00952 | 4 | 108 | 38.00952 |
6 | 18 | 100 | 17 | -82 |
7 | 39.28571 | - | 14 | - |
8 | 1.083333 | - | 12 | - |
9 | 6 | - | 3 | - |
10 | 3 | - | 4 | - |
11 | - | - | 0 | - |
12 | 0 | - | 1 | - |
Total | 494 | 1,241 | 1,735 |
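The arithmetic behind the weighted mean can be reconstructed as follows. This is a reading of the table rather than a documented procedure: it takes the last column as the within-cell difference in mean AAUW score (with girls minus without), the fourth column as the cell size $N_k$, and it skips cells where the difference is undefined.

```latex
\begin{align*}
\sum_{k} N_k\, \Delta_k
  &= 185(-11.42368) + 563(10.97372) + 399(14.98531) \\
  &\quad + 204(10.76287) + 108(38.00952) + 17(-82) \\
  &\approx 14950.62 \\[4pt]
\frac{\sum_{k} N_k\, \Delta_k}{1735} &\approx 8.617
\end{align*}
```

Dividing by the full sample of 1,735 reproduces the reported figure. Dividing instead by the 1,476 observations in cells where both groups are present would give about 10.13, so the choice of denominator matters for this approximation.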
binscatter2 aauw age
binscatter2 aauw age, linetype(qfit)
Then compare this to the regression with age and age squared. Now run the following command:
binscatter2 aauw age, linetype(qfit) by(female)
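A single regression that would correspond to the last graph, assuming that the by(female) option fits a separate quadratic in age for each group, is sketched below. The coefficient labels are illustrative.

```latex
\begin{align*}
aauw_i = \beta_0 &+ \beta_1\, age_i + \beta_2\, age_i^{2} + \beta_3\, female_i \\
  &+ \beta_4\,(female_i \times age_i) + \beta_5\,(female_i \times age_i^{2}) + \varepsilon_i
\end{align*}
```

Here $\beta_1$ and $\beta_2$ trace the curve for non-female legislators, while $\beta_3$, $\beta_4$, and $\beta_5$ let the intercept, slope, and curvature differ for female legislators.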