STATA: Binscatter and Collapse

This worksheet walks you through the commands binscatter, collapse, and twoway. These commands are worth learning because they are very useful for describing data, especially in regression discontinuity (RD) settings.

Let’s start with binscatter, a command created by Michael Stepner and Jessica Laird. There is a newer version called binscatter2; I’ll explain the differences later, but for now binscatter will do the trick. If you want more detailed information, check these slides from the author.

Let’s first install it:

ssc install binscatter

After installing binscatter, you can read the documentation by running help binscatter.

Now let’s open some data. Let’s follow the example from this worksheet. You can download the data from here:

use "$data/LudwigMiller_head_start.dta", clear

These data contain information by county: the county’s poverty rate (measured as the distance from the threshold for Head Start eligibility), Head Start eligibility, and the mortality rate. Let’s start with a binscatter.

binscatter Mortality Poverty

What is this doing? There are a couple of moving parts. The first is the line of best fit, whose slope tells us how Mortality changes with Poverty. If you run a regression of Mortality on Poverty, the coefficient on Poverty is the slope of that line. Let’s try it:

reg Mortality Poverty

So the coefficient on Poverty (0.2345494) is the slope of the red line above. To check this by eye, you can add more labels to the axes. The options below ask for about 9 labeled values on each of the x and y axes instead of the default number:

binscatter Mortality Poverty, xlabel(#9) ylabel(#9)
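
If reading the slope off the graph feels imprecise, here is a small sketch that compares the two numbers directly. It assumes the data are loaded and binscatter is installed as above; b_reg is just a name I made up to hold the coefficient.

```stata
* Assumes the Head Start data are in memory and binscatter is installed
quietly regress Mortality Poverty

* b_reg is a hypothetical scalar name holding the slope from the regression
scalar b_reg = _b[Poverty]
display "Slope from regress: " scalar(b_reg)

* reportreg asks binscatter to print the regression behind its fit line
binscatter Mortality Poverty, reportreg
```

The coefficient on Poverty printed by reportreg should match the one displayed from regress.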

Ok, now let’s figure out what the points are. First, we know that these points are not actual data points. That’s easy to tell because we have 3,109 observations and there are only 20 points on this graph. Each point represents the mean mortality for a given set of “poverty” values. That is, binscatter creates bins of values of X (e.g. Poverty from -5.6 to -4.8) and then, for each bin, computes the mean of Y (Mortality). The point is plotted at the mean of X within that bin.

How does binscatter decide the bins? By default, it creates 20 equal-sized bins. That is, the number of observations in each bin is the same or approximately the same, so each point summarizes a similar number of observations. Why 20? It’s just an arbitrary default, and you can change it in the command:

binscatter Mortality Poverty,  nquantiles(40)

binscatter Mortality Poverty,  nquantiles(5)
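
To see what “equal-sized” means in practice, here is a minimal sketch that builds quantile bins with Stata’s built-in xtile command and counts the observations in each one. I am assuming xtile’s quantile bins are a reasonable stand-in for what binscatter does internally; q20 is a hypothetical variable name used only for this illustration.

```stata
* Assumes the Head Start data are in memory
* q20 is a hypothetical variable name for this illustration
xtile q20 = Poverty, nquantiles(20)

* The counts per bin should be the same or nearly the same
tabulate q20
```

With 3,109 observations and 20 bins, you should see roughly 155 observations per bin.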

So now let’s understand the binning and how it works. Let’s create the bins that binscatter uses:

binscatter Mortality Poverty,  genxq(bins)

This will create a variable called bins. If you type br Mortality Poverty bins, you will be able to see which bin each observation belongs to. Now let’s see it visually. Let’s plot the full data:

twoway (scatter Mortality Poverty, sort)

This is the full data, with all the points. It’s clear how binscatter lets us see the pattern better by binning the data. Here is a graph with the bins of X marked. Notice that some bins cover a wider range than others; that’s because the bins balance the number of observations rather than the width.

twoway (scatter Mortality Poverty, sort), xline(-5.703496 -4.651548   -4.411279 -4.231661 -4.013422 -3.818647 -3.606618  -3.425287 -3.212406  -2.995848 -2.785729 -2.565694 -2.313802 -2.030928 -1.716421 -1.356852 -.9406679 -.503325 -.0453706 .4964114 2.237187)

I obtained the bin boundaries by asking Stata for the minimum and maximum of Poverty within each bin:

table bins, statistic(min Poverty) statistic(max Poverty)

--------------------------------------------------------
                        |  Minimum value   Maximum value
------------------------+-------------------------------
20 quantiles of Poverty |                               
  1                     |      -5.703496       -4.656513
  2                     |      -4.651548       -4.411495
  3                     |      -4.411279       -4.232955
  4                     |      -4.231661       -4.015943
  5                     |      -4.013422       -3.818717
  6                     |      -3.818647       -3.609018
  7                     |      -3.606618       -3.426195
  8                     |      -3.425287       -3.215073
  9                     |      -3.212406       -2.996294
  10                    |      -2.995848       -2.789504
  11                    |      -2.785729        -2.56633
  12                    |      -2.565694       -2.315941
  13                    |      -2.313802        -2.03276
  14                    |      -2.030928       -1.716709
  15                    |      -1.716421       -1.357394
  16                    |      -1.356852        -.943292
  17                    |      -.9406679       -.5042802
  18                    |       -.503325       -.0486356
  19                    |      -.0453706        .4946207
  20                    |       .4964114        2.237187
  Total                 |      -5.703496        2.237187
--------------------------------------------------------
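
Two optional variations on this step, as a sketch. First, the statistic() option belongs to the newer table syntax (Stata 17 onward, to my knowledge), so tabstat is an alternative if that line errors out for you. Second, instead of copying the minimums into xline() by hand, you can collect them in a local macro. The macro name cuts is my own, and the loop assumes the bins variable runs from 1 to 20 as in the table above.

```stata
* Assumes the bins variable created by genxq(bins) is in memory
tabstat Poverty, by(bins) statistics(min max n)

* Collect the minimum of each bin, then append the overall maximum
* cuts is a hypothetical macro name
local cuts
forvalues b = 1/20 {
    quietly summarize Poverty if bins == `b'
    local cuts `cuts' `r(min)'
}
quietly summarize Poverty
local cuts `cuts' `r(max)'

twoway (scatter Mortality Poverty), xline(`cuts')
```

Run these lines together from a do-file, since a local macro disappears once the block that defined it finishes running.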

Then binscatter gets the mean of Y within each of those bins:

bys bins: egen mean_y=mean(Mortality)

twoway (scatter mean_y Poverty, sort), xline(-5.703496 -4.651548   -4.411279 -4.231661 -4.013422 -3.818647 -3.606618  -3.425287 -3.212406  -2.995848 -2.785729 -2.565694 -2.313802 -2.030928 -1.716421 -1.356852 -.9406679 -.503325 -.0453706 .4964114 2.237187)

Now, one more to really see it:

binscatter Mortality Poverty,  xline(-5.703496 -4.651548   -4.411279 -4.231661 -4.013422 -3.818647 -3.606618  -3.425287 -3.212406  -2.995848 -2.785729 -2.565694 -2.313802 -2.030928 -1.716421 -1.356852 -.9406679 -.503325 -.0453706 .4964114 2.237187) linetype(noline)

Now let’s take off the vertical lines:

binscatter Mortality Poverty,  linetype(noline)

and then we add the line back:

binscatter Mortality Poverty
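
If you want to rebuild that last graph from the pieces we already have, here is a minimal sketch. It assumes bins (from genxq) and mean_y (from the egen line) are still in memory; mean_x and one_per_bin are names I made up for this example.

```stata
* Assumes bins and mean_y exist from the steps above
* mean_x and one_per_bin are hypothetical variable names
bysort bins: egen mean_x = mean(Poverty)
egen byte one_per_bin = tag(bins)

* One point per bin at (mean of X, mean of Y), plus a fit line on the full data
twoway (scatter mean_y mean_x if one_per_bin) (lfit Mortality Poverty)
```

The tag() function flags exactly one observation per bin, so the scatter draws 20 points instead of 3,109 overlapping ones.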

Ok, so hopefully you get the main idea of how binscatter works. Now let’s dive into collapse, which will also help reinforce binscatter. Collapse replaces the data in memory with summary statistics (here, means) by group. It’s a great command for preparing data to graph. Let’s apply it!

collapse (mean) Mortality, by(bins)

This command says: collapse the data to obtain the mean of Mortality for each value of the variable bins. Since bins is the variable created by the binscatter command above, the mean mortality for each bin should match the height of the points in the graph above. Now browse your data:

br

As you can see, you have 20 observations for 20 bins, and the other variable is the mean Mortality for each bin. Now let’s plot it:

twoway (scatter Mortality bins, sort)

Now let’s add the line:

twoway (scatter Mortality bins, sort) (lfit Mortality bins)

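And, as you can see, this reproduces the pattern of the binscatter graph, but we did it by hand. One difference: the x-axis here is the bin number (1 to 20) rather than the mean of Poverty within each bin, so the scale and the slope of the line are not the same as in binscatter. Also notice that collapse replaced the data in memory, so if we want to go back to our original data, we have to open it again.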
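
Here is a sketch of one way to avoid reopening the file: wrap the collapse in preserve and restore. It also collapses Poverty along with Mortality, so the x-axis is the mean of Poverty in each bin, as in binscatter. The name n_obs is my own. Keep in mind that lfit here is fitted to the 20 bin means, so it may not coincide exactly with binscatter’s line.

```stata
* Assumes $data points at the folder with the file, as above
use "$data/LudwigMiller_head_start.dta", clear
binscatter Mortality Poverty, genxq(bins)

* Take a snapshot of the full data before collapsing
preserve

* n_obs is a hypothetical name for the number of observations per bin
collapse (mean) Mortality Poverty (count) n_obs = Mortality, by(bins)
list in 1/5
twoway (scatter Mortality Poverty) (lfit Mortality Poverty)

* Bring the full data back into memory
restore
```

After restore, the full county-level data (including bins) is back, and the collapsed version is gone.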

Binscatter and RD

We already talked in class about this, but we can use binscatter for RD in a very simple way!

use "$data/LudwigMiller_head_start.dta", clear
binscatter Mortality Poverty, rd(0)
* where 0 is the cutoff value of where the discontinuity happens

You can also fit a quadratic instead of a straight line on each side of the cutoff:

binscatter Mortality Poverty, rd(0) linetype(qfit)
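
To put a number on the jump that the rd(0) graph shows, here is a rough sketch using a regression with a separate intercept and slope on each side of the cutoff. This is a simple linear fit on all the data, not a full RD estimator with a bandwidth. The variable above is a name I made up, and putting observations exactly at 0 on the right-hand side is my assumption.

```stata
* Assumes the Head Start data are in memory
* above is a hypothetical indicator for being at or past the cutoff
generate byte above = Poverty >= 0 if !missing(Poverty)

* Separate intercepts and slopes on each side of the cutoff
regress Mortality i.above##c.Poverty

* The coefficient on 1.above is the estimated jump in Mortality at Poverty = 0
display "Estimated jump at the cutoff: " _b[1.above]
```

Because the cutoff is at 0, the difference in intercepts is the gap between the two fitted lines at the cutoff.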