Stata: Summary Statistics Tables

In this worksheet we work through several ways to create summary statistics tables in Stata, going from the most basic approach to polished, publication-ready tables. We use the built-in nlsw88 dataset (National Longitudinal Survey of Women, 1988). Our binary grouping variable is union (1 = union member, 0 = not). Recall that for the homework you can also build this table in Excel, but if you want to learn a few ways of doing it in Stata, here they are:

Setup

Before we begin, let’s load the data and see what we’re working with.

sysuse nlsw88, clear

describe union wage age tenure ttl_exp hours

Output:

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
union           byte    %8.0g      unionlbl   Union worker
wage            float   %9.0g                 Hourly wage
age             byte    %8.0g                 Age in current year
tenure          float   %9.0g                 Job tenure (years)
ttl_exp         float   %9.0g                 Total work experience (years)
hours           byte    %8.0g                 Usual hours worked

Let’s check how our grouping variable looks:

tab union

      Union |
     worker |      Freq.     Percent        Cum.
------------+-----------------------------------
   Nonunion |      1,417       75.45       75.45
      Union |        461       24.55      100.00
------------+-----------------------------------
      Total |      1,878      100.00

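About 75% are non-union and 25% are union. Note that the total of 1,878 counts only observations where union is non-missing; the full dataset has 2,246 observations. Now let’s define a global macro with the variables we want to summarize: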

global sumvars "wage age tenure ttl_exp hours"
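Because the group counts above cover only observations where union is observed, it can help to check missingness before building any table. The lines below are a minimal sketch using two built-in commands, misstable summarize and count; nothing here goes beyond the variables already listed.

```stata
sysuse nlsw88, clear
global sumvars "wage age tenure ttl_exp hours"

* Report missing-value counts for each variable
* (variables with no missing values are left out of the output)
misstable summarize union $sumvars

* Number of observations excluded from every by-union comparison
count if missing(union)
```

This is why N varies slightly from row to row in the tables that follow: each statistic uses whatever observations are available for that variable.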

Method 1: `summarize` with `if` (The Basics)

The simplest approach. Just run summarize twice, once for each group.

summarize $sumvars if union == 1

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        wage |        461    8.674294    4.174539    1.80602   39.23074
         age |        461    39.28416    3.022299         34         46
      tenure |        460    7.888225    6.105057          0   25.91667
     ttl_exp |        461    13.25391    4.553527   1.474359   25.98718
       hours |        461    38.65944    9.110139          2         70

summarize $sumvars if union == 0

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        wage |      1,417    7.204669    4.103694   1.151368   30.96618
         age |      1,417    39.20536    3.039435         34         46
      tenure |      1,408    6.140743    5.413678          0      24.75
     ttl_exp |      1,417    12.67667      4.6162   .1153846   28.88461
       hours |      1,416    37.26201    10.22723          1         80
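The two calls above can also be written as a single command with the by prefix. This is only a sketch of an alternative; the output is still one block per group. Keep in mind that by treats missing values as their own group, so the if qualifier is added here to match the two tables above.

```stata
sysuse nlsw88, clear

* One command, one block of output per value of union.
* Without the if qualifier, observations with missing union
* would show up as a third group.
bysort union: summarize wage age tenure ttl_exp hours if !missing(union)
```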

Pros: Very easy. No extra packages needed.

Cons: You get two separate tables. No difference or significance test. You have to eyeball everything. Not something you’d put in a paper.

Method 2: `tabstat` with `by()` (A Step Up)

tabstat lets you see both groups in one table. You can also request specific statistics.

tabstat $sumvars, by(union) stats(mean sd n) columns(statistics)

Summary for variables: wage age tenure ttl_exp hours
Group variable: union (Union worker)

   union |      Mean        SD         N
---------+------------------------------
Nonunion |  7.204669  4.103694      1417
         |  39.20536  3.039435      1417
         |  6.140743  5.413678      1408
         |  12.67667    4.6162      1417
         |  37.26201  10.22723      1416
---------+------------------------------
   Union |  8.674294  4.174539       461
         |  39.28416  3.022299       461
         |  7.888225  6.105057       460
         |  13.25391  4.553527       461
         |  38.65944  9.110139       461
---------+------------------------------
   Total |  7.565423  4.168369      1878
         |  39.22471  3.034623      1878
         |  6.571065  5.640675      1868
         |  12.81837  4.606392      1878
         |  37.60522   9.98027      1877
----------------------------------------

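This layout does not print variable names: within each group, the rows follow the order of the variables in $sumvars (wage, age, tenure, ttl_exp, hours). You can clean up the numbers with format() and request more statistics. Note that the format applies to every column, so N is also shown with two decimals: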

tabstat $sumvars, by(union) stats(mean sd min max n) columns(statistics) format(%9.2f)

   union |      Mean        SD       Min       Max         N
---------+--------------------------------------------------
Nonunion |      7.20      4.10      1.15     30.97   1417.00
         |     39.21      3.04     34.00     46.00   1417.00
         |      6.14      5.41      0.00     24.75   1408.00
         |     12.68      4.62      0.12     28.88   1417.00
         |     37.26     10.23      1.00     80.00   1416.00
---------+--------------------------------------------------
   Union |      8.67      4.17      1.81     39.23    461.00
         |     39.28      3.02     34.00     46.00    461.00
         |      7.89      6.11      0.00     25.92    460.00
         |     13.25      4.55      1.47     25.99    461.00
         |     38.66      9.11      2.00     70.00    461.00
------------------------------------------------------------
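Since the rows in these tables are unlabeled, one option worth trying is tabstat's longstub, which widens the left stub so that variable names are printed next to each group. A minimal sketch:

```stata
sysuse nlsw88, clear

* longstub adds the variable name to each row of the by() table
tabstat wage age tenure ttl_exp hours, by(union) stats(mean sd n) ///
    columns(statistics) longstub format(%9.2f)
```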

Pros: Both groups in one view. You can pick your statistics (mean, sd, n, min, max, etc.).

Cons: Still no difference in means or significance test. You’d have to calculate those yourself.
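To see what "calculating it yourself" would involve, here is a rough sketch for a single variable. It relies on r(mean), which summarize leaves behind after each run; the scalar names mean_nonunion and mean_union are made up for this example.

```stata
sysuse nlsw88, clear

* Save each group's mean of wage in a scalar
quietly summarize wage if union == 0
scalar mean_nonunion = r(mean)
quietly summarize wage if union == 1
scalar mean_union = r(mean)

* Difference in means (no standard error or test yet)
display "Difference (Nonunion - Union): " %6.3f (mean_nonunion - mean_union)
```

This gives the point estimate only. A standard error and p-value still need a proper test, which is what the next method provides.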

Method 3: `ttest` (Getting the Difference + Significance)

The ttest command gives you group means, the difference, and a p-value for whether the difference is statistically significant. The downside: it only works one variable at a time.

ttest wage, by(union)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
Nonunion |   1,417    7.204669    .1090159    4.103694    6.990819    7.418519
   Union |     461    8.674294    .1944277    4.174539    8.292218    9.056371
---------+--------------------------------------------------------------------
Combined |   1,878    7.565423    .0961874    4.168369    7.376778    7.754069
---------+--------------------------------------------------------------------
    diff |           -1.469625    .2209702               -1.902999   -1.036252
------------------------------------------------------------------------------
    diff = mean(Nonunion) - mean(Union)                           t =  -6.6508
H0: diff = 0                                     Degrees of freedom =     1876

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
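The header of this output notes that the test assumes equal variances in the two groups. If you would rather not assume that, ttest has an unequal option; the sketch below is meant as a robustness check, not a replacement for the table above.

```stata
sysuse nlsw88, clear

* Two-sample t test allowing unequal variances
* (uses Satterthwaite's approximation for the degrees of freedom)
ttest wage, by(union) unequal
```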

To do this for all your variables, you can loop:

foreach var of global sumvars {
    ttest `var', by(union)
}

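How to read this output (using the wage table above): The diff row shows that non-union workers earn $1.47 less per hour on average than union workers. The two-sided p-value, Pr(|T| > |t|), is displayed as 0.0000, meaning it is smaller than 0.00005, so the difference is statistically significant at any conventional level.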

Pros: You get the difference and a formal significance test with p-values and confidence intervals.

Cons: Output is verbose - one giant table per variable. Hard to present in a paper. You’d need to manually extract numbers.
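One way to avoid copying numbers by hand is to run each test quietly and print only the stored results. The sketch below uses r(mu_1), r(mu_2), and r(p), which ttest saves after every run; the column positions are arbitrary choices for this example.

```stata
sysuse nlsw88, clear
global sumvars "wage age tenure ttl_exp hours"

* Header row for a compact results listing
display "Variable" _col(12) "Nonunion" _col(24) "Union" _col(36) "Diff" _col(48) "p-value"

foreach var of global sumvars {
    quietly ttest `var', by(union)
    * r(mu_1) and r(mu_2) are the two group means; r(p) is the two-sided p-value
    display "`var'" _col(12) %8.3f r(mu_1) _col(24) %8.3f r(mu_2) ///
        _col(36) %8.3f (r(mu_1) - r(mu_2)) _col(48) %6.4f r(p)
}
```

This is still just console output, but it shows that everything needed for a balance table is available programmatically.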

Method 4: `estpost tabstat` + `esttab` (Formatted Table)

Now we start using the estout package (which contains esttab). If you haven’t installed it:

ssc install estout, replace

estpost tabstat stores the results from tabstat so that esttab can display them in a clean table.

estpost tabstat $sumvars, by(union) statistics(mean sd count) columns(statistics)
esttab ., cells("mean(fmt(2)) sd(fmt(2)) count(fmt(0))") noobs nonumber label

-----------------------------------------------------------

                             mean           sd        count
-----------------------------------------------------------
Nonunion
Hourly wage                  7.20         4.10         1417
Age in current year         39.21         3.04         1417
Job tenure (years)           6.14         5.41         1408
Total work experie~)        12.68         4.62         1417
Usual hours worked          37.26        10.23         1416
-----------------------------------------------------------
Union
Hourly wage                  8.67         4.17          461
Age in current year         39.28         3.02          461
Job tenure (years)           7.89         6.11          460
Total work experie~)        13.25         4.55          461
Usual hours worked          38.66         9.11          461
-----------------------------------------------------------

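You can also display the standard deviation in parentheses (a common formatting convention). Note that this call adds the listwise option, which keeps only observations with no missing values on any of the listed variables. That is why a few means and SDs below differ slightly from the previous table: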

estpost tabstat $sumvars, by(union) statistics(mean sd) columns(statistics) listwise
esttab ., cells("mean(fmt(2)) sd(par fmt(2))") noobs nonumber label

----------------------------------------------

                             mean           sd
----------------------------------------------
Nonunion
Hourly wage                  7.23       (4.11)
Age in current year         39.20       (3.04)
Job tenure (years)           6.14       (5.41)
Total work experie~)        12.70       (4.61)
Usual hours worked          37.26      (10.21)
----------------------------------------------
Union
Hourly wage                  8.68       (4.17)
Age in current year         39.28       (3.02)
Job tenure (years)           7.89       (6.11)
Total work experie~)        13.26       (4.56)
Usual hours worked          38.69       (9.10)
----------------------------------------------

Pros: Clean, formatted tables. Uses variable labels. Easy to customize.

Cons: Still no difference column or significance test. But we’re getting there…

Method 5: `estpost ttest` + `esttab` (The Balance Table)

This is the most important method in this worksheet. estpost ttest runs t-tests for all your variables at once and stores everything. Then esttab can display group means, differences, standard errors, significance stars, and p-values all in one clean table.

estpost ttest $sumvars, by(union)

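This stores several matrices in e(), each with one entry per variable: mu_1 (means for the first group, here union == 0), mu_2 (means for the second group, here union == 1), b (difference, mu_1 minus mu_2), se (standard error of the difference), p (two-sided p-value), N_1 and N_2 (sample sizes), and more.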
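If you want to look at these stored results before formatting them, the standard ereturn list and matrix list commands work after estpost ttest. A short sketch, assuming the estout package is already installed:

```stata
sysuse nlsw88, clear
quietly estpost ttest wage age tenure ttl_exp hours, by(union)

* Show everything estpost ttest left behind in e()
ereturn list

* Inspect individual matrices: first-group means and two-sided p-values
matrix list e(mu_1)
matrix list e(p)
```

Any matrix name that appears in ereturn list can be used as a column inside cells().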

Version A: Means + Difference with Stars

esttab ., cells("mu_1(fmt(2)) mu_2(fmt(2)) b(fmt(2) star)") ///
    wide noobs nonumber star(* 0.10 ** 0.05 *** 0.01) ///
    collabels("Non-Union" "Union" "Difference") label

--------------------------------------------------------------

                        Non-Union        Union   Difference
--------------------------------------------------------------
Hourly wage                  7.20         8.67        -1.47***
Age in current year         39.21        39.28        -0.08
Job tenure (years)           6.14         7.89        -1.75***
Total work experie~)        12.68        13.25        -0.58**
Usual hours worked          37.26        38.66        -1.40***
--------------------------------------------------------------

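This is already very useful! The Difference column is the non-union mean minus the union mean, so a negative value means union workers have the higher average. You can immediately see that union workers have higher wages, more tenure, more experience, and work more hours. Age is balanced (no significant difference). The stars tell you significance at a glance.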

Version B: Add Standard Errors of the Difference

esttab ., cells("mu_1(fmt(2)) mu_2(fmt(2)) b(fmt(2) star) se(fmt(2) par)") ///
    wide noobs nonumber star(* 0.10 ** 0.05 *** 0.01) ///
    collabels("Non-Union" "Union" "Difference" "SE") label

---------------------------------------------------------------------------

                        Non-Union        Union   Difference              SE
---------------------------------------------------------------------------
Hourly wage                  7.20         8.67        -1.47***       (0.22)
Age in current year         39.21        39.28        -0.08          (0.16)
Job tenure (years)           6.14         7.89        -1.75***       (0.30)
Total work experie~)        12.68        13.25        -0.58**        (0.25)
Usual hours worked          37.26        38.66        -1.40***       (0.53)
---------------------------------------------------------------------------

Version C: With p-values Instead of Stars

esttab ., cells("mu_1(fmt(2)) mu_2(fmt(2)) b(fmt(2)) p(fmt(3))") ///
    wide noobs nonumber ///
    collabels("Non-Union" "Union" "Difference" "p-value") label

------------------------------------------------------------------------

                        Non-Union        Union   Difference      p-value
------------------------------------------------------------------------
Hourly wage                  7.20         8.67        -1.47        0.000
Age in current year         39.21        39.28        -0.08        0.628
Job tenure (years)           6.14         7.89        -1.75        0.000
Total work experie~)        12.68        13.25        -0.58        0.019
Usual hours worked          37.26        38.66        -1.40        0.009
------------------------------------------------------------------------

Method 5b: Adding Sample Sizes to the Balance Table

You might want to show the N for each group directly in the table. Good news: estpost ttest stores N_1 and N_2 as matrices (one value per variable), so you can include them directly as cells in esttab.

estpost ttest $sumvars, by(union)

esttab ., cells("mu_1(fmt(2)) N_1(fmt(0)) mu_2(fmt(2)) N_2(fmt(0)) b(fmt(2) star) se(fmt(2) par)") ///
    wide noobs nonumber star(* 0.10 ** 0.05 *** 0.01) ///
    collabels("Non-Union" "N" "Union" "N" "Difference" "SE") ///
    label

----------------------------------------------------------------------------------

                        Non-Union     N        Union     N   Difference         SE
----------------------------------------------------------------------------------
Hourly wage                  7.20  1417         8.67   461       -1.47***   (0.22)
Age in current year         39.21  1417        39.28   461       -0.08      (0.16)
Job tenure (years)           6.14  1408         7.89   460       -1.75***   (0.30)
Total work experie~)        12.68  1417        13.25   461       -0.58**    (0.25)
Usual hours worked          37.26  1416        38.66   461       -1.40***   (0.53)
----------------------------------------------------------------------------------

This is the full balance table: group means, sample sizes, difference, standard error, and significance stars all in one table.

Method 6: Export to Word (.rtf)

Once you’re happy with how the table looks, export it to an RTF file you can open in Word.

estpost ttest $sumvars, by(union)

esttab . using "/your/filepath/balance_table.rtf", replace ///
    cells("mu_1(fmt(2)) N_1(fmt(0)) mu_2(fmt(2)) N_2(fmt(0)) b(fmt(2) star) se(fmt(2) par)") ///
    wide noobs nonumber star(* 0.10 ** 0.05 *** 0.01) ///
    collabels("Non-Union" "N" "Union" "N" "Difference" "SE") ///
    label ///
    title("Table 1: Summary Statistics by Union Status") ///
    addnotes("Data: National Longitudinal Survey of Women, 1988" ///
             "Stars: * p<0.10, ** p<0.05, *** p<0.01")

This creates a file called balance_table.rtf. Open it in Microsoft Word and it will look like a nicely formatted table. Make sure to change "/your/filepath/" to wherever you want to save the file.

You can also export to CSV (for Excel) or LaTeX (for academic papers) by changing the file extension:

* For CSV (Excel):
esttab . using "/your/filepath/balance_table.csv", replace ...

* For LaTeX:
esttab . using "/your/filepath/balance_table.tex", replace ...
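When exporting the same table to several formats, one possible pattern is to keep the column layout in a local macro so that every call uses identical cells. The sketch below is an assumption about how you might organize this, not the only way: the file names are hypothetical and are written to the current working directory, and booktabs is an optional esttab setting for LaTeX output. Run it as one block in a do-file, since local macros do not persist between separately executed lines.

```stata
sysuse nlsw88, clear
estpost ttest wage age tenure ttl_exp hours, by(union)

* Store the column layout once (local macro name is arbitrary)
local cells "mu_1(fmt(2)) mu_2(fmt(2)) b(fmt(2) star) se(fmt(2) par)"

* CSV version (hypothetical file name)
esttab . using "balance_table.csv", replace cells("`cells'") ///
    wide noobs nonumber star(* 0.10 ** 0.05 *** 0.01) ///
    collabels("Non-Union" "Union" "Difference" "SE") label

* LaTeX version (hypothetical file name), with booktabs-style rules
esttab . using "balance_table.tex", replace cells("`cells'") ///
    wide noobs nonumber star(* 0.10 ** 0.05 *** 0.01) ///
    collabels("Non-Union" "Union" "Difference" "SE") label booktabs
```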

Method 7: `iebaltab` (The One-Liner Balance Table)

iebaltab is a command from the World Bank’s ietoolkit package. It was designed specifically for creating balance tables in impact evaluations. One line of code gives you everything: group means, standard errors (or SDs), sample sizes, difference in means, and significance stars.

First, install it:

ssc install ietoolkit, replace

Basic usage

iebaltab wage age tenure ttl_exp hours, grpvar(union)

That single line produces a full balance table. By default it shows: N and Mean/(SE) for each group, the total N, and the mean difference with significance stars. When run, iebaltab opens the table in the data browser. The table looks like this:

+------------------------------------------------------------------------------------+
| Variable         |    N  | Mean/(SE) |    N  | Mean/(SE) |    N  | Mean difference |
|                  |       | Nonunion  |       |   Union   |       | Pairwise t-test |
+------------------------------------------------------------------------------------+
| wage             | 1417  |     7.205 |  461  |     8.674 | 1878  |      -1.470***  |
|                  |       |   (0.109) |       |   (0.194) |       |                 |
| age              | 1417  |    39.205 |  461  |    39.284 | 1878  |         -0.079  |
|                  |       |   (0.081) |       |   (0.141) |       |                 |
| tenure           | 1408  |     6.141 |  460  |     7.888 | 1868  |      -1.747***  |
|                  |       |   (0.144) |       |   (0.285) |       |                 |
| ttl_exp          | 1417  |    12.677 |  461  |    13.254 | 1878  |       -0.577**  |
|                  |       |   (0.123) |       |   (0.212) |       |                 |
| hours            | 1416  |    37.262 |  461  |    38.659 | 1877  |      -1.397***  |
|                  |       |   (0.272) |       |   (0.424) |       |                 |
+------------------------------------------------------------------------------------+

Using variable labels instead of variable names

Add rowvarlabels to display the full variable labels:

iebaltab wage age tenure ttl_exp hours, grpvar(union) rowvarlabels

Now “wage” becomes “Hourly wage”, “ttl_exp” becomes “Total work experience (years)”, etc.

Showing Standard Deviations instead of Standard Errors

By default, iebaltab shows standard errors in parentheses under each mean. If you want standard deviations instead (which is more common for descriptive/summary stats), use the stats() option:

iebaltab wage age tenure ttl_exp hours, grpvar(union) stats(desc(sd)) rowvarlabels

+-------------------------------------------------------------------------------------------------+
| Variable                      |    N  | Mean/(SD) |    N  | Mean/(SD) |    N  | Mean difference |
|                               |       | Nonunion  |       |   Union   |       | Pairwise t-test |
+-------------------------------------------------------------------------------------------------+
| Hourly wage                   | 1417  |     7.205 |  461  |     8.674 | 1878  |      -1.470***  |
|                               |       |   (4.104) |       |   (4.175) |       |                 |
| Age in current year           | 1417  |    39.205 |  461  |    39.284 | 1878  |         -0.079  |
|                               |       |   (3.039) |       |   (3.022) |       |                 |
| Job tenure (years)            | 1408  |     6.141 |  460  |     7.888 | 1868  |      -1.747***  |
|                               |       |   (5.414) |       |   (6.105) |       |                 |
| Total work experience (years) | 1417  |    12.677 |  461  |    13.254 | 1878  |       -0.577**  |
|                               |       |   (4.616) |       |   (4.554) |       |                 |
| Usual hours worked            | 1416  |    37.262 |  461  |    38.659 | 1877  |      -1.397***  |
|                               |       |  (10.227) |       |   (9.110) |       |                 |
+-------------------------------------------------------------------------------------------------+

Notice the parentheses now show SD values (e.g., 4.104 for wage) instead of SE values (0.109).

Exporting to LaTeX or CSV

You can export directly to LaTeX (for academic papers) or CSV (for Excel):

* Export to LaTeX
iebaltab wage age tenure ttl_exp hours, grpvar(union) rowvarlabels ///
    savetex("/your/filepath/balance_table.tex") replace texnotewidth(1)

* Export to CSV (opens in Excel)
iebaltab wage age tenure ttl_exp hours, grpvar(union) rowvarlabels ///
    savecsv("/your/filepath/balance_table.csv") replace

Pros: One command does everything. Built-in N per group, means, SE/SD, difference, stars. Designed for balance tables. Exports directly to LaTeX or CSV.

Cons: Requires installing ietoolkit. Less flexible than the estpost ttest + esttab approach for custom column layouts. The stats() syntax can be finicky.

Method 8: `table` command (Stata 17+)

Stata 17 introduced a completely revamped table command. It can compute statistics by groups and is deeply integrated with the collect framework for export.

Basic table of means by group

table union, statistic(mean wage age tenure ttl_exp hours) nformat(%9.2f)

-----------------------------------------------------------
             |  Hourly   Age in    Job      Total work   Usual
             |  wage     current   tenure   experience   hours
             |           year      (years)  (years)      worked
-------------+---------------------------------------------
Union worker |
  Nonunion   |    7.20    39.21     6.14      12.68      37.26
  Union      |    8.67    39.28     7.89      13.25      38.66
  Total      |    7.57    39.22     6.57      12.82      37.61
-----------------------------------------------------------

Adding standard deviations

table union, statistic(mean wage age tenure ttl_exp hours) ///
    statistic(sd wage age tenure ttl_exp hours) nformat(%9.2f)

This produces a wider table with both means and SDs for each variable. The output looks best in the Stata results window (the console log wraps the wide table).

Adding counts (sample sizes)

table union, statistic(mean wage age tenure ttl_exp hours) ///
    statistic(count wage age tenure ttl_exp hours) ///
    nformat(%9.2f mean) nformat(%9.0f count) totals(union)

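The nformat() options let you format different statistics differently (2 decimals for means, 0 for counts). The totals(union) option explicitly requests the total over union; table reports that Total row by default (it already appears in the first table above), and nototals removes it.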

Exporting the table

The table command stores its results in a collection (part of the collect framework), so you can export the table you just built with collect export:

table union, statistic(mean wage age tenure ttl_exp hours) ///
    statistic(sd wage age tenure ttl_exp hours) nformat(%9.2f)
collect export "/your/filepath/my_table.html", replace

You can export to .html, .docx, .xlsx, .tex, or .pdf.

Pros: Built into Stata (no packages needed). Very flexible with the collect framework. Can export to many formats including Word and Excel. Clean syntax.

Cons: No built-in difference-in-means or significance test. The layout is “wide” (variables as columns, groups as rows), which is the opposite of most balance tables. Not ideal for the specific “balance table” use case.
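The wide layout mentioned above can be flipped by naming the dimensions explicitly. In the sketch below, var is table's built-in dimension for the variables listed in statistic(), so putting it first places one variable per row and the union groups in columns. Only means are requested here to keep the layout simple.

```stata
sysuse nlsw88, clear

* Rows: one per variable (the var dimension); columns: union groups
table (var) (union), statistic(mean wage age tenure ttl_exp hours) ///
    nformat(%9.2f)
```

This gets closer to the usual balance-table orientation, though it still has no difference or test column.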

Method 9: `dtable` (Stata 18+)

Stata 18 introduced dtable, which is purpose-built for descriptive statistics tables. It’s the easiest built-in way to get group comparisons with a significance test.

Basic dtable by group

dtable wage age tenure ttl_exp hours, by(union) nformat(%9.2f)

--------------------------------------------------------------------------------
                                                 Union worker
                                  Nonunion          Union            Total
--------------------------------------------------------------------------------
N                             1417.00 (75.45%) 461.00 (24.55%) 1878.00 (100.00%)
Hourly wage                        7.20 (4.10)     8.67 (4.17)       7.57 (4.17)
Age in current year               39.21 (3.04)    39.28 (3.02)      39.22 (3.03)
Job tenure (years)                 6.14 (5.41)     7.89 (6.11)       6.57 (5.64)
Total work experience (years)     12.68 (4.62)    13.25 (4.55)      12.82 (4.61)
Usual hours worked               37.26 (10.23)    38.66 (9.11)      37.61 (9.98)
--------------------------------------------------------------------------------

That’s very clean! Each cell shows mean (sd) by default. It also includes N and percentages for each group at the top.
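In the table above the N row is printed with two decimals because nformat(%9.2f) was applied to every statistic. One way around this, sketched below, is to list the statistics the format should apply to; here it is restricted to mean and sd so that the N row keeps dtable's default formatting.

```stata
sysuse nlsw88, clear

* Two decimals for means and SDs only; counts keep their default format
dtable wage age tenure ttl_exp hours, by(union) nformat(%9.2f mean sd)
```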

Adding a significance test

Just add tests inside the by() option:

dtable wage age tenure ttl_exp hours, by(union, tests) nformat(%9.2f)

--------------------------------------------------------------------------------------
                                                    Union worker
                                  Nonunion          Union            Total        Test
--------------------------------------------------------------------------------------
N                             1417.00 (75.45%) 461.00 (24.55%) 1878.00 (100.00%)
Hourly wage                        7.20 (4.10)     8.67 (4.17)       7.57 (4.17) <0.00
Age in current year               39.21 (3.04)    39.28 (3.02)      39.22 (3.03)  0.63
Job tenure (years)                 6.14 (5.41)     7.89 (6.11)       6.57 (5.64) <0.00
Total work experience (years)     12.68 (4.62)    13.25 (4.55)      12.82 (4.61)  0.02
Usual hours worked               37.26 (10.23)    38.66 (9.11)      37.61 (9.98)  0.01
--------------------------------------------------------------------------------------

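Now you get a Test column with p-values. You can immediately see that wage, tenure, experience, and hours are significantly different between groups, while age is not (p = 0.63). These values agree with the p-values from estpost ttest in Method 5 (Version C), just rounded to two decimals.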

Customizing the statistics shown

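You can specify exactly which statistics to display for continuous variables. The call below names mean and sd, which matches the default cell contents shown above; change the list inside stat() to show other statistics: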

dtable wage age tenure ttl_exp hours, by(union, tests) ///
    continuous(wage age tenure ttl_exp hours, stat(mean sd)) ///
    nformat(%9.2f)

Exporting

dtable can export directly:

dtable wage age tenure ttl_exp hours, by(union, tests) nformat(%9.2f) ///
    export("/your/filepath/my_dtable.html", replace)

Supported formats: .html, .docx, .xlsx, .tex, .pdf.

Pros: Built into Stata 18 (no packages). One command gives you means, SDs, N, and p-values. Clean, compact output. Easy export.

Cons: Only available in Stata 18+. Doesn’t show the raw difference in means as a separate column (only the p-value). Less customizable than estpost ttest + esttab. No significance stars (uses p-values instead).
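If you need the raw difference alongside a dtable, one workaround is to regress each variable on the group indicator: the coefficient on union is the difference in means. This is a sketch of that idea rather than a dtable feature. Note that the sign is flipped relative to the ttest output, because the coefficient is the union mean minus the non-union mean.

```stata
sysuse nlsw88, clear

display "Variable" _col(12) "Diff" _col(24) "p-value"
foreach var of varlist wage age tenure ttl_exp hours {
    quietly regress `var' i.union
    * _b[1.union] is mean(Union) - mean(Nonunion)
    local p = 2 * ttail(e(df_r), abs(_b[1.union] / _se[1.union]))
    display "`var'" _col(12) %8.3f _b[1.union] _col(24) %6.4f `p'
}
```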