Primer: Where to find data

Finding data is not always an easy task. The degree of difficulty varies greatly based on the topic, geography, and other dimensions. This document can help ease the process. I provide links to places where you can find datasets, followed by tips on other ways to find data beyond searching the web. If you have suggestions of things to add to this list, send me an email: sebastian.tello@virginia.edu

🔍Data Search Engines

The following websites can be thought of as "Data Search Engines"; that is, you enter a query and the website will show you a set of options of possible data available (either publicly or privately) related to that query.

In some cases, the data is not readily available for the public. However, do not get discouraged!  Sometimes the barriers to entry are as small as sending an e-mail, while in other situations, you will need to fill out a data-request form. Don’t be afraid to ask an organization or institution how you can gain access to their data.

  • ICPSR - A service of the Institute for Social Research at the University of Michigan. This data-search engine is a great resource and will provide you with many good options. While you can find information on other countries, this source is skewed heavily toward the U.S. Queries can be of the type “incarceration” or “mental health”. Once you have found a dataset that you want to download, you will need to create an account and maybe fill out a form on why you need the data. Think of this as a one-time investment rather than a cost you pay for every dataset.
  • ResearchDataGov - In 2019, ICPSR was awarded a contract from the Census Bureau to create ResearchDataGov.org. It is still a proof-of-concept website, so RDG contains only restricted-access data, and only from about 5 agencies at that. They are in the process of building the actual portal, which will include restricted data from 17 federal agencies/programs. (Thanks Lynette Hoelter!)
  • Data.gov - This is the place where all government agencies (at any level, municipal through federal) are supposed to share their public-use data, whereas ResearchDataGov is for restricted-access data. (h/t to @willwheels and Lynette Hoelter!)
  • World Bank Microdata Library - A data search engine hosted by the World Bank.
  • Dataverse - This page is a data repository for papers that already exist. In many cases, if you are interested in the data sources used by a specific paper, Dataverse can provide you with this information (as well as do-files for that data). Like ICPSR, you can just type a search word and try your luck! (You should be aware that the journal itself may have asked the authors to provide a replication kit, which would be available on the paper's online page or on the author’s page.)
  • Quandl - This is another data-search engine; it focuses more on financial data.
  • FRED - A service of the St. Louis Fed. This is a good website for finding data on macroeconomic variables (unemployment, interest rates, inequality indexes, etc.). While it has information on other countries for certain variables, it is mostly U.S.-focused.
  • Google Public Data - This is a great data-search engine for aggregated indicators (think HDI, GDP per capita, etc.) at the country level and over time. It is a great way to see what's available for a given country. Most of the data comes from organizations like the UN, World Bank, etc. It is also a great resource for exploring trends and creating interesting graphs.
  • OECD Data Similar to Google Public Data, but contains more “economic” indicators and has a wider set of topics. Mostly OECD countries. (HT/ @biancafrogner)
  • HealthData.gov - This site is dedicated to making valuable government data discoverable and available to the public, in the hope of better health outcomes for all. You can find data on a wide range of topics, including environmental health, medical devices, Medicare & Medicaid, social services, community health, mental health, and substance abuse. The data is collected and supplied by agencies of the U.S. Department of Health and Human Services as well as state partners. This includes the Centers for Medicare and Medicaid Services, the Centers for Disease Control and Prevention, the Food and Drug Administration, and the Agency for Healthcare Research and Quality, among others. The site also lists all agencies that contribute data to HealthData.gov. (HT/ @biancafrogner)
  • Medicare Data – Similar to the page above but focused on Medicare. (HT/ @biancafrogner)
  • NYC OpenData - (Blurb by Grant R. McDermott): Its mission is to “make the wealth of public data generated by various New York City agencies and other City organizations available for public use”. You can get data on everything from arrests, to the location of wi-fi hotspots, to city job postings, to homeless population counts, to dog licenses, to a directory of toilets in public parks. Grant has written algorithms to scrape them, and you can find them at this link. (HT @grant_mcdermott)
  • JPAL Catalog of Admin Datasets - This is a really cool resource from JPAL. It documents procedures for accessing 35 (and counting) different administrative (admin) datasets, collected in a searchable catalog. Admin data is very useful when running experiments. For example, looking for mental health outcomes related to crime? They’ll lead you to admin data from NY or IL. (HT/ @JPAL_NA)

📃Data lists

The following websites each provide a list of datasets. The main benefit of these sites is that most of their data is relatively "cleaner" than what you would find through the search engines.

  • Census - This is a PDF listing most of the datasets produced under the Census umbrella. Some are public and others are restricted access, but it should give you a flavor of what they have.
  • NBER - This is data provided by the National Bureau of Economic Research. It covers a wide range of topics: Macro, Industry and Productivity, International Trade, Household Surveys, Health Care Data, Demographics, Patents, and Others. Use Control+F to search for general terms, as opposed to specific variables.
  • NBER for members - If you are a member of NBER, you can also access the data at this link. If you are not a member, partnering with an NBER member may give you access to this list.
  • IPUMS - This fantastic institution provides around 10 surveys (mostly U.S.) that have been cleaned and harmonized. The data might be challenging to use in the beginning, but once you get the hang of it, it is very useful (especially when trying to get a quick estimate of a variable). I frequently use IPUMS-CPS (the cleaned version of the Current Population Survey) and IHIS (the cleaned version of the National Health Interview Survey). Note that not all variables found in the original dataset are in the IPUMS data. Hence, if you don't find the variable you are looking for in IPUMS for a specific year, it doesn't mean that it doesn't exist in the original data.
  • Data is plural - This is a mailing list that will send you a newsletter with some odd datasets. Subscribe! You can also find “weird” data in its repository. I recommend checking the repository for more obscure data. It covers a number of topics, so a Control+F search will go a long way.
  • CDC Wonder - This site from the CDC has a list of datasets that they sponsor. It is mostly on health outcomes.
  • CDC Health Data (Others) - This is another list on other health-related datasets.
  • BEA - This is a list of the datasets offered by the Bureau of Economic Analysis.
  • AEA Data List - This is a data list from the American Economic Association.
  • Historical Statistics - Historical data may be hard to find, but this website is a great first step.
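
The IPUMS caveat above (a variable missing from a harmonized extract for some years) is easy to check once you have an extract in hand. The following is a minimal Python sketch, not an IPUMS tool: the column names YEAR and HEALTH_VAR and the sample rows are hypothetical, and it assumes a blank cell means "not available". Real extracts often use variable-specific missing codes, so adjust the check accordingly.

```python
import csv
import io
from collections import defaultdict

# Hypothetical extract: YEAR and HEALTH_VAR are made-up column names,
# and an empty cell stands for "not available" (an assumption).
SAMPLE_EXTRACT = """YEAR,HEALTH_VAR
2001,3
2001,2
2002,
2002,
2003,1
"""

def years_with_variable(csv_text, variable, year_col="YEAR"):
    """Map each year to True if the variable has any non-blank value that year."""
    available = defaultdict(bool)
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = int(row[year_col])
        has_value = row[variable].strip() != ""
        available[year] = available[year] or has_value
    return dict(available)

coverage = years_with_variable(SAMPLE_EXTRACT, "HEALTH_VAR")
# Years where the variable is blank for every row: candidates to look up
# in the original (non-harmonized) survey files.
missing_years = sorted(year for year, ok in coverage.items() if not ok)
print(missing_years)
```

Any year this flags is a prompt to check the original survey documentation, not proof that the variable was never collected.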

🏥 Health Specific

  • Health Policy Tracking - This is not a dataset list per se. It is a database of policy changes by state. It is not comprehensive, so there might be a policy change that is not recorded in this database, but it is a good place to start.
  • Health Policy Research Dataset - This is a very handy dataset that has a lot of information on states’ policy changes and the year in which they occurred. It provides lots of quick “controls” at the state-year level.
  • Area Health Resources Files - The Area Health Resources Files (AHRF) include data on Health Care Professions, Health Facilities, Population Characteristics, Economics, Health Professions Training, Hospital Utilization, Hospital Expenditures, and Environment at the county, state and national levels, from over 50 data sources and over time (2014-2016 publicly available). (HT/ @biancafrogner)
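
State-year policy controls like the ones described above are typically attached to your own data by matching on state and year. The following is a minimal Python sketch under stated assumptions: the state codes ("S1", "S2"), the column names ("outcome", "policy_on"), and all values are hypothetical placeholders, not taken from any of the datasets listed here.

```python
# Hypothetical outcome data at the state-year level (made-up values).
outcomes = [
    {"state": "S1", "year": 2015, "outcome": 9.1},
    {"state": "S1", "year": 2016, "outcome": 8.4},
    {"state": "S2", "year": 2015, "outcome": 7.0},
]

# Hypothetical policy controls keyed by (state, year).
policy_controls = {
    ("S1", 2015): {"policy_on": 0},
    ("S1", 2016): {"policy_on": 1},
}

def attach_controls(rows, controls):
    """Left-join controls onto rows by (state, year), flagging unmatched rows."""
    merged = []
    for row in rows:
        extra = controls.get((row["state"], row["year"]))
        merged.append({**row, **(extra or {}), "matched": extra is not None})
    return merged

merged = attach_controls(outcomes, policy_controls)
unmatched = [(r["state"], r["year"]) for r in merged if not r["matched"]]
print(unmatched)
```

Keeping unmatched rows and flagging them, rather than dropping them, makes it easy to see which state-years the policy database does not cover.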

🗺Development specific

These are all surveys that contain information about several countries, recommended by @DaveEvansPhD.

  • Demographic and Health Surveys (DHS) - The DHS Program has collected, analyzed, and disseminated accurate and representative data on population, health, HIV, and nutrition through more than 400 surveys in over 90 countries.
  • Living Standards Measurement Study - The Living Standards Measurement Study - Integrated Surveys on Agriculture (LSMS-ISA) is a household survey project established with a grant from the Bill and Melinda Gates Foundation and implemented by the LSMS team.
  • Young Lives surveys - The Young Lives datasets from the first five rounds of household and child surveys, school surveys, and Call 1 of the COVID-19 Phone Survey are publicly archived and available to download from the UK Data Service, along with the documentation and questionnaires for each survey round. For users in our study countries, they are also available on CD-Rom, on request from the Principal Investigator.
  • Violence against Children Surveys (VACS) - Currently, over 24 countries in Africa, Asia-Pacific, Latin America, and the Caribbean are actively engaged in critical work to prevent violence against children and youth. Country reports, survey questionnaires, and supporting data are provided for VACS in each country. VACS datasets for many countries are made available for public use, consistent with agreements with country partners. Access to the public-use datasets is coordinated through a partnership with Together for Girls. The Together for Girls Resource Bank also includes the latest reports and tools from the Together for Girls partnership.

Others

  • IPUMS International - IPUMS-International is dedicated to collecting and distributing census data from around the world. The project goals are to collect and preserve data and documentation, harmonize data, and disseminate the harmonized data free of charge. This particular dataset contains harmonized international census data for social science and health research. (HT/ @aiyaranaka)
  • Google BigQuery - Suggested by Grant McDermott. It is unclear to me how to use it yet, but he may have more to explain here.
  • JEP Data Watch Articles - Before 2000, the Journal of Economic Perspectives (JEP) ran articles called “Data Watch”, in which authors summarized a particular dataset they found useful and explained in more detail what the data contain. You can start here, or, if you want to know more about, say, the American Time Use Survey (ATUS), you can go to the particular article that covers it. (HT/ @bartonwillage)
  • Ian McCarthy’s website presents the information on this page, along with other methods material, in a nice design.

💁🏾‍♂️Tips:

  • If you find reports or papers where you see a statistic or a “number” that you need, look in the footnotes or references of the document to see where the authors obtained the data. Some of those papers or reports may have a replication file that is publicly accessible. At some point you may need to contact or email someone. Be polite and use personal judgement when composing e-mails (i.e., do not constantly bombard an individual with lots of emails!).
  • Ask around for data sources. If you have people in your network who are experts on a topic, ask them for their advice.
  • Get creative. If you cannot find the variable that you are looking for, what is another variable that would be a good proxy? Can you find that?
  • Validate your data source. If you find data that you like, check whether anyone else has used it, and for what purpose.