The National Center for Health Statistics provides this data as flat files and through a well-documented API. The dataset includes crude birth rates and general fertility rates in the United States since 1909. This particular dataset has 107 observations accounting for year, number of births, crude birth rate, and general fertility rate.
Example in Python
import pandas as pd

# See all observations in 1909.
df = pd.read_json('https://data.cdc.gov/resource/tndt-s2gv.json?year=1909')
Example in R
library(RSocrata)

# See all observations in 1909.
df <- read.socrata("https://data.cdc.gov/resource/tndt-s2gv.json?year=1909")
The Consumer Price Index (CPI) is a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services. Indexes are available for the U.S. and various geographic areas. Average price data for select utility, automotive fuel, and food items are also available. All data is available in flat files and through the Bureau of Labor Statistics API.
This large dataset is segmented into four groups.
import requests
import pandas as pd

# All items in U.S. city average, all urban consumers, seasonally adjusted
response = requests.get('https://api.bls.gov/publicAPI/v2/timeseries/data/CUSR0000SA0')
data = response.json()
df = pd.DataFrame(data['Results']['series'][0]['data'])
df.value = df.value.astype('float')

# See the average price by year
print(df.groupby('year')['value'].mean())
library(dplyr)
library(jsonlite)

# All items in U.S. city average, all urban consumers, seasonally adjusted
data <- fromJSON("https://api.bls.gov/publicAPI/v2/timeseries/data/CUSR0000SA0")
df <- data[["Results"]][["series"]][["data"]][[1]]
df$value <- as.numeric(df$value)

# See the average price by year
df %>% group_by(year) %>% summarise(avg_rate = mean(value))
This dataset of U.S. mortality trends since 1900 highlights the differences in age-adjusted death rates and life expectancy at birth by race and sex. This particular dataset has 1,044 observations accounting for year, age, race, sex, average life expectancy in years, and mortality.
The National Center for Health Statistics provides this data as flat files and through a well-documented API.
import pandas as pd

df = pd.read_json('https://data.cdc.gov/resource/bgqx-uh4z.json')

# See average life expectancy for all races and both sexes by year
df[(df.race == 'All Races') & (df.sex == 'Both Sexes')][['average_life_expectancy', 'year']]
library(RSocrata)

df <- read.socrata("https://data.cdc.gov/resource/bgqx-uh4z.json")

# See average life expectancy for all races and both sexes by year
df[df$race == "All Races" & df$sex == "Both Sexes", c("average_life_expectancy", "year")]
The National Center for Health Statistics publishes this data at the county and state level as flat files and through an API. Each dataset describes drug poisoning deaths by selected demographic characteristics, and includes age-adjusted death rates for drug poisoning from 1999 to 2015. The county level dataset has 8 variables/columns and 53,387 rows, which are described in the API Documentation. The state level dataset also has full documentation of its 18 variables/columns and 2,703 rows.
import pandas as pd

# Get all mortality levels for the state of Texas.
df = pd.read_json("https://data.cdc.gov/resource/tenp-43rk.json?st=TX")
library(RSocrata)

# Get all mortality levels for the state of Texas.
df <- read.socrata("https://data.cdc.gov/resource/tenp-43rk.json?st=TX")
This dataset describes injury mortality in the United States beginning in 1999. Two concepts are included in the circumstances of an injury death: intent of injury and mechanism of injury. This particular dataset has 17 variables/columns and 98,280 rows.
Documentation on the latest version of this dataset, with complete information on variables, data sources, the dataset identifier, definitions, and classifications, can be found in the API docs here.
import pandas as pd

# Get all injury mechanisms for mortality in the United States
df = pd.read_json("https://data.cdc.gov/resource/6j4j-ispt.json")
library(RSocrata)

# Get all injury mechanisms for mortality in the United States
df <- read.socrata("https://data.cdc.gov/resource/6j4j-ispt.json")
This data set comes from the Current Population Survey (CPS), a monthly survey of households conducted by the U.S. Census Bureau for the Bureau of Labor Statistics. This large dataset covers the years 1995-1999 and 2002-2017. All data is available in HTML, PDF, and XLSX flat formats, as well as through the Bureau of Labor Statistics API.
The 57 data tables are grouped together into categories.
A full list of tables and variables for the Current Population Survey can be found here.
import requests
import pandas as pd

# Seasonally Adjusted Unemployment Rate
response = requests.get('https://api.bls.gov/publicAPI/v2/timeseries/data/LNS14000000')
data = response.json()
df = pd.DataFrame(data['Results']['series'][0]['data'])
df.value = df.value.astype('float')

# See the average rate by year
print(df.groupby('year')['value'].mean())
library(dplyr)
library(jsonlite)

# Seasonally Adjusted Unemployment Rate
data <- fromJSON("https://api.bls.gov/publicAPI/v2/timeseries/data/LNS14000000")
df <- data[["Results"]][["series"]][["data"]][[1]]
df$value <- as.numeric(df$value)

# See the average rate by year
df %>% group_by(year) %>% summarise(avg_rate = mean(value))
Labor productivity is a measure of economic performance that compares the amount of goods and services produced (output) with the number of hours worked to produce those goods and services. The BLS also publishes measures of multifactor productivity.
The data is organized into two separate databases: Major Sector Productivity and Costs, and Industry Productivity. Both databases are available as flat files and through the Bureau of Labor Statistics API.
import requests
import pandas as pd

# Office of Productivity and Technology, Percent/Rate/Ratio, Productivity: Nonfarm Business
response = requests.get('https://api.bls.gov/publicAPI/v2/timeseries/data/PRS85006092')
data = response.json()
df = pd.DataFrame(data['Results']['series'][0]['data'])
df.value = df.value.astype('float')

# See the rate change by quarter
print(df.sort_values(['year', 'period'])[['year', 'period', 'value']])
library(dplyr)
library(jsonlite)

# Office of Productivity and Technology, Percent/Rate/Ratio, Productivity: Nonfarm Business
data <- fromJSON("https://api.bls.gov/publicAPI/v2/timeseries/data/PRS85006092")
df <- data[["Results"]][["series"]][["data"]][[1]]
df$value <- as.numeric(df$value)

# See the rate change by quarter
print(df[order(df$year, df$period), c("year", "period", "value")])
This dataset presents the age-adjusted death rates for the 10 leading causes of death in the United States beginning in 1999. This particular dataset has 10,296 observations describing the year, state, cause of death, number of deaths, and age-adjusted death rate. Documentation on the latest version of this dataset, with complete information on variables, data sources, the dataset identifier, definitions, and classifications, can be found in the API docs here.
import pandas as pd

# All records for the state of Alabama
df = pd.read_json('https://data.cdc.gov/resource/u4d7-xz8k.json?state=Alabama')
library(RSocrata)

# All records for the state of Alabama
df <- read.socrata("https://data.cdc.gov/resource/u4d7-xz8k.json?state=Alabama")
The Global Health Observatory data repository is the World Health Organization’s gateway to health-related statistics for its 194 Member States. It provides access to over 1000 indicators on priority health topics including mortality and burden of diseases, the Millennium Development Goals (child nutrition, child health, maternal and reproductive health, immunization, HIV/AIDS, tuberculosis, malaria, neglected diseases, water and sanitation), noncommunicable diseases and risk factors, epidemic-prone diseases, health systems, environmental health, violence and injuries, and health equity, among others.
Many of these datasets represent the best estimates of WHO using methodologies for specific indicators that aim for comparability across countries and time. Please check the Indicator and Measurement Registry for indicator specific information. Additional metadata and definitions can be found here. The World Health Organization also provides examples of API usage.
import requests
import numpy as np
import pandas as pd

base = ('http://apps.who.int/gho/athena/data/GHO/{}'
        '.json?profile=simple&filter=COUNTRY:*')
ad_restrictions = 'SA_0000001515'

def get_data(code):
    response = requests.get(base.format(code))
    data = get_data_helper(response.json())
    return pd.DataFrame(data)

def get_data_helper(who_dictionary):
    '''Pads the dictionary so entries are same length'''
    rv = []
    fact_table = who_dictionary['fact']
    for observation in fact_table:
        new_row = observation['dim']
        new_row['VALUE'] = observation['Value']
        rv.append(new_row)
    return rv

def clean_who_data(df):
    df['ad_type'] = df.ADVERTISINGTYPE.apply(lambda s: s.replace(' Ads', ''))
    df.drop(['GHO', 'PUBLISHSTATE', 'ADVERTISINGTYPE'], axis=1, inplace=True)
    df.replace('', np.nan, inplace=True)

if __name__ == "__main__":
    df = get_data(ad_restrictions)
    clean_who_data(df)
    print('Preview:')
    print(df.head())
    df.to_csv('WHO_ad_data.csv', index=False)
library(readr)
library(dplyr)
library(ggplot2)

# Flat file downloaded from:
# http://apps.who.int/gho/data/view.main.REGION2480A?lang=en
data <- read_csv('obesity.csv')
by.region <- data %>%
  group_by(REGION, YEAR) %>%
  summarize(mean.Numeric = mean(Numeric) / 100)

ggplot(by.region, aes(x = YEAR, y = mean.Numeric, color = REGION)) +
  geom_point() +
  geom_line() +
  ggtitle('Obesity Rates Over Time', subtitle = 'Grouped By Region') +
  ylab('Average Obesity Rate')
The Small Business Administration Survey records general characteristics of small businesses in the United States, such as the number of employees, industry, number of locations, and paid wages. It also covers demographic information about owners, such as marital status and ethnicity. The data provided below is for the year 1992.
Survey 1 Download | Documentation
Survey 2 Download | Documentation
import pandas as pd

# read_table works if the file is plain text; if the .dta file is a
# Stata binary, use pd.read_stata instead.
df = pd.read_table('sbaraw-s1.dta')
# read.table works if the file is plain text; if the .dta file is a
# Stata binary, use haven::read_dta instead.
df <- read.table("sbaraw-s1.dta")
The Statistical Abstract of the United States, published from 1878 to 2012, is the authoritative and comprehensive summary of statistics on the social, political, and economic organization of the United States. It is designed to serve as a convenient volume for statistical reference, and as a guide to other statistical publications and sources both in print and on the Web. These sources of data include the U.S. Census Bureau, Bureau of Labor Statistics, Bureau of Economic Analysis, and many other Federal agencies and private organizations.
The documentation is segmented by year, and then separated into parts. For example, the documentation for 1994 can be found here.
import pandas as pd

# Download variables of interest from the data portal.
# You can load the data file like any text file.
df = pd.read_table('default.dat')
# Download variables of interest from the data portal.
# You can load the data file like any text file.
df <- read.table("default.dat")
This dataset assembles all final birth data for females aged 15–19, 15–17, and 18–19 for the United States and each of the 50 states. This particular dataset notes the year, state, age of mother, and relevant birthrates for 4,212 observations.
import pandas as pd

# Observations where the state teen birth rate is 37.5 births per 1,000 females
df = pd.read_json('https://data.cdc.gov/resource/sgfp-ytm5.json?$where=state_rate=37.5')
library(RSocrata)

# Observations where the state teen birth rate is 37.5 births per 1,000 females
df <- read.socrata('https://data.cdc.gov/resource/sgfp-ytm5.json?$where=state_rate=37.5')
Researchers can produce their own time-use estimates using the ATUS microdata files. The ATUS data files include information for over 190,000 respondents total from 2003 to 2017. Because of the size of these data files, it is easiest to work with them using statistical software such as Stata, SAS, or SPSS.
The survey is sponsored by the Bureau of Labor Statistics and is conducted by the U.S. Census Bureau.
The major purpose of ATUS is to develop nationally representative estimates of how people spend their time. The survey also provides information on the amount of time people spend in many other activities, such as religious activities, socializing, exercising, and relaxing. Demographic information such as sex, race, age, educational attainment, etc. is also available for each respondent.
Microdata | Data Dictionary | User Guide
import pandas as pd

mapping = {1: 'New England',
           2: 'Middle Atlantic',
           3: 'East North Central',
           4: 'West North Central',
           5: 'South Atlantic',
           6: 'East South Central',
           7: 'West South Central',
           8: 'Mountain',
           9: 'Pacific'}

df = pd.read_table('atuscps_2017.dat', delimiter=',')
df['division'] = df['GEDIV'].map(mapping)

# See number of housing units by geographic division.
print(pd.crosstab(df.division, df.HEHOUSUT))
df <- read.csv("atuscps_2017.dat")
df$GEDIV <- factor(df$GEDIV)
levels(df$GEDIV) <- c("New England",
                      "Middle Atlantic",
                      "East North Central",
                      "West North Central",
                      "South Atlantic",
                      "East South Central",
                      "West South Central",
                      "Mountain",
                      "Pacific")

# See number of housing units by geographic division.
table(df$GEDIV, df$HEHOUSUT)
Analyze Boston is the City of Boston’s open data hub for facts, figures, and maps related to life within the city. The city is working to make it the default technology platform supporting publication of the City’s public information, in the form of data, and to make this information easy to find, access, and use by a broad audience. The platform is managed by the Citywide Analytics Team.
Each dataset from Analyze Boston typically has metadata and relevant information. For example, this dataset from ParkBoston.
import pandas as pd

df = pd.read_csv('park-boston-monthly-transactions-by-zone-2015.csv')

# Remove trailing whitespace in column names.
df.columns = [c.strip() for c in df.columns]

# See the 20 most used parking zones in January.
print(df.sort_values('January', ascending=False)['Zone Name'].head(20))
df <- read.csv("park-boston-monthly-transactions-by-zone-2015.csv")

# See the 20 most used parking zones in January.
head(df$Zone.Name[order(df$January, decreasing = TRUE)], 20)
This API makes all data available from the Department of Education’s College Scorecard, as well as supporting data on student completion, debt and repayment, earnings, and more. The files include data from 1996 through 2016 for all undergraduate degree-granting institutions of higher education. Data includes institution-level characteristics such as average cost of attendance and retention rates for first-time students, as well as student characteristics such as the ethnicity and age of the student body. The full documentation and data dictionary can be found here.
import requests

key = 'YOUR API KEY HERE'
base = ('https://api.data.gov/ed/collegescorecard/v1/'
        'schools?school.name=chicago&api_key=')
response = requests.get(base + key)
data = response.json()['results']

# See all schools in Chicago
for observation in data:
    print(observation['school']['name'])
library(httr)

key <- "YOUR API KEY HERE"
base <- paste("https://api.data.gov/ed/collegescorecard/v1/",
              "schools?school.name=chicago&api_key=",
              sep = "")
response <- GET(paste(base, key, sep = ""))
data <- content(response, "parsed")[["results"]]

# See all schools in Chicago
for (observation in data) {
  print(observation[["school"]][["name"]])
}
The U.S. Energy Information Administration makes its data free and open through an API, bulk file downloads, Excel / Google Sheets add-ons, and pluggable online widgets. EIA’s API contains datasets centered around hourly electricity operations, state energy systems, petroleum products, crude imports, natural gas, coal, international energy, and the short-term and annual energy outlooks. While the API is offered as a free public service, registration is required. The EIA also provides a number of related resources.
import requests
import pandas as pd

# See MMBtu by year for the plant in Tracy, Nevada.
url = ('http://api.eia.gov/series/?api_key=YOUR_API_KEY&'
       'series_id=ELEC.PLANT.CONS_EG_BTU.2336-ALL-ALL.A')
response = requests.get(url)
info = response.json()
df = pd.DataFrame(info['series'][0]['data'], columns=['Year', 'MMBtu'])
library(httr)
library(purrr)

# See MMBtu by year for the plant in Tracy, Nevada.
url <- paste("http://api.eia.gov/series/?api_key=",
             "YOUR_API_KEY&",
             "series_id=ELEC.PLANT.CONS_EG_BTU.2336-ALL-ALL.A",
             sep = "")
response <- GET(url)
data <- content(response, "parsed")[["series"]][[1]][["data"]]
years <- unlist(transpose(data)[[1]])
MMBtu <- unlist(transpose(data)[[2]])
df <- data.frame(years, MMBtu)
The FBI Crime Data API is a read-only web service that returns JSON or CSV data. It is broadly organized around the FBI’s Uniform Crime Reporting (UCR) data and requires a data.gov API key. Agencies submit data using one of two reporting formats: the Summary Reporting System (SRS) or the National Incident-Based Reporting System (NIBRS).
The FBI also provides full documentation and source code.
import requests
import pandas as pd

base = 'https://api.usa.gov/crime/fbi/sapi/'
query = 'api/summarized/agencies/WY0200100/homicide?api_key='
key = 'your_api_key'

# Homicides recorded by the Jackson Police Department
response = requests.get(base + query + key)
data = response.json()
df = pd.DataFrame(data['results'])
library(jsonlite)

# Homicides recorded by the Jackson Police Department
url <- paste("https://api.usa.gov/crime/fbi/sapi/",
             "api/summarized/agencies/WY0200100/homicide?api_key=",
             "your_api_key",
             sep = "")
data <- fromJSON(url)
df <- data[["results"]]
This data product provided by the USDA contains statistics on four main feed grains (corn, grain sorghum, barley, and oats), as well as foreign coarse grains such as rye and millet, hay, and related items. This includes data published in the monthly Feed Outlook and the previously annual Feed Yearbook. Data are monthly, quarterly, and/or annual depending upon the data series.
The latest data may be preliminary or projected. Missing values indicate unreported values, discontinued series, or data not yet released. The data is available as a bulk download from here.
import pandas as pd

df = pd.read_excel('Feed Grains Yearbook Tables-Recent.xls',
                   sheet_name='FGYearbookTable01',
                   skiprows=[0, 1, 38, 39, 40, 41],
                   names=['commodity', 'year', 'planted', 'harvested',
                          'production', 'yield', 'price', 'loan_rate'],
                   header=None)

# Drop fully empty rows, then forward-fill the commodity labels,
# which only appear on the first row of each group.
df.dropna(how='all', inplace=True)
df.commodity.fillna(method='ffill', inplace=True)
The current version of the Food Environment Atlas has over 275 variables, including new indicators on access and proximity to a grocery store for subpopulations; an indicator on the SNAP Combined Application Project for recipients of Supplemental Security Income (at the State level); and indicators on farmers’ markets that report accepting credit cards or report selling baked and prepared food products. All of the data included in the Atlas are aggregated into an Excel spreadsheet for easy download. These data come from a variety of sources and cover varying years and geographic levels. The documentation for each version of the data provides complete information on definitions and data sources.
import pandas as pd

# After downloading the excel file.
df = pd.read_excel('August2015.xls', sheet_name='PRICES_TAXES')

# See the 10 counties with the highest soda price.
df.sort_values('SODA_PRICE10', ascending=False).head(10)[['State', 'County', 'SODA_PRICE10']]
library(xlsx)

# After downloading the excel file.
df <- read.xlsx("August2015.xls", sheetName = "PRICES_TAXES")

# See the 10 counties with the highest soda price.
head(df[order(df$SODA_PRICE10, decreasing = TRUE),
        c("State", "County", "SODA_PRICE10")], 10)
The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. The survey contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events. The data is available for SPSS and Stata here.
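A minimal sketch for loading a downloaded Stata file with pandas; the filename below is a placeholder, not the actual distribution name.
Example in Python

import pandas as pd

# Load a downloaded GSS Stata extract (the filename is a placeholder).
# convert_categoricals=False keeps the raw numeric codes.
df = pd.read_stata('gss_extract.dta', convert_categoricals=False)
print(df.head())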
IPUMS is not a collection of compiled statistics; it is composed of microdata. Each record is a person, with all characteristics numerically coded. In most samples persons are organized into households, making it possible to study the characteristics of people in the context of their families or other co-residents. Because the data are individual records and not tables, researchers must use a statistical package to analyze the millions of records in the database. A data extraction system enables users to select only the samples and variables they require. Data is received as a gzip file. Data used for publication must be cited. The IPUMS download portal yields a data file as well as command files for SAS, SPSS, Stata, and R. Researchers using R are encouraged to use the ipumsr package (manual).
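As a quick sketch, a CSV-formatted IPUMS extract can be read straight from the gzip file with pandas, which infers the compression from the .gz extension; the filename below is a placeholder.
Example in Python

import pandas as pd

# Read a CSV-formatted extract directly from the gzipped download
# (the filename is a placeholder).
df = pd.read_csv('ipums_extract.csv.gz')
print(df.head())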
This API provides access to data from the Census of Agriculture as well as national, state, and county level surveys. Data is queried by requesting commodities within the sectors of Crops, Animals & Products, Economics, Demographics, and Environmental. The commodity statistics are aggregated for standard census geographies, agricultural statistics districts, and watershed boundaries over annual, seasonal, monthly, weekly, and daily time periods.
Full Documentation, a Data Dictionary, and API Registration can be found here.
Example in Python
import requests
import pandas as pd

key = 'your_api_key'

# Observations for corn in Virginia from 2010 onward
url = ('http://quickstats.nass.usda.gov/api/api_GET/?key={}'
       '&commodity_desc=CORN&year__GE=2010&state_alpha=VA')
response = requests.get(url.format(key))
data = response.json()
df = pd.DataFrame(data['data'])
library(jsonlite)

key <- "your_api_key"

# Observations for corn in Virginia from 2010 onward
url <- paste("http://quickstats.nass.usda.gov/api/api_GET/?key=",
             key,
             "&commodity_desc=CORN&year__GE=2010&state_alpha=VA",
             sep = "")
data <- fromJSON(url)
df <- data[["data"]]
The purpose of this study was to collect extensive information on the sexual experiences and other social, demographic, attitudinal, and health-related characteristics of adults in the United States. The survey collected information on sexual practices with spouses/cohabitants and other sexual partners and collected background information about the partners. Major areas of investigation include sexual experiences such as number of sexual partners in given time periods, frequency of particular practices, and timing of various sexual events. The data cover childhood and adolescence, as well as adulthood. Other topics in the survey relate to sexual victimization, marriage and cohabitation, and fertility. Respondents were also queried about their physical health, including history of sexually transmitted diseases. Respondents’ attitudes toward premarital sex, the appeal of particular practices such as oral sex, and levels of satisfaction with particular sexual relationships were also studied. Demographic items include race, education, political and religious affiliation, income, and occupation.
The codebook can be found here.
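A minimal loading sketch; the Stata format and the filename below are assumptions about the downloaded files, not distribution details from the source.
Example in Python

import pandas as pd

# Load the downloaded study file (Stata format and filename assumed).
df = pd.read_stata('nhsls.dta')
print(df.shape)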
The NLS program provides information on the labor market activities and other significant life events of several groups of men and women at multiple points in time. For more than four decades, NLS data have served as an important tool for economists, sociologists, and other researchers. The NLS program includes several cohorts.
The download functionality for these data sets provides access to files for SPSS, SAS, Stata, and R, or simply a CSV. A tagset, codebook, description file, and log file are also included with a download.
The R, SAS, and SPSS files contain the code needed to load the data set, as well as short explanations of missing values and level names.
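A minimal sketch for the CSV version of a download; the filename below is a placeholder for whatever the download produces.
Example in Python

import pandas as pd

# Load the CSV produced by the download (the filename is a placeholder).
df = pd.read_csv('nls_extract.csv')
print(df.head())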
The Nationwide Readmissions Database is designed to support various types of analyses of national readmission rates. The NRD includes discharges for patients with and without repeat hospital visits in a year and those who have died in the hospital. The criteria to determine the relationship between hospital admissions is left to the analyst using the NRD. This database is compiled by the Agency for Healthcare Research and Quality (AHRQ), which provides a Data Dictionary and Full Documentation.
This data must be purchased; follow this link for more information.
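Once purchased, the core file is large, so a chunked read is a reasonable approach; the CSV format and filename below are assumptions, not the actual distribution details.
Example in Python

import pandas as pd

# Read the purchased core file in chunks, since the full NRD is large
# (the filename and CSV format are assumptions).
chunks = pd.read_csv('nrd_core.csv', chunksize=100000)
for chunk in chunks:
    print(chunk.shape)  # preview the size of the first chunk
    break               # stop after the first chunk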
This data collection consists of a survey of private ownership of firearms by adults in the United States. Respondents who both did and did not own firearms were included. The variables cover topics such as the number and type of guns owned privately, methods of, and reasons for, firearms acquisition, the storage and carrying of guns, the defensive use of firearms against criminal attackers, and reasons for and against firearm ownership. Basic demographic variables include sex, age, education, and employment.
The full codebook can be viewed here.
The public-use data files in this collection are available for access by the general public. Access does not require affiliation with an ICPSR member institution. However, an ICPSR login is still required to download the data itself.
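A minimal loading sketch; the tab-delimited format and the filename below are assumptions about the downloaded files.
Example in Python

import pandas as pd

# Load the downloaded public-use file (tab-delimited format and
# filename are assumptions).
df = pd.read_csv('firearms_survey.tsv', sep='\t')
print(df.head())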
The MyTSA Web Service API supports several features, including Security Checkpoint Wait Times, TSA Pre-Check locations, and sunrise/sunset times for all locations. Data can be queried by state and/or airport. The TSA provides XML files of data in addition to the API with documentation.
Example in Python
import requests

# Returns JSON of today's sunset time for DCA airport
response = requests.get('http://apps.tsa.dhs.gov/MyTSAWebService/'
                        'GetEventInfo.ashx?eventtype=sunset&airportcode=DCA&output=json')
data = response.json()
library(jsonlite)

# Returns JSON of today's sunset time for DCA airport
data <- fromJSON('http://apps.tsa.dhs.gov/MyTSAWebService/GetEventInfo.ashx?eventtype=sunset&airportcode=DCA&output=json')