The National Center for Health Statistics provides this data as flat files and through a well-documented API. The dataset includes crude birth rates and general fertility rates in the United States since 1909. This particular dataset has 107 observations accounting for year, number of births, crude birth rate, and general fertility rate.
Example in Python
import pandas as pd

# See all observations in 1909.
df = pd.read_json('https://data.cdc.gov/resource/tndt-s2gv.json?year=1909')
Example in R
library(RSocrata)

# See all observations in 1909.
df <- read.socrata("https://data.cdc.gov/resource/tndt-s2gv.json?year=1909")
The Consumer Price Index (CPI) is a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services. Indexes are available for the U.S. and various geographic areas. Average price data for select utility, automotive fuel, and food items are also available. All data is available in flat files and through the Bureau of Labor Statistics API.
This large dataset is segmented into four groups.
import requests
import pandas as pd

# All items in U.S. city average, all urban consumers, seasonally adjusted
response = requests.get('https://api.bls.gov/publicAPI/v2/timeseries/data/CUSR0000SA0')
data = response.json()
df = pd.DataFrame(data['Results']['series'][0]['data'])
df.value = df.value.astype('float')

# See the average price by year
print(df.groupby('year')['value'].mean())
library(dplyr)
library(jsonlite)

# All items in U.S. city average, all urban consumers, seasonally adjusted
data <- fromJSON("https://api.bls.gov/publicAPI/v2/timeseries/data/CUSR0000SA0")
df <- data[["Results"]][["series"]][["data"]][[1]]
df$value <- as.numeric(df$value)

# See the average price by year
df %>% group_by(year) %>% summarise(avg_rate = mean(value))
This dataset of U.S. mortality trends since 1900 highlights the differences in age-adjusted death rates and life expectancy at birth by race and sex. This particular dataset has 1,044 observations accounting for year, age, race, sex, average life expectancy in years, and mortality.
The National Center for Health Statistics provides this data as flat files and through a well-documented API.
import pandas as pd

df = pd.read_json('https://data.cdc.gov/resource/bgqx-uh4z.json')

# See average life expectancy for all races and both sexes by year
df[(df.race == 'All Races') & (df.sex == 'Both Sexes')][['average_life_expectancy', 'year']]
library(RSocrata)

df <- read.socrata("https://data.cdc.gov/resource/bgqx-uh4z.json")

# See average life expectancy for all races and both sexes by year
df[df$race == "All Races" & df$sex == "Both Sexes", c("average_life_expectancy", "year")]
The National Center for Health Statistics publishes this data at the county and state level as flat files and through an API. Each dataset describes drug poisoning deaths by selected demographic characteristics, and includes age-adjusted death rates for drug poisoning from 1999 to 2015. The county level dataset has 8 variables/columns and 53,387 rows, which are described in the API Documentation. The state level dataset also has full documentation of its 18 variables/columns and 2,703 rows.
import pandas as pd

# Get all mortality levels for the state of Texas.
df = pd.read_json("https://data.cdc.gov/resource/tenp-43rk.json?st=TX")
library(RSocrata)

# Get all mortality levels for the state of Texas.
df <- read.socrata("https://data.cdc.gov/resource/tenp-43rk.json?st=TX")
This dataset describes injury mortality in the United States beginning in 1999. Two concepts are included in the circumstances of an injury death: intent of injury and mechanism of injury. This particular dataset has 17 variables/columns and 98,280 rows.
Documentation on the latest version of this dataset, with complete information on variables, data sources, the dataset identifier, definitions, and classifications, can be found in the API docs here.
import pandas as pd

# Get all injury mechanisms for mortality in the United States
df = pd.read_json("https://data.cdc.gov/resource/6j4j-ispt.json")
library(RSocrata)

# Get all injury mechanisms for mortality in the United States
df <- read.socrata("https://data.cdc.gov/resource/6j4j-ispt.json")
This data set comes from the Current Population Survey (CPS), a monthly survey of households conducted by the U.S. Census Bureau for the Bureau of Labor Statistics. This large dataset covers the years 1995-1999 and 2002-2017. All data is available in HTML, PDF, and XLSX flat formats, as well as through the Bureau of Labor Statistics API.
The 57 data tables are grouped together into categories.
A full list of tables and variables for the Current Population Survey can be found here.
import requests
import pandas as pd

# Seasonally Adjusted Unemployment Rate
response = requests.get('https://api.bls.gov/publicAPI/v2/timeseries/data/LNS14000000')
data = response.json()
df = pd.DataFrame(data['Results']['series'][0]['data'])
df.value = df.value.astype('float')

# See the average rate by year
print(df.groupby('year')['value'].mean())
library(dplyr)
library(jsonlite)

# Seasonally Adjusted Unemployment Rate
data <- fromJSON("https://api.bls.gov/publicAPI/v2/timeseries/data/LNS14000000")
df <- data[["Results"]][["series"]][["data"]][[1]]
df$value <- as.numeric(df$value)

# See the average rate by year
df %>% group_by(year) %>% summarise(avg_rate = mean(value))
Labor productivity is a measure of economic performance that compares the amount of goods and services produced (output) with the number of hours worked to produce those goods and services. The BLS also publishes measures of multifactor productivity.
The data is organized into two separate databases: Major Sector Productivity and Costs, and Industry Productivity. Both databases are available as flat files and through the Bureau of Labor Statistics API.
import requests
import pandas as pd

# Office of Productivity and Technology, Percent/Rate/Ratio, Productivity: Nonfarm Business
response = requests.get('https://api.bls.gov/publicAPI/v2/timeseries/data/PRS85006092')
data = response.json()
df = pd.DataFrame(data['Results']['series'][0]['data'])
df.value = df.value.astype('float')

# See the rate change by quarter
print(df.sort_values(['year', 'period'])[['year', 'period', 'value']])
library(dplyr)
library(jsonlite)

# Office of Productivity and Technology, Percent/Rate/Ratio, Productivity: Nonfarm Business
data <- fromJSON("https://api.bls.gov/publicAPI/v2/timeseries/data/PRS85006092")
df <- data[["Results"]][["series"]][["data"]][[1]]
df$value <- as.numeric(df$value)

# See the rate change by quarter
print(df[order(df$year, df$period), c("year", "period", "value")])
This dataset presents the age-adjusted death rates for the 10 leading causes of death in the United States beginning in 1999. This particular dataset has 10,296 observations describing the year, state, cause of death, number of deaths, and age-adjusted death rate. Documentation on the latest version of this dataset, with complete information on variables, data sources, the dataset identifier, definitions, and classifications, can be found in the API docs here.
import pandas as pd

# All records for the state of Alabama
df = pd.read_json('https://data.cdc.gov/resource/u4d7-xz8k.json?state=Alabama')
library(RSocrata)

# All records for the state of Alabama
df <- read.socrata("https://data.cdc.gov/resource/u4d7-xz8k.json?state=Alabama")
The Global Health Observatory data repository is the World Health Organization’s gateway to health-related statistics for its 194 Member States. It provides access to over 1000 indicators on priority health topics including mortality and burden of diseases, the Millennium Development Goals (child nutrition, child health, maternal and reproductive health, immunization, HIV/AIDS, tuberculosis, malaria, neglected diseases, water and sanitation), noncommunicable diseases and risk factors, epidemic-prone diseases, health systems, environmental health, violence and injuries, and health equity, among others.
Many of these datasets represent the best estimates of WHO using methodologies for specific indicators that aim for comparability across countries and time. Please check the Indicator and Measurement Registry for indicator specific information. Additional metadata and definitions can be found here. The World Health Organization also provides examples of API usage.
import requests
import numpy as np
import pandas as pd

base = ('http://apps.who.int/gho/athena/data/GHO/{}'
        '.json?profile=simple&filter=COUNTRY:*')
ad_restrictions = 'SA_0000001515'

def get_data(code):
    response = requests.get(base.format(code))
    data = get_data_helper(response.json())
    return pd.DataFrame(data)

def get_data_helper(who_dictionary):
    '''Pads the dictionary so entries are same length'''
    rv = []
    fact_table = who_dictionary['fact']
    for observation in fact_table:
        new_row = observation['dim']
        new_row['VALUE'] = observation['Value']
        rv.append(new_row)
    return rv

def clean_who_data(df):
    df['ad_type'] = df.ADVERTISINGTYPE.apply(lambda s: s.replace(' Ads', ''))
    df.drop(['GHO', 'PUBLISHSTATE', 'ADVERTISINGTYPE'], axis=1, inplace=True)
    df.replace('', np.nan, inplace=True)

if __name__ == "__main__":
    df = get_data(ad_restrictions)
    clean_who_data(df)
    print('Preview:')
    print(df.head())
    df.to_csv('WHO_ad_data.csv', index=False)
library(readr)
library(dplyr)
library(ggplot2)

# Flat file downloaded from:
# http://apps.who.int/gho/data/view.main.REGION2480A?lang=en
data <- read_csv('obesity.csv')
by.region <- data %>%
  group_by(REGION, YEAR) %>%
  summarize(mean.Numeric = mean(Numeric) / 100)

ggplot(by.region, aes(x = YEAR, y = mean.Numeric, color = REGION)) +
  geom_point() +
  geom_line() +
  ggtitle('Obesity Rates Over Time', subtitle = 'Grouped By Region') +
  ylab('Average Obesity Rate')
The Small Business Administration Survey records general characteristics of small businesses in the United States, such as the number of employees, industry, number of locations, and paid wages. It also covers demographic information about owners, such as marital status and ethnicity. The data provided below is for the year 1992.
Survey 1 Download | Documentation
Survey 2 Download | Documentation
import pandas as pd

# read_table works if the file is plain text; if the .dta file is a
# Stata binary, use pd.read_stata instead.
df = pd.read_table('sbaraw-s1.dta')
# read.table works if the file is plain text; if the .dta file is a
# Stata binary, use haven::read_dta instead.
df <- read.table("sbaraw-s1.dta")
The Statistical Abstract of the United States, published from 1878 to 2012, is the authoritative and comprehensive summary of statistics on the social, political, and economic organization of the United States. It is designed to serve as a convenient volume for statistical reference, and as a guide to other statistical publications and sources both in print and on the Web. These sources of data include the U.S. Census Bureau, Bureau of Labor Statistics, Bureau of Economic Analysis, and many other Federal agencies and private organizations.
The documentation is segmented by year, and then separated into parts. For example, the documentation for 1994 can be found here.
import pandas as pd

# Download variables of interest from the data portal.
# You can load the data file like any text file.
df = pd.read_table('default.dat')
# Download variables of interest from the data portal.
# You can load the data file like any text file.
df <- read.table("default.dat")
This dataset assembles all final birth data for females aged 15–19, 15–17, and 18–19 for the United States and each of the 50 states. This particular dataset notes the year, state, age of mother, and relevant birthrates for 4,212 observations.
import pandas as pd

# Observations where the state teen birth rate is 37.5 births per 1,000 females
df = pd.read_json('https://data.cdc.gov/resource/sgfp-ytm5.json?$where=state_rate=37.5')
library(RSocrata)

# Observations where the state teen birth rate is 37.5 births per 1,000 females
df <- read.socrata('https://data.cdc.gov/resource/sgfp-ytm5.json?$where=state_rate=37.5')
Researchers can produce their own time-use estimates using the ATUS microdata files. The ATUS data files include information for over 190,000 respondents total from 2003 to 2017. Because of the size of these data files, it is easiest to work with them using statistical software such as Stata, SAS, or SPSS.
The survey is sponsored by the Bureau of Labor Statistics and is conducted by the U.S. Census Bureau.
The major purpose of ATUS is to develop nationally representative estimates of how people spend their time. The survey also provides information on the amount of time people spend in many other activities, such as religious activities, socializing, exercising, and relaxing. Demographic information such as sex, race, age, educational attainment, etc. is also available for each respondent.
Microdata | Data Dictionary | User Guide
import pandas as pd

mapping = {1: 'New England',
           2: 'Middle Atlantic',
           3: 'East North Central',
           4: 'West North Central',
           5: 'South Atlantic',
           6: 'East South Central',
           7: 'West South Central',
           8: 'Mountain',
           9: 'Pacific'}

df = pd.read_table('atuscps_2017.dat', delimiter=',')
df['division'] = df['GEDIV'].map(mapping)

# See number of housing units by geographic division.
print(pd.crosstab(df.division, df.HEHOUSUT))
df <- read.csv("atuscps_2017.dat")
df$GEDIV <- factor(df$GEDIV)
levels(df$GEDIV) <- c("New England",
                      "Middle Atlantic",
                      "East North Central",
                      "West North Central",
                      "South Atlantic",
                      "East South Central",
                      "West South Central",
                      "Mountain",
                      "Pacific")

# See number of housing units by geographic division.
table(df$GEDIV, df$HEHOUSUT)
Analyze Boston is the City of Boston’s open data hub for facts, figures, and maps related to life within the city. The city is working to make it the default technology platform supporting publication of the City’s public information, in the form of data, and to make this information easy to find, access, and use by a broad audience. The platform is managed by the Citywide Analytics Team.
Each dataset from Analyze Boston typically has metadata and relevant information. For example, this dataset from ParkBoston.
import pandas as pd

df = pd.read_csv('park-boston-monthly-transactions-by-zone-2015.csv')

# Remove trailing whitespace in column names.
df.columns = [c.strip() for c in df.columns]

# See the 20 most used parking zones in January.
print(df.sort_values('January', ascending=False)['Zone Name'].head(20))
df <- read.csv("park-boston-monthly-transactions-by-zone-2015.csv")

# See the 20 most used parking zones in January.
head(df$Zone.Name[order(df$January, decreasing = TRUE)], 20)
This API makes all data available from the Department of Education’s College Scorecard, as well as supporting data on student completion, debt and repayment, earnings, and more. The files include data from 1996 through 2016 for all undergraduate degree-granting institutions of higher education. Data includes institution-level characteristics such as average cost of attendance and retention rates for first-time students, as well as student characteristics such as the ethnicity and age of the student body. The full documentation and data dictionary can be found here.
import requests

key = 'YOUR API KEY HERE'
base = ('https://api.data.gov/ed/collegescorecard/v1/'
        'schools?school.name=chicago&api_key=')
response = requests.get(base + key)
data = response.json()['results']

# See all schools in Chicago
for observation in data:
    print(observation['school']['name'])
library(httr)

key <- "YOUR API KEY HERE"
base <- paste("https://api.data.gov/ed/collegescorecard/v1/",
              "schools?school.name=chicago&api_key=",
              sep = "")
response <- GET(paste(base, key, sep = ""))
data <- content(response, "parsed")[["results"]]

# See all schools in Chicago
for (observation in data) {
  print(observation[["school"]][["name"]])
}
The U.S. Energy Information Administration makes its data free and open through an API, bulk file downloads, Excel / Google Sheets add-ons, and pluggable online widgets. EIA’s API contains datasets centered around hourly electricity operations, state energy systems, petroleum products, crude imports, natural gas, coal, international energy, and the short-term and annual energy outlooks. While the API is offered as a free public service, registration is required. The EIA also provides a number of related resources.
import requests
import pandas as pd

# See MMBtu by year for the plant in Tracy, Nevada.
url = ('http://api.eia.gov/series/?api_key=YOUR_API_KEY&'
       'series_id=ELEC.PLANT.CONS_EG_BTU.2336-ALL-ALL.A')
response = requests.get(url)
info = response.json()
df = pd.DataFrame(info['series'][0]['data'], columns=['Year', 'MMBtu'])
library(httr)
library(purrr)

# See MMBtu by year for the plant in Tracy, Nevada.
url <- paste("http://api.eia.gov/series/?api_key=",
             "YOUR_API_KEY&",
             "series_id=ELEC.PLANT.CONS_EG_BTU.2336-ALL-ALL.A",
             sep = "")
response <- GET(url)
data <- content(response, "parsed")[["series"]][[1]][["data"]]
years <- unlist(transpose(data)[[1]])
MMBtu <- unlist(transpose(data)[[2]])
df <- data.frame(years, MMBtu)
The FBI Crime Data API is a read-only web service that returns JSON or CSV data. It is broadly organized around the FBI’s Uniform Crime Reporting (UCR) data and requires a data.gov API key. Agencies submit data using one of two reporting formats: the Summary Reporting System (SRS) or the National Incident-Based Reporting System (NIBRS).
The FBI also provides full documentation and source code.
import requests
import pandas as pd

base = 'https://api.usa.gov/crime/fbi/sapi/'
query = 'api/summarized/agencies/WY0200100/homicide?api_key='
key = 'your_api_key'

# Homicides recorded by the Jackson Police Department
response = requests.get(base + query + key)
data = response.json()
df = pd.DataFrame(data['results'])
library(jsonlite)

# Homicides recorded by the Jackson Police Department
url <- paste("https://api.usa.gov/crime/fbi/sapi/",
             "api/summarized/agencies/WY0200100/homicide?api_key=",
             "your_api_key",
             sep = "")
data <- fromJSON(url)
df <- data[["results"]]
This data product provided by the USDA contains statistics on four main feed grains (corn, grain sorghum, barley, and oats), as well as foreign coarse grains such as rye and millet, hay, and related items. This includes data published in the monthly Feed Outlook and the previously annual Feed Yearbook. Data are monthly, quarterly, and/or annual depending upon the data series.
The latest data may be preliminary or projected. Missing values indicate unreported values, discontinued series, or data not yet released. The data is available as a bulk download from here.
import pandas as pd

df = pd.read_excel('Feed Grains Yearbook Tables-Recent.xls',
                   sheet_name='FGYearbookTable01',
                   skiprows=[0, 1, 38, 39, 40, 41],
                   names=['commodity', 'year', 'planted', 'harvested',
                          'production', 'yield', 'price', 'loan_rate'],
                   header=None)

# Drop fully empty rows, then forward-fill the commodity labels,
# which only appear on the first row of each group.
df.dropna(how='all', inplace=True)
df.commodity.fillna(method='ffill', inplace=True)
The current version of the Food Environment Atlas has over 275 variables, including new indicators on access and proximity to a grocery store for subpopulations; an indicator on the SNAP Combined Application Project for recipients of Supplemental Security Income (at the State level); and indicators on farmers’ markets that report accepting credit cards or report selling baked and prepared food products. All of the data included in the Atlas are aggregated into an Excel spreadsheet for easy download. These data come from a variety of sources and cover varying years and geographic levels. The documentation for each version of the data provides complete information on definitions and data sources.
import pandas as pd

# After downloading the excel file.
df = pd.read_excel('August2015.xls', sheet_name='PRICES_TAXES')

# See the 10 counties with the highest soda price.
df.sort_values('SODA_PRICE10', ascending=False).head(10)[['State', 'County', 'SODA_PRICE10']]
library(xlsx)

# After downloading the excel file.
df <- read.xlsx("August2015.xls", sheetName = "PRICES_TAXES")

# See the 10 counties with the highest soda price.
head(df[order(df$SODA_PRICE10, decreasing = TRUE),
        c("State", "County", "SODA_PRICE10")], 10)
The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. The survey contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events. The data is available for SPSS and Stata here.
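A minimal sketch for loading a downloaded Stata file with pandas; the filename below is a placeholder, not the actual distribution name.
Example in Python

import pandas as pd

# Load a downloaded GSS Stata extract (the filename is a placeholder).
# convert_categoricals=False keeps the raw numeric codes.
df = pd.read_stata('gss_extract.dta', convert_categoricals=False)
print(df.head())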
IPUMS is not a collection of compiled statistics; it is composed of microdata. Each record is a person, with all characteristics numerically coded. In most samples persons are organized into households, making it possible to study the characteristics of people in the context of their families or other co-residents. Because the data are individual records and not tables, researchers must use a statistical package to analyze the millions of records in the database. A data extraction system enables users to select only the samples and variables they require. Data is received as a gzip file. Data used for publication must be cited. The IPUMS download portal yields a data file as well as command files for SAS, SPSS, Stata, and R. Researchers using R are encouraged to use the ipumsr package (manual).
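As a quick sketch, a CSV-formatted IPUMS extract can be read straight from the gzip file with pandas, which infers the compression from the .gz extension; the filename below is a placeholder.
Example in Python

import pandas as pd

# Read a CSV-formatted extract directly from the gzipped download
# (the filename is a placeholder).
df = pd.read_csv('ipums_extract.csv.gz')
print(df.head())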
This API provides access to data from the Census of Agriculture as well as national, state, and county level surveys. Data is queried by requesting commodities within the sectors of Crops, Animals & Products, Economics, Demographics, and Environmental. The commodity statistics are aggregated for standard census geographies, agricultural statistics districts, and watershed boundaries over annual, seasonal, monthly, weekly, and daily time periods.
Full Documentation, a Data Dictionary, and API Registration can be found here.
Example in Python
import requests
import pandas as pd

key = 'your_api_key'

# Observations for corn in Virginia from 2010 onward
url = ('http://quickstats.nass.usda.gov/api/api_GET/?key={}'
       '&commodity_desc=CORN&year__GE=2010&state_alpha=VA')
response = requests.get(url.format(key))
data = response.json()
df = pd.DataFrame(data['data'])
library(jsonlite)

key <- "your_api_key"

# Observations for corn in Virginia from 2010 onward
url <- paste("http://quickstats.nass.usda.gov/api/api_GET/?key=",
             key,
             "&commodity_desc=CORN&year__GE=2010&state_alpha=VA",
             sep = "")
data <- fromJSON(url)
df <- data[["data"]]
The purpose of this study was to collect extensive information on the sexual experiences and other social, demographic, attitudinal, and health-related characteristics of adults in the United States. The survey collected information on sexual practices with spouses/cohabitants and other sexual partners and collected background information about the partners. Major areas of investigation include sexual experiences such as number of sexual partners in given time periods, frequency of particular practices, and timing of various sexual events. The data cover childhood and adolescence, as well as adulthood. Other topics in the survey relate to sexual victimization, marriage and cohabitation, and fertility. Respondents were also queried about their physical health, including history of sexually transmitted diseases. Respondents’ attitudes toward premarital sex, the appeal of particular practices such as oral sex, and levels of satisfaction with particular sexual relationships were also studied. Demographic items include race, education, political and religious affiliation, income, and occupation.
The codebook can be found here.
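A minimal loading sketch; the Stata format and the filename below are assumptions about the downloaded files, not distribution details from the source.
Example in Python

import pandas as pd

# Load the downloaded study file (Stata format and filename assumed).
df = pd.read_stata('nhsls.dta')
print(df.shape)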
The NLS program provides information on the labor market activities and other significant life events of several groups of men and women at multiple points in time. For more than four decades, NLS data have served as an important tool for economists, sociologists, and other researchers. The NLS program includes several cohorts.
The download functionality for these data sets provides access to files for SPSS, SAS, Stata, and R, or simply a CSV. A tagset, codebook, description file, and log file are also included with a download.
The R, SAS, and SPSS files contain the code needed to load the data set, as well as short explanations of missing values and level names.
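A minimal sketch for the CSV version of a download; the filename below is a placeholder for whatever the download produces.
Example in Python

import pandas as pd

# Load the CSV produced by the download (the filename is a placeholder).
df = pd.read_csv('nls_extract.csv')
print(df.head())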
The Nationwide Readmissions Database is designed to support various types of analyses of national readmission rates. The NRD includes discharges for patients with and without repeat hospital visits in a year and those who have died in the hospital. The criteria to determine the relationship between hospital admissions is left to the analyst using the NRD. This database is compiled by the Agency for Healthcare Research and Quality (AHRQ), which provides a Data Dictionary and Full Documentation.
This data must be purchased; follow this link for more information.
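Once purchased, the core file is large, so a chunked read is a reasonable approach; the CSV format and filename below are assumptions, not the actual distribution details.
Example in Python

import pandas as pd

# Read the purchased core file in chunks, since the full NRD is large
# (the filename and CSV format are assumptions).
chunks = pd.read_csv('nrd_core.csv', chunksize=100000)
for chunk in chunks:
    print(chunk.shape)  # preview the size of the first chunk
    break               # stop after the first chunk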
This data collection consists of a survey of private ownership of firearms by adults in the United States. Respondents who both did and did not own firearms were included. The variables cover topics such as the number and type of guns owned privately, methods of, and reasons for, firearms acquisition, the storage and carrying of guns, the defensive use of firearms against criminal attackers, and reasons for and against firearm ownership. Basic demographic variables include sex, age, education, and employment.
The full codebook can be viewed here.
The public-use data files in this collection are available for access by the general public. Access does not require affiliation with an ICPSR member institution. However, an ICPSR login is still required to download the data itself.
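A minimal loading sketch; the tab-delimited format and the filename below are assumptions about the downloaded files.
Example in Python

import pandas as pd

# Load the downloaded public-use file (tab-delimited format and
# filename are assumptions).
df = pd.read_csv('firearms_survey.tsv', sep='\t')
print(df.head())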
The MyTSA Web Service API supports several features, including Security Checkpoint Wait Times, TSA Pre-Check locations, and sunrise/sunset times for all locations. Data can be queried by state and/or airport. The TSA provides XML files of data in addition to the API with documentation.
Example in Python
import requests

# Returns JSON of today's sunset time for DCA airport
response = requests.get('http://apps.tsa.dhs.gov/MyTSAWebService/'
                        'GetEventInfo.ashx?eventtype=sunset&airportcode=DCA&output=json')
data = response.json()
library(jsonlite)

# Returns JSON of today's sunset time for DCA airport
data <- fromJSON('http://apps.tsa.dhs.gov/MyTSAWebService/GetEventInfo.ashx?eventtype=sunset&airportcode=DCA&output=json')