Data Census

US Births and General Fertility Rates (NCHS)

The National Center for Health Statistics (NCHS) provides this data as flat files and through a well-documented API. The dataset includes crude birth rates and general fertility rates in the United States since 1909. It has 107 observations, each recording the year, number of births, crude birth rate, and general fertility rate.

import pandas as pd

# See all observations in 1909.
df = pd.read_json('https://data.cdc.gov/resource/tndt-s2gv.json?year=1909')

library(RSocrata)

# See all observations in 1909.
df <- read.socrata("https://data.cdc.gov/resource/tndt-s2gv.json?year=1909")
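
The `?year=1909` filter above is the simplest form of a Socrata query. As a minimal sketch, assuming the endpoint also accepts the standard SoQL parameters `$where`, `$order`, and `$limit`, a range of years could be requested as shown here. The helper name `soda_url` is hypothetical, and the `year` comparison assumes the column is stored as a number; if it is stored as text, the values may need quoting.

```python
from urllib.parse import urlencode

BASE = "https://data.cdc.gov/resource/tndt-s2gv.json"

def soda_url(base, where=None, order=None, limit=None):
    """Build a query URL from optional SoQL clauses (hypothetical helper)."""
    params = {}
    if where:
        params["$where"] = where
    if order:
        params["$order"] = order
    if limit is not None:
        params["$limit"] = limit
    # urlencode escapes spaces, '$', and comparison operators so the URL is valid.
    return f"{base}?{urlencode(params)}" if params else base

# Assumption: 'year' is numeric; quote the values if the API rejects this clause.
url = soda_url(BASE, where="year >= 1950 AND year < 1960", order="year")
# df = pd.read_json(url)  # uncomment to fetch; requires network access
```

Building the query with `urlencode` rather than by string concatenation avoids hand-escaping the clause.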

US Death Rates and Life Expectancy at Birth (NCHS)

This dataset of U.S. mortality trends since 1900 highlights the differences in age-adjusted death rates and life expectancy at birth by race and sex. This particular dataset has 1,044 observations accounting for year, age, race, sex, average life expectancy in years, and mortality.

The National Center for Health Statistics provides this data as flat files and through a well-documented API.

import pandas as pd

df = pd.read_json('https://data.cdc.gov/resource/bgqx-uh4z.json')

# See Average Life Expectancy for All Races and Genders by Year
df[(df.race == 'All Races') & (df.sex == 'Both Sexes')][['average_life_expectancy', 'year']]

library(RSocrata)

df <- read.socrata("https://data.cdc.gov/resource/bgqx-uh4z.json")

# See Average Life Expectancy for All Races and Genders by Year
df[df$race == "All Races" & df$sex == "Both Sexes", c("average_life_expectancy", "year")]
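
One caveat when loading this dataset with `pd.read_json`: Socrata endpoints typically return at most 1,000 rows per request unless `$limit` is set, so a plain request may not return all 1,044 observations. The sketch below pages through the data with `$limit` and `$offset`. The helper name `page_urls` and the page size are assumptions, not part of the NCHS documentation.

```python
from urllib.parse import urlencode

BASE = "https://data.cdc.gov/resource/bgqx-uh4z.json"

def page_urls(base, total_rows, page_size=1000):
    """Return one URL per page needed to cover total_rows (hypothetical helper)."""
    urls = []
    for offset in range(0, total_rows, page_size):
        # A stable $order keeps rows from shifting between pages.
        query = urlencode({"$limit": page_size, "$offset": offset, "$order": ":id"})
        urls.append(f"{base}?{query}")
    return urls

urls = page_urls(BASE, total_rows=1044)
# df = pd.concat([pd.read_json(u) for u in urls], ignore_index=True)  # needs network
```

Alternatively, a single request with a `$limit` larger than the row count should work for a dataset this small.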

Drug Poisoning Mortality (NCHS)

The National Center for Health Statistics publishes this data at the county and state level as flat files and through an API. Each dataset describes drug poisoning deaths by selected demographic characteristics and includes age-adjusted death rates for drug poisoning from 1999 to 2015. The county-level dataset has 8 variables (columns) and 53,387 rows, which are described in the API Documentation. The state-level dataset has 18 variables (columns) and 2,703 rows, and is also fully documented.

import pandas as pd

# Get all mortality levels for the state of Texas.
df = pd.read_json("https://data.cdc.gov/resource/tenp-43rk.json?st=TX")

library(RSocrata)

# Get all mortality levels for the state of Texas.
df <- read.socrata("https://data.cdc.gov/resource/tenp-43rk.json?st=TX")

Global Health Observatory Data Repository

The Global Health Observatory data repository is the World Health Organization’s gateway to health-related statistics for its 194 Member States. It provides access to over 1,000 indicators on priority health topics, including mortality and burden of diseases, the Millennium Development Goals (child nutrition, child health, maternal and reproductive health, immunization, HIV/AIDS, tuberculosis, malaria, neglected diseases, water and sanitation), noncommunicable diseases and risk factors, epidemic-prone diseases, health systems, environmental health, violence and injuries, and equity, among others.

Many of these datasets represent the best estimates of WHO using methodologies for specific indicators that aim for comparability across countries and time. Please check the Indicator and Measurement Registry for indicator specific information. Additional metadata and definitions can be found here. The World Health Organization also provides examples of API usage.

import requests
import numpy as np
import pandas as pd

base = ('http://apps.who.int/gho/athena/data/GHO/{}'
        '.json?profile=simple&filter=COUNTRY:*')

ad_restrictions = 'SA_0000001515'

def get_data(code):
    response = requests.get(base.format(code))
    data = get_data_helper(response.json())
    return pd.DataFrame(data)

def get_data_helper(who_dictionary):
    '''Flattens each entry of the 'fact' table into one row dictionary.'''
    rv = []
    fact_table = who_dictionary['fact']
    for observation in fact_table:
        new_row = observation['dim']
        new_row['VALUE'] = observation['Value']
        rv.append(new_row)
    return rv

def clean_who_data(df):
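    # Strip the trailing ' Ads' from each advertising type label.
    df['ad_type'] = df.ADVERTISINGTYPE.apply(lambda s: s.replace(' Ads', ''))
    df.drop(['GHO', 'PUBLISHSTATE', 'ADVERTISINGTYPE'], axis=1, inplace=True)
    # Treat empty strings as missing values.
    df.replace('', np.nan, inplace=True)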

if __name__ == "__main__":
    df = get_data(ad_restrictions)
    clean_who_data(df)
    print('Preview:') 
    print(df.head())
    df.to_csv('WHO_ad_data.csv', index=False)
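
To make the reshaping step concrete, here is a small sketch of what `get_data_helper` does to one response. The sample dictionary is invented for illustration and only mirrors the `fact`, `dim`, and `Value` keys that the script reads; real responses may carry more fields. The name `flatten_facts` is hypothetical.

```python
def flatten_facts(who_dictionary):
    """Same reshaping as get_data_helper, but copies each row instead of mutating the input."""
    rows = []
    for observation in who_dictionary["fact"]:
        row = dict(observation["dim"])       # copy the dimension labels
        row["VALUE"] = observation["Value"]  # attach the observed value
        rows.append(row)
    return rows

# Hypothetical payload: the values are placeholders, not real WHO data.
sample = {
    "fact": [
        {"dim": {"COUNTRY": "A", "YEAR": "2012"}, "Value": "Yes"},
        {"dim": {"COUNTRY": "B", "YEAR": "2012"}, "Value": ""},
    ]
}

rows = flatten_facts(sample)
print(rows[0])
```

Each row then becomes one line of the resulting DataFrame, with the dimension names as columns alongside `VALUE`.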

library(readr)
library(dplyr)
library(ggplot2)

# Flat file downloaded from:
# http://apps.who.int/gho/data/view.main.REGION2480A?lang=en

data <- read_csv('obesity.csv')

by.region <- data %>%
  group_by(REGION, YEAR) %>%
  summarize(mean.Numeric = mean(Numeric)/100)

ggplot(by.region, aes(x = YEAR, y = mean.Numeric, color = REGION)) +
  geom_point() +
  geom_line() +
  ggtitle('Obesity Rates Over Time', subtitle = 'Grouped By Region') +
  ylab('Average Obesity Rate')

Nationwide Readmissions Database (NRD)

The Nationwide Readmissions Database is designed to support various types of analyses of national readmission rates. The NRD includes discharges for patients with and without repeat hospital visits in a year, as well as those who died in the hospital. The criteria for determining the relationship between hospital admissions are left to the analyst using the NRD. This database was compiled by the Agency for Healthcare Research and Quality (AHRQ), which provides a Data Dictionary and Full Documentation.
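
Because the linkage rule is left to the analyst, one common choice is to flag a discharge as an index stay when the same patient is admitted again within a fixed window. The sketch below assumes a 30-day window and a hypothetical record layout of `(patient, admit_day, discharge_day)` with days expressed as offsets. It does not use the NRD's actual variable names, which are listed in the Data Dictionary.

```python
from collections import defaultdict

def flag_readmissions(stays, window_days=30):
    """Return the indexes of stays followed by a readmission within window_days.

    stays: list of (patient, admit_day, discharge_day) tuples (hypothetical layout).
    """
    by_patient = defaultdict(list)
    for idx, (patient, admit, discharge) in enumerate(stays):
        by_patient[patient].append((admit, discharge, idx))

    flagged = set()
    for visits in by_patient.values():
        visits.sort()  # chronological order by admission day
        # Compare each stay with the same patient's next stay.
        for (_, discharge, idx), (next_admit, _, _) in zip(visits, visits[1:]):
            if 0 <= next_admit - discharge <= window_days:
                flagged.add(idx)
    return flagged

stays = [
    ("p1", 0, 5),    # readmitted 10 days after discharge
    ("p1", 15, 20),  # next admission is 70 days later, outside the window
    ("p1", 90, 92),
    ("p2", 3, 4),    # single stay, no readmission
]
print(flag_readmissions(stays))
```

Other definitions (for example, excluding planned readmissions or transfers) would change the comparison inside the loop, not the overall structure.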

This data must be purchased. Follow this link for more information.