Exploring Data from ProPublica’s Forensics API
Update: June 20, 2020: The API appears to have changed structure, so this code no longer functions. The RPublica
app has not been
updated since 2017-10-10. All code chunks are not run.
library(tidyverse)
library(sf)
library(RPublica)
library(albersusa)
library(visdat)
library(ggrepel)
library(colorRamps)
ProPublica
For those who don’t know, ProPublica is “an independent, nonprofit newsroom that produces investigative journalism with moral force.” In addition to being a force for truth, they are also a force for openness, and have several publicly available data sets in their Data Store with a wide array of interests and implications. I recently stumbled across their forensics API while doing some research for CSAFE. As a researcher for CSAFE, I thought it’d be pertinent to our work to download and explore this data, which pertains to coroner and medical examiner systems in the United States. For more on ProPublica’s forensic-related endeavors, see their website.
Getting the Data
There is, of course, an
R
package for accessing the data from this API. It’s called
RPublica
and it’s authored by
Thomas J. Leeper. There are two functions for accessing the ProPublica Forensics API:
geos()
: Access information about the geographical levels (state or county) with coroner or medical examiner systems in the U.S.systems()
: Query further details about a given system using anid
from ageos
query.
First, I query all available geos
:
# install.packages("RPublica")
allgeos <- RPublica::geos()
The geo-level variables returned by geos()
and prefixed with geo.
are:
ar_association_key_name
: The location ID to pass to thesystems()
functioncached_slug
: Shortened name of geocoroner_system
,medical_examiner_system
,system_types.medical_examiner_system
,system_types.coroner_system
: “In the case of county geos, we report if the county is part of astate
,county
ordistrict
system (or none at all). If we don’t have data, it will returnunavailable
. In the case of states, we return the same values, though we also addcounty_state
for states that have a statewide system with exception counties, andsometimes
for when certain counties have medical examiner systems and others have coroner systems. At the county level, we returnunavailable
when we haven’t gathered data for the county, and we can’t deduce the system type from its parent state (i.e. when its state’s system iscounty_state
orsometimes
).” (API documentation)created_at
: Date entry createddistrict
: Returns allNA
valuesid
: Geo-level ID variablekind
: Is it astate
orcounty
system?name
: State or county namenote
: Other details about the geopopulation
: Returns allNA
valuespostal_abbreviation
: Two-letter state abbreviationupdated_at
: Date entry last updated
First, let’s look at which states and counties we have information on:
# mapping the offices
# usa county map
usa_counties <- albersusa::counties_sf(proj = "longlat")
# usa state map
usa_states <- albersusa::usa_sf(proj = "longlat")
# get just county names
allgeos %>%
mutate(geo.name = str_replace(geo.name, " County", "")) -> allgeos2
# separate states and counties
cnty_geos <- allgeos2 %>%
filter(geo.kind !="state")
state_geos <- allgeos2 %>%
filter(geo.kind == "state")
# merge with map data
cnty_geos_map <- right_join(usa_counties, cnty_geos, by= c("name"="geo.name", "iso_3166_2"= "geo.postal_abbreviation"))
state_geos_map <- left_join(usa_states, state_geos, by = c("name"="geo.name", "iso_3166_2"="geo.postal_abbreviation"))
# map
ggplot() +
geom_sf(data = state_geos_map, aes(fill = geo.kind, geometry = geometry)) +
geom_sf(data = cnty_geos_map, aes(fill = geo.kind, geometry = geometry), color = "white", size = .1) +
scale_fill_manual(values = c('red', 'blue',"white")) +
coord_sf(crs = st_crs(state_geos_map$geometry)) +
theme_void() +
theme(panel.grid.major = element_line(color = "white"),
legend.position = c(.8,.3))
So, there are a lot of states missing completely, and many where only a few counties are represented. This is important to keep in mind throughout the rest of our analysis.
Next, I use the
purrr::map()
function to query all data from each geo using the systems()
function. First, I wrote my own functions to extract each piece (geos
, stats
, and the other information) one-at-a-time. I did this so that the
tidyr::unnest()
function I use later on to unnest all the data from the list columns would work. The final step here is to look at all the data using the
visdat
package, which allows us to see where there are missing data quickly and easily.
# get the geos information
get_sys_geos <- function(num){
dat <- try(RPublica::systems(id = num))
if (class(dat) == "try-error"){
return(NA)
} else{
geos <- dat$geos
geos
}
}
# get the stats information
get_sys_stats <- function(num){
dat <- RPublica::systems(id = num)
stats <- dat$stats
if (length(stats) == 0){
stats2 <- data.frame()
} else {
stats2 <- stats %>% spread(kind, value)
}
return(stats2)
}
# get the remaining info
get_sys_oth <- function(num){
dat <- RPublica::systems(id = num)
idx <- !(names(dat) %in% c("geos", "stats"))
other <- data.frame(t(unlist(dat[idx])), stringsAsFactors = F)
return(other)
}
# use purrr to extract all the data
alldat2 <- allgeos %>%
mutate(geos = map(.x = geo.ar_association_key_name, get_sys_geos),
stats = map(.x = geo.ar_association_key_name, get_sys_stats),
other = map(.x = geo.ar_association_key_name, get_sys_oth))
# use unnest to unnest the data (What an aptly named function! I can't think of another way to describe it.)
alldat3 <- alldat2 %>%
unnest(geos, .drop=FALSE) %>%
unnest(stats, .drop=FALSE) %>%
unnest(other, .drop = FALSE)
visdat::vis_dat(alldat3)
It looks like there are some columns that are all NA
: geo.district
, geo.note
, geo.population
, district
, note
, population
. It seems like they were stored in one place (e.g. the population1
column) and repeated elsewhere with no values. I drop those columns and continue.
The stats
and other variables are, according to the documentation:
website A website for the system.
accreditation
: Accreditation status from the National Association of Medical Examiners. One ofInspection in progress
,Provisional
,Full Accreditation
bc_forensic_pathologists
: Board Certified Forensic Pathologists working in this system.pt_bc_forensic_pathologists
: Part-time Board Certified Forensic Pathologists working in this system.uc_forensic_pathologists
: Uncertified Forensic Pathologists working in this system.pt_uc_forensic_pathologists
: Part-time Uncertified Forensic Pathologists working in this system.fellows
: Fellows working in this system.autopsy_ratio
: The percentage difference between the actual autopsy rate and the expected autopsy rate according to ProPublica’s analysis.raw_autopsy_rate
: The number of autopsies performed per 100 deaths.autopsy_standard_deviation
: Number of standard deviations based on the distribution. If less than -1, the system performed fewer autopsies than expected. If greater than 1, the system performed more autopsies than expectedpopulation
: State or county population in a given year.autopsies
: Number of completed autopsies performed in a given year.contract_cases
: Number of autopsies performed on cases outside of an office’s jurisdiction but which they take on for smaller counties.forensic_pathologists
: Number of forensic pathologists employed by the state or county in a given year.pt_provisional_forensic_pathologists
: Forensic pathologists employed on a part-time basis, meaning anything from three days a week to an as-needed basis.annual_fiscal_budget
: Annual budget for the state/county coroner or medical examiner system. (May change to annual_budget since budgets are either fiscal or calendar).
Note: a forensic pathologist is a pathologist who focuses on determining the cause of death by examining a corpse (Wikipedia). See more about the variables in this data and how they were calculated here.
Data Exploration
First, let’s look at the relationship between number of autopsies and annual budget:
alldat4 %>%
ggplot() +
geom_point(aes(x = autopsies, y = annual_fiscal_budget)) +
ggrepel::geom_label_repel(data = alldat4 %>% filter(autopsies > 3500) %>% select(autopsies, annual_fiscal_budget, year, name1) %>% unique,
aes(x = autopsies, y = annual_fiscal_budget, label = name1), size = 2) +
facet_wrap(~year, nrow=2)
New York City has far more autopsies and a much higher budget than the others, but the North Carolina medical examiner typically performs way more autopsies than offices with similar budgets.
Next, let’s look at the personnel. First, I get all of the information about pathologists together, then look at the raw numbers of the qualifications of the pathologists.
paths <- alldat4 %>%
select(c(8,9,37, 32, 22, which(str_detect(names(alldat4), "pathologist")))) %>%
unique()
paths <- paths %>% gather(type, number, pt_provisional_forensic_pathologists:uc_forensic_pathologists) %>%
mutate(number = parse_integer(number))
paths <- paths %>% mutate(type = str_replace(type, "_forensic_pathologists", ""))
paths %>% group_by(year, type) %>% summarize(total = sum(number, na.rm = T)) %>%
ggplot( aes(x = type, weight = total )) +
geom_bar() +
facet_wrap(~year, nrow = 2) +
labs(x = "Pathologist type", y = "Total count")
Next, I combine the two uncertified categories and look at which organizations have the most uncertified pathologists working for them, on average from 2004-2010:
paths %>% filter(type %in% c("pt_uc", "uc")) %>%
filter(number > 0) %>%
group_by(year, name1, geo.postal_abbreviation) %>%
summarize(tot = sum(number, na.rm = T)) %>%
ungroup() %>% group_by(name1, geo.postal_abbreviation) %>%
summarize(mean = mean(tot, na.rm = T)) %>% arrange(desc(mean))-> uc_paths
ggplot(data = uc_paths) +
geom_bar(aes(x = reorder(name1, mean), weight = mean, fill = geo.postal_abbreviation)) +
coord_flip() +
labs(x = "Office") +
scale_fill_manual(values = colorRamps::primary.colors(23), name = "State")
Two of the busiest offices, North Carolina and New York City, also have the most uncertified forensic pathologists. This could be, as ProPublica describes here, because they have the most new hires who haven’t yet passed their certification, a test only available once per year. Combined with the previous budget information, this suggests that the medical examiner’s office of North Carolina is in desperate need of more funding so that they can hire more certified pathologists.
Finally, let’s look at the accredidation of medical examiner offices.
# separate states and counties
cnty_data <- alldat4 %>%
filter(geo.kind !="state") %>%
mutate(name = str_replace(name, " County", ""))
state_data <- alldat4 %>%
filter(geo.kind == "state")
# merge with map data
cnty_geos_map2 <- merge(cnty_data, usa_counties, by.x = c("name", "geo.postal_abbreviation"), by.y = c("name", "iso_3166_2"), all = T)
state_geos_map2 <- merge(state_data, usa_states, by.x = c("geo.name", "geo.postal_abbreviation"), by.y = c("name", "iso_3166_2"), all = T)
# map
ggplot() +
geom_sf(data = cnty_geos_map2, aes(fill = accreditation, geometry = geometry), color = "white") +
geom_sf(data = state_geos_map2, aes(fill = accreditation, geometry = geometry)) +
scale_fill_manual(values = c("#e41a1c","#377eb8","#4daf4a","#984ea3","white")) +
coord_sf(crs = st_crs(state_geos_map$geometry)) +
theme_void() +
theme(panel.grid.major = element_line(color = "white"),
legend.position = c(.95,.3))
Labs that show up red on the map have blank value for accreditation. Based on my reading of the ProPublica documentation, it is unclear to me if those values are missing or if it should be “full” accreditation.
Summary and References
This is just a small example of what can be done with this data. We’ve learned that The dataset alldat4
that I used can be downloaded
here.
Thanks to the following for their R packages used above:
- Hadley Wickham (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse
- Edzer Pebesma (2017). sf: Simple Features for R. R package version 0.4-3. https://CRAN.R-project.org/package=sf
- Thomas J. Leeper (2015). RPublica: ProPublica API Client. R package version 0.1.3.
- Bob Rudis (2017). albersusa: Tools, Shapefiles & Data to Work with an ‘AlbersUSA’ Composite Projection. R package version 0.3.0. http://github.com/hrbrmstr/albersusa
- Nicholas Tierney (2017). visdat: Preliminary Data Visualisation. R package version 0.1.0. https://CRAN.R-project.org/package=visdat
- Kamil Slowikowski (2017). ggrepel: Repulsive Text and Label Geoms for ‘ggplot2’. R package version 0.7.0. https://CRAN.R-project.org/package=ggrepel
- Tim Keitt (2012). colorRamps: Builds color tables. R package version 2.3. https://CRAN.R-project.org/package=colorRamps