Visualizing Dead Presidents | Ben Cunningham

Visualizing Dead Presidents

Written by Ben Cunningham · on January 20, 2017

I’ve heard quite a bit recently about President Trump being the oldest individual to take the executive oath. I’m not a big history buff, but all the hubbub did get me wondering about the timeline and ages of past presidents. I’ve also been looking for a chance to recreate the plot in this analysis by FiveThirtyEight, so I figured this might be a great way to combine the two ideas.

To build the visualization, I figured I would need a couple data points for every president: date of birth, date of death, and inauguration date. If you are so inclined, check out my script for scraping the data here. Otherwise, I’ll start this post with the data in life.rds and term.rds.

Data Transformation

The life.rds file captures birth and death dates for every president. It’s already in a tidy format and didn’t require much processing right away, aside from converting date types.

library(tidyverse)

(life <-
  read_rds("data/life.rds") %>%
  mutate_at(vars(birth, death), funs(as.Date(., "%b %d, %Y"))) %>%
  select(president, birth, death))
## # A tibble: 44 x 3
##                 president      birth      death
##                     <chr>     <date>     <date>
##  1      George Washington 1732-02-22 1799-12-14
##  2             John Adams 1735-10-30 1826-07-04
##  3       Thomas Jefferson 1743-04-13 1826-07-04
##  4          James Madison 1751-03-16 1836-06-28
##  5           James Monroe 1758-04-28 1831-07-04
##  6      John Quincy Adams 1767-07-11 1848-02-23
##  7         Andrew Jackson 1767-03-15 1845-06-08
##  8       Martin Van Buren 1782-12-05 1862-07-24
##  9 William Henry Harrison 1773-02-09 1841-04-04
## 10             John Tyler 1790-03-29 1862-01-18
## # ... with 34 more rows

The term data, on the other hand, wasn’t so ready-to-go. Here’s a quick peak:

## # A tibble: 45 x 5
##                 President     First   Second Third Fourth
##                     <chr>     <chr>    <chr> <chr>  <chr>
##  1      George Washington 4/30/1789 3/4/1793             
##  2             John Adams  3/4/1797                      
##  3       Thomas Jefferson  3/4/1801 3/4/1805             
##  4          James Madison  3/4/1809 3/4/1813             
##  5           James Monroe  3/4/1817 3/5/1821             
##  6      John Quincy Adams  3/4/1825                      
##  7         Andrew Jackson  3/4/1829 3/4/1833             
##  8       Martin Van Buren  3/4/1837                      
##  9 William Henry Harrison  3/4/1841                      
## 10             John Tyler  4/6/1841                      
## # ... with 35 more rows

There were a couple of things I knew I wanted to do:

  1. Unpivot the term fields.
  2. Filter out “blank” rows where a president didn’t have additional terms.
  3. Coerce the date column.
  4. Order all of the rows ascending by date.
(term <-
  read_rds("data/term.rds") %>%
  gather(term, term_start, First:Fourth) %>%
  filter(term_start != "") %>%
  mutate(term_start = as.Date(term_start, "%m/%d/%Y")) %>%
  arrange(term_start))
## # A tibble: 67 x 3
##            President   term term_start
##                <chr>  <chr>     <date>
##  1 George Washington  First 1789-04-30
##  2 George Washington Second 1793-03-04
##  3        John Adams  First 1797-03-04
##  4  Thomas Jefferson  First 1801-03-04
##  5  Thomas Jefferson Second 1805-03-04
##  6     James Madison  First 1809-03-04
##  7     James Madison Second 1813-03-04
##  8      James Monroe  First 1817-03-04
##  9      James Monroe Second 1821-03-05
## 10 John Quincy Adams  First 1825-03-04
## # ... with 57 more rows

There. A much tidier format.

But there were still a few issues. I didn’t really care about consecutive terms, so I needed to find a way to filter them out. I ended up doing this by comparing the president name column with a single-index lag of itself.

(term <-
  term %>%
  mutate(consec = if_else(lag(President) != President, 0, 1)) %>%
  filter(consec == 0 | is.na(consec)))
## # A tibble: 45 x 4
##                 President  term term_start consec
##                     <chr> <chr>     <date>  <dbl>
##  1      George Washington First 1789-04-30     NA
##  2             John Adams First 1797-03-04      0
##  3       Thomas Jefferson First 1801-03-04      0
##  4          James Madison First 1809-03-04      0
##  5           James Monroe First 1817-03-04      0
##  6      John Quincy Adams First 1825-03-04      0
##  7         Andrew Jackson First 1829-03-04      0
##  8       Martin Van Buren First 1837-03-04      0
##  9 William Henry Harrison First 1841-03-04      0
## 10             John Tyler First 1841-04-06      0
## # ... with 35 more rows

A couple more things: I also wanted a term end date for each president; I made the assumption that this is the inauguration date of the next president (if this isn’t accurate, apologies to all of my middle school history teachers). Also, I went ahead and numbered the consecutive terms (because I did recall from 6th grade social studies that Grover Cleveland had non-consecutive terms), and I anticipated needing to do some grouping later on.

(term <-
  term %>%
  mutate(
    term_end = lead(term_start),
    n = seq_along(President)
  ) %>%
  select(n, president = President, term_start, term_end))
## # A tibble: 45 x 4
##        n              president term_start   term_end
##    <int>                  <chr>     <date>     <date>
##  1     1      George Washington 1789-04-30 1797-03-04
##  2     2             John Adams 1797-03-04 1801-03-04
##  3     3       Thomas Jefferson 1801-03-04 1809-03-04
##  4     4          James Madison 1809-03-04 1817-03-04
##  5     5           James Monroe 1817-03-04 1825-03-04
##  6     6      John Quincy Adams 1825-03-04 1829-03-04
##  7     7         Andrew Jackson 1829-03-04 1837-03-04
##  8     8       Martin Van Buren 1837-03-04 1841-03-04
##  9     9 William Henry Harrison 1841-03-04 1841-04-06
## 10    10             John Tyler 1841-04-06 1845-03-04
## # ... with 35 more rows

Fuzzy Joining

With my two sources tidied up, I just needed to join them together. However, playing around in the console quickly reveals that this wasn’t going to be as straightforward as a simple left join.

Here’s a peak at the rows that seemed to be giving me grief:

anti_join(term, life, by = "president")
## # A tibble: 7 x 4
##       n             president term_start   term_end
##   <int>                 <chr>     <date>     <date>
## 1    11       James Knox Polk 1845-03-04 1849-03-05
## 2    32 Franklin D. Roosevelt 1933-03-04 1945-04-12
## 3    34  Dwight D. Eisenhower 1953-01-20 1961-01-20
## 4    37      Richard M. Nixon 1969-01-20 1974-08-09
## 5    38        Gerald R. Ford 1974-08-09 1977-01-20
## 6    42    William J. Clinton 1993-01-20 2001-01-20
## 7    45       Donald J. Trump 2017-01-20         NA

Evidently, a couple of presidents were named slightly differently between the data sets. No big deal though, I thought.

A minute later, I was promptly reminded that fuzzy matching is no free lunch.

The situation I was in was somewhat unique: I had an n-length vector that I wanted to match with another n-length vector to finally return an n-length table. I could conceivably adjust the maximum distance parameter all day (for making a positive match), but it would likely leave me with tables of length > n for large values (i.e., names like George H. W. Bush match with George H. W. Bush and George W. Bush) or length < n for small values.

Instead, I opted for the simple solution suggested by David Robinson in this GitHub issue: pick an arbitrarily-large maximum distance (to create many matches), group on an appropriate field, and then filter down to the observation with the smallest distance for each group.

library(fuzzyjoin)

(pres <-
  stringdist_left_join(
    term, life, by = "president",
    max_dist = 10, distance_col = "dist"
  ) %>%
  group_by(n) %>%
  top_n(1, -dist) %>%
  select(n, president = president.x, term_start:term_end, birth:death) %>%
  ungroup())
## # A tibble: 45 x 6
##        n              president term_start   term_end      birth
##    <int>                  <chr>     <date>     <date>     <date>
##  1     1      George Washington 1789-04-30 1797-03-04 1732-02-22
##  2     2             John Adams 1797-03-04 1801-03-04 1735-10-30
##  3     3       Thomas Jefferson 1801-03-04 1809-03-04 1743-04-13
##  4     4          James Madison 1809-03-04 1817-03-04 1751-03-16
##  5     5           James Monroe 1817-03-04 1825-03-04 1758-04-28
##  6     6      John Quincy Adams 1825-03-04 1829-03-04 1767-07-11
##  7     7         Andrew Jackson 1829-03-04 1837-03-04 1767-03-15
##  8     8       Martin Van Buren 1837-03-04 1841-03-04 1782-12-05
##  9     9 William Henry Harrison 1841-03-04 1841-04-06 1773-02-09
## 10    10             John Tyler 1841-04-06 1845-03-04 1790-03-29
## # ... with 35 more rows, and 1 more variables: death <date>

More Transformation

I just needed a few more aesthetic variables before I could slap the data into a plot. First, I anticipated wanting to do something special with presidents who died in office, so I approximated that boolean as follows:

(pres <-
  pres %>%
  mutate(died_in_office = if_else(term_end < death - 2 | is.na(death), FALSE, TRUE)))
## # A tibble: 45 x 7
##        n              president term_start   term_end      birth
##    <int>                  <chr>     <date>     <date>     <date>
##  1     1      George Washington 1789-04-30 1797-03-04 1732-02-22
##  2     2             John Adams 1797-03-04 1801-03-04 1735-10-30
##  3     3       Thomas Jefferson 1801-03-04 1809-03-04 1743-04-13
##  4     4          James Madison 1809-03-04 1817-03-04 1751-03-16
##  5     5           James Monroe 1817-03-04 1825-03-04 1758-04-28
##  6     6      John Quincy Adams 1825-03-04 1829-03-04 1767-07-11
##  7     7         Andrew Jackson 1829-03-04 1837-03-04 1767-03-15
##  8     8       Martin Van Buren 1837-03-04 1841-03-04 1782-12-05
##  9     9 William Henry Harrison 1841-03-04 1841-04-06 1773-02-09
## 10    10             John Tyler 1841-04-06 1845-03-04 1790-03-29
## # ... with 35 more rows, and 2 more variables: death <date>,
## #   died_in_office <lgl>

Then I needed to unpivot all of the “events” into a format more meaningful for the plot I was mimicking.

(pres <-
  pres %>%
  gather(event, date, term_start:death))
## # A tibble: 180 x 5
##        n              president died_in_office      event       date
##    <int>                  <chr>          <lgl>      <chr>     <date>
##  1     1      George Washington          FALSE term_start 1789-04-30
##  2     2             John Adams          FALSE term_start 1797-03-04
##  3     3       Thomas Jefferson          FALSE term_start 1801-03-04
##  4     4          James Madison          FALSE term_start 1809-03-04
##  5     5           James Monroe          FALSE term_start 1817-03-04
##  6     6      John Quincy Adams          FALSE term_start 1825-03-04
##  7     7         Andrew Jackson          FALSE term_start 1829-03-04
##  8     8       Martin Van Buren          FALSE term_start 1837-03-04
##  9     9 William Henry Harrison           TRUE term_start 1841-03-04
## 10    10             John Tyler          FALSE term_start 1841-04-06
## # ... with 170 more rows

Finally, I needed the age of every president for each of his “events” (by joining their birthdays onto every row), a variable indicating whether the president has died, and a default value for events that havne’t happened yet.

birth <-
  pres %>%
  filter(event == "birth") %>%
  select(president, birthday = date)

(pres <-
  left_join(pres, birth, by = "president") %>%
  mutate(
    dead = if_else(is.na(date), FALSE, TRUE),
    date = if_else(is.na(date), Sys.Date(), date),
    age = (date - birthday) / 365.25
  ))
## # A tibble: 188 x 8
##        n              president died_in_office      event       date
##    <int>                  <chr>          <lgl>      <chr>     <date>
##  1     1      George Washington          FALSE term_start 1789-04-30
##  2     2             John Adams          FALSE term_start 1797-03-04
##  3     3       Thomas Jefferson          FALSE term_start 1801-03-04
##  4     4          James Madison          FALSE term_start 1809-03-04
##  5     5           James Monroe          FALSE term_start 1817-03-04
##  6     6      John Quincy Adams          FALSE term_start 1825-03-04
##  7     7         Andrew Jackson          FALSE term_start 1829-03-04
##  8     8       Martin Van Buren          FALSE term_start 1837-03-04
##  9     9 William Henry Harrison           TRUE term_start 1841-03-04
## 10    10             John Tyler          FALSE term_start 1841-04-06
## # ... with 178 more rows, and 3 more variables: birthday <date>,
## #   dead <lgl>, age <time>

Plotting

I was ready to plot. The first thing I wanted to do was visualize the post-inauguration lifetimes of the presidents. I figured I could do this with two layers:

  1. A line between two points for every president: term_start and death.
  2. A point at (date, age) for all president who have died.
(p <-
  ggplot() +
  geom_line(
    data = filter(pres, event %in% c("term_start", "death")),
    aes(date, age, group = n), color = "grey"
  ) +
  geom_point(
    data = filter(pres, event == "death", dead),
    aes(date, age), color = "grey"
  ) +
  labs(x = NULL, y = "Age"))

plot of chunk plotting

Finally, I wanted to visualize the political lives of the presidents in a different color. I needed two more layers for this, with just about the same definitions:

  1. A line between two points for every president: term_start and term_end.
  2. A point at (date, age) for all president who died in office.
p +
  geom_line(
    data = filter(pres, event %in% c("term_start", "term_end")),
    aes(date, age, group = n), color = "red"
  ) +
  geom_point(
    data = filter(pres, event == "death", died_in_office),
    aes(date, age), color = "red"
  )

plot of chunk unnamed-chunk-11

Ta-da. The plot could certainly use some labels to spruce it up a bit more, but I’ll forego that here for the sake of time.