Ingesting Excel Sheets Fast | Ben Cunningham

Ingesting Excel Sheets Fast

Written by Ben Cunningham · on September 2, 2017

I helped instruct an intro to R course at work last fall. One of the most common questions I’ve received from attendees since that time is, “How do I read a bunch of Excel files really fast?” More precisely what this usually means is, “How do I append a bunch of worksheets with as little code as possible?”

This isn’t a use case for every analyst, but it definitely is for most of us in client services. It’s very common for my colleagues to receive a refresh of the same Excel workbook over and over at some regular interval but still need to use all of the historical files for their work. Just as often, those same workbooks will be composed of multiple worksheets with the same structure, but segmented across some variable.

The following “guide” is my usual response to these questions. It’s fairly DRY (I think this is the most import part here) but it’s also hopefully not so optimized that it’s no longer extensible as their needs evolve.

Identifying Files

Your first task involves identifying all of the files that should be read in. The most robust method uses list.files() to return a vector of all files in a directory. If your directory has many other unrelated files, you can additionally provide a regular epression via the pattern argument; this will only return files that match your pattern.

f <- list.files("data", full.names = TRUE, pattern = "diamonds-.+\\.xlsx$")
print(f)
## [1] "data/diamonds-ideal.xlsx"   "data/diamonds-premium.xlsx"

If you aren’t comfortable with this method, you can alternatively create a character vctor with th epaths to your files, as seen below.

f <- c("data/diamonds-ideal.xlsx", "data/diamonds-premium.xlsx")

Reading Multiple Files

We’ll now use the map_df() function from the purrr package to quickly apply a function to read in each file in our vector as a data frame. The result will be a single data frame composed of the first sheet in every file; behind the scenes, map_df() reads in every one of these sheets as a data frame and then combines them all into a single result. Note that this method will assure the columns will be aligned by their name (a pitfall of other common non-R methods of row binding).

library(tidyverse)
library(readxl)

full_df <- map_df(f, function(wb) {
  read_excel(wb, sheet = 1)
})

print(full_df)
## # A tibble: 4 x 3
##   carat     cut color
##   <dbl>   <chr> <chr>
## 1  0.23   Ideal     E
## 2  0.26   Ideal     E
## 3  0.21 Premium     E
## 4  0.23 Premium     E

We haven’t completely solved our problem yet, though. Again, here we only read in the first sheet in each of these files. What if there are multiple worksheets?

Reading Multiple Worksheets

With only a few modifications, we can similarly handle situations where we have multiple worksheets in each workbook. To achieve our end, we can just nest another map_df() inside of the existing one that will handle iterating over the worksheets in each workbook.

full_df <- map_df(f, function(wb) {
  s <- excel_sheets(wb)
  map_df(s, function(ws) {
    read_excel(wb, sheet = ws) %>%
      mutate(source_wb = wb, source_ws = ws)
  })
})

print(full_df)
## # A tibble: 7 x 5
##   carat     cut color                  source_wb source_ws
##   <dbl>   <chr> <chr>                      <chr>     <chr>
## 1  0.23   Ideal     E   data/diamonds-ideal.xlsx   color_e
## 2  0.26   Ideal     E   data/diamonds-ideal.xlsx   color_e
## 3  0.31   Ideal     J   data/diamonds-ideal.xlsx   color_j
## 4  0.24   Ideal     J   data/diamonds-ideal.xlsx   color_j
## 5  0.21 Premium     E data/diamonds-premium.xlsx   color_e
## 6  0.23 Premium     E data/diamonds-premium.xlsx   color_e
## 7  0.29 Premium     I data/diamonds-premium.xlsx   color_i

Take special note the mutate() call in the chunk above that adds a column for the origin workbook and worksheet of every observation. While not strictly necessary, it is often a good idea to include this to help track down the source of any data quality issues you may encounter.