
Analyzing Podcast Sentiment

Written by Ben Cunningham · on July 31, 2016

Natural language processing, especially opinion mining, is (apparently) hot right now. It isn’t something that I’ve paid attention to in the past, but over the last six months, I’ve noticed more and more people talking about it. More specifically, I’ve heard more and more non-technical people asking about it (which I think is more telling of a rising trend).

I’ve previously read Matthew Jockers’s Text Analysis with R for Students of Literature, but what really got me interested in trying out a few demos was the recently released tidytext R package and accompanying series of blog posts by David Robinson and Julia Silge. Both developers keep awesome data science journals, but their latest focus on text analysis really made me want to try out their workflow on some data of my own.

Since I don’t get to do radio anymore, I thought I’d try my hand at radio-focused data science. This American Life is one of my favorite shows (isn’t that true for everyone?) and they post pretty spotless transcripts of every episode, so I pulled down their archive as a starting spot. If you want to see exactly how I did it, check that out here, or feel free to follow along with the cleaned-up data frame here.

Here’s what the first few rows of that data look like:

library(dplyr)
library(stringr)

favorites <- c(550, 513, 360, 355, 241, 218, 206, 175, 109)

raw_tal <-
  readRDS("tal-transcripts.rds") %>%
  filter(episode %in% favorites, !str_detect(line, "^\\[.*\\]$"))

head(raw_tal)
## # A tibble: 6 x 5
##   episode         title       date        time
##     <int>         <chr>     <date>       <chr>
## 1     109 Notes on Camp 1998-08-28 00:00:00.52
## 2     109 Notes on Camp 1998-08-28 00:00:12.39
## 3     109 Notes on Camp 1998-08-28 00:00:13.49
## 4     109 Notes on Camp 1998-08-28 00:00:15.72
## 5     109 Notes on Camp 1998-08-28 00:00:17.84
## 6     109 Notes on Camp 1998-08-28 00:00:31.07
## # ... with 1 more variables: line <chr>

And here’s what the respective spoken lines look like:

head(raw_tal$line)
## [1] "It's a typical camp, all the normal activities-- canoeing, archery, sports. There are two girls who everybody calls the homesick girls, who will cheerfully identify themselves that way to anybody they meet."                                   
## [2] "She wants to go home."                                                                                                                                                                                                                            
## [3] "I do. I swear to God, I want to go home."                                                                                                                                                                                                         
## [4] "I want to go home too."                                                                                                                                                                                                                           
## [5] "Because it's a pretty upscale camp, there's also horseback riding. And on the rifle range, suburban girls from Chicago's North Shore lay on their bellies in sniper position while a real former Israeli soldier barks orders about how to shoot."
## [6] "I know it's really hard for you ladies to keep it quiet, but try, OK? Aim and fire. [GUNSHOTS]"

Before I go further, a few housekeeping notes: as much as I’d love to use every episode in my analysis, Ira Glass’s two-decade head start on me makes the full transcript archive a bit too big for a blog post. Instead, I decided to filter down to a few classic episodes, as defined in the producers’ favorites list. The str_detect() in the pipeline above additionally strips out all of the one-line theatrical cues ([Laughing], [Talking in Spanish], etc.).
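If that regex looks cryptic, it just flags lines made up entirely of bracketed text, which is how the cues are formatted; the ! in the filter then drops them. A quick illustration using a couple of lines like the ones above:

cues <- c("[Laughing]", "[Talking in Spanish]", "She wants to go home.")
str_detect(cues, "^\\[.*\\]$")
## [1]  TRUE  TRUE FALSE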

At the end of this post, I’d like to visualize a sort of timeline of each episode, charting how the “mood” of the show changes minute by minute. With tidytext, this is actually pretty simple.

library(tidyr)
library(tidytext)

tal <-
  raw_tal %>%
  group_by(episode) %>%
  mutate(
    linenumber = row_number(),
    # Pull the hour and minute out of each timestamp, then re-express
    # the timestamp as minutes into the episode
    hour = as.integer(str_extract(time, "^\\d{2}")),
    minute = as.integer(str_extract(time, "(?<=\\d{2}:)\\d{2}")),
    time = hour * 60 + minute
  ) %>%
  ungroup() %>%
  select(episode, title, time, line) %>%
  # Split each spoken line into one-word-per-row tokens, then drop stop
  # words and any token that isn't lowercase letters (plus apostrophes)
  unnest_tokens(word, line) %>%
  filter(!(word %in% stop_words$word), str_detect(word, "^[a-z']+$"))

The pipeline above combines the two things we need to do before getting to the so-called sentiment stuff: general data wrangling and tokenizing. Since we’d like the x-axis to be in minutes, we’ll extract that component of each line’s timestamp. Finally, we break each line down into its individual word elements, stripping out stop words and non-alpha tokens.
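If you haven’t seen unnest_tokens() before, here’s roughly what that last step does to a single (borrowed) line, before the stop word filter kicks in. The toy tibble below is just for illustration:

library(tibble)

toy <- tibble(time = 0, line = "I want to go home too.")

# unnest_tokens() lowercases and strips punctuation, returning one row per
# token (i, want, to, go, home, too), with `time` repeated on each row
unnest_tokens(toy, word, line)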

Now we can turn our attention to picking a method for “measuring” mood. A common way to do this is by using a “sentiment lexicon”, a dictionary of sorts in which words are assessed for their positive or negative connotations. The dictionary we’ll use in this post is called AFINN (tidytext comes with a copy of it).

# Keep just the word/score pairs from the AFINN lexicon
dictionary <-
  filter(sentiments, lexicon == "AFINN") %>%
  select(-sentiment, -lexicon)

Let’s take a closer look at the AFINN dictionary to get an idea of how we can use it. Unlike some sentiment lexicons, AFINN assigns words a score ranging from -5 (extremely negative) to 5 (extremely positive). Here are some examples to make this clear:

dictionary %>%
  filter(word %in% c("terrible", "happy", "dull", "normal", "fantastic"))
## # A tibble: 4 x 2
##        word score
##       <chr> <int>
## 1      dull    -2
## 2 fantastic     4
## 3     happy     3
## 4  terrible    -3

Hopefully it’s not too difficult to see what’s going on here. However, we should notice that “normal” wasn’t included in the results; this is our first reminder that sentiment analysis isn’t a free lunch. I’ll talk more later about things to watch out for when conducting more serious explorations.
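One way to get a feel for how much the dictionary misses is to check what share of our (already stop-word-filtered) tokens have no AFINN score at all, and which unscored words come up most often. A rough sketch (not part of the pipeline above):

# What fraction of tokens never match the AFINN dictionary?
unmatched <- anti_join(tal, dictionary, by = "word")
nrow(unmatched) / nrow(tal)

# The most common words that go unscored
unmatched %>%
  count(word, sort = TRUE)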

All the same, we’re now just one join away from “scoring” every word in our collection of transcripts. We can finish up and prep the data for plotting by calculating the mean score for each time index in each episode.

tal_sentiment <-
  inner_join(tal, dictionary, by = "word") %>%
  group_by(episode, title, time) %>%
  summarize(sentiment = mean(score))

Voilà! Let’s visualize our work:

library(ggplot2)

tal_sentiment %>%
  ungroup() %>%
  mutate(episode = factor(title, levels = unique(title))) %>%
  ggplot(aes(time, sentiment, fill = episode)) +
    geom_bar(stat = "identity", show.legend = FALSE) +
    facet_wrap(~ episode, ncol = 3) +
    scale_y_continuous(limits = c(-5, 5), breaks = seq(-5, 5, 5),
                       minor_breaks = -4:4) +
    scale_x_continuous(limits = c(0, 75), breaks = seq(0, 75, 15),
                       minor_breaks = NULL) +
    labs(x = "Time (Minutes)", y = "Mean AFINN Sentiment Score")

[Plot: mean AFINN sentiment score by minute, faceted by episode]

I really like looking at these plots. I’ve listened to most of these episodes, and since first creating this viz, I’ve listened to a few more while following along with the respective sentiment timeline. It’s fun.

Still, I’m wary of confounding a clean process and pretty results with a robust analysis. I think I’ve overlooked some breaking points along the way, and I want to be sure to outline a few of my thoughts:
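For instance, one sanity check worth doing (sketched below, not part of the analysis above) would be to look at which words contribute most to each episode’s overall score, since a per-minute mean can be swung by just a few heavily weighted words:

# Which scored words drive each episode's sentiment the most?
inner_join(tal, dictionary, by = "word") %>%
  count(episode, word, score) %>%
  mutate(contribution = n * score) %>%
  group_by(episode) %>%
  top_n(5, abs(contribution)) %>%
  arrange(episode, desc(abs(contribution)))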

Again, these are just some initial thoughts after working through this demo. David and Julia have created a great tool with tidytext and I’m glad it made the technical work easy enough that I was able to think a bit more about some of the assumptions that (ought to) go into my work.