Introducing: subtitler 0.1.0

Written by Ben Cunningham · on March 5, 2017

Since the release of tidytext, I’ve become much more interested in working with semi-structured text. Most of the language I want to analyze, though, comes from my favorite movies and television shows. Subtitles, widely available online for free, provide an obvious bridge to that language data.

When I failed to find a package on CRAN for managing subtitles, I decided to build my own. Today I’m announcing subtitler 0.1.0, my first stable release of the tool.

What is it?

My philosophy for subtitler is to provide minimal tools for working with common subtitle formats, overlapping the strengths of other tidy and text mining packages as little as possible.

This release supports reading and writing SubRip (.srt) format files. SubRip is by far the most commonly used subtitle format freely available online. It is highly structured and therefore lends itself to the tidy data frame paradigm.

Here’s an example snippet from a SubRip file:

1
00:00:00,978 --> 00:00:02,539
Frank, pick up!

2
00:00:02,646 --> 00:00:04,414
Pick up, buddy, pick up, pick
up, pick up, pick up, pick up!

Using subtitler, these captions can easily be transformed into a data frame, as shown in the glimpse below.

## Observations: 2
## Variables: 4
## $ index <int> 1, 2
## $ start <chr> "00:00:00,978", "00:00:02,646"
## $ end   <chr> "00:00:02,539", "00:00:04,414"
## $ text  <chr> "Frank, pick up!", "Pick up, buddy, pick up, pick\nup, p...
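To make the mapping from file to data frame concrete, here is a minimal base R sketch of how SubRip text could be split into those four columns. The helper name parse_srt_text is hypothetical and is not part of subtitler; the package's own reader may work quite differently and handle edge cases this sketch ignores.

```r
# Hypothetical helper: parse SubRip text into a data frame (base R only).
# An illustration of the idea, not subtitler's actual implementation.
parse_srt_text <- function(lines) {
  # Blank lines separate caption blocks, so a running count of blank
  # lines gives every non-blank line the id of the block it belongs to
  is_blank <- trimws(lines) == ""
  block_id <- cumsum(is_blank)[!is_blank]
  blocks <- split(lines[!is_blank], block_id)

  rows <- lapply(blocks, function(b) {
    # Line 1 is the index, line 2 the time range, the rest is caption text
    times <- strsplit(b[2], " --> ", fixed = TRUE)[[1]]
    data.frame(
      index = as.integer(b[1]),
      start = times[1],
      end   = times[2],
      text  = paste(b[-(1:2)], collapse = "\n"),
      stringsAsFactors = FALSE
    )
  })

  do.call(rbind, unname(rows))
}

snippet <- c(
  "1",
  "00:00:00,978 --> 00:00:02,539",
  "Frank, pick up!",
  "",
  "2",
  "00:00:02,646 --> 00:00:04,414",
  "Pick up, buddy, pick up, pick",
  "up, pick up, pick up, pick up!"
)

captions <- parse_srt_text(snippet)
str(captions)
```

Multi-line captions are joined with a newline, which matches the "\n" visible in the text column above.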

What can it do?

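At least for now, the package only supports the SubRip format. The main functions used in this post are read_srt() and write_srt() for reading and writing .srt files, and as_milliseconds() and as_timestamp() for converting timestamps to and from milliseconds.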

Installation

For now, the package is only available via GitHub. Install it using devtools as follows:

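# Install from GitHub if subtitler is missing or older than 0.1.0
if (!requireNamespace("subtitler", quietly = TRUE) ||
    packageVersion("subtitler") < "0.1.0") {
  devtools::install_github("benjcunningham/subtitler")
}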

library(subtitler)

Quick Examples

library(tidyverse)
library(tidytext)
library(stringr)

Synchronizing Subtitles

Imagine you have a subtitle file that is always half a second ahead. You could read it in, adjust the times, and write it back out like this:

df <- read_srt("data/charlie_work.srt")

df %>%
  mutate(
    start = as_milliseconds(start) %>% `+`(500) %>% as_timestamp(),
    end   = as_milliseconds(end)   %>% `+`(500) %>% as_timestamp()
  ) %>%
  write_srt("data/new_charlie_work.srt")
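The shift above is plain arithmetic on milliseconds. As a rough base R sketch of that round trip, assuming timestamps always have the form HH:MM:SS,mmm: the helper names ts_to_ms and ms_to_ts are hypothetical stand-ins, and subtitler's own as_milliseconds() and as_timestamp() may be implemented differently.

```r
# Hypothetical helpers illustrating the arithmetic only; they are not
# subtitler functions and assume well-formed "HH:MM:SS,mmm" input.
ts_to_ms <- function(ts) {
  # Split each timestamp into hours, minutes, seconds, milliseconds
  parts <- strsplit(ts, "[:,]")
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] * 3600000 + p[2] * 60000 + p[3] * 1000 + p[4]
  }, numeric(1))
}

ms_to_ts <- function(ms) {
  # Peel off each unit with integer division and remainders
  h    <- ms %/% 3600000
  m    <- (ms %% 3600000) %/% 60000
  s    <- (ms %% 60000) %/% 1000
  frac <- ms %% 1000
  sprintf("%02d:%02d:%02d,%03d",
          as.integer(h), as.integer(m), as.integer(s), as.integer(frac))
}

# Delay two captions by half a second; the second one rolls over the hour
shifted <- ms_to_ts(ts_to_ms(c("00:00:00,978", "00:59:59,900")) + 500)
print(shifted)
```

Working in milliseconds means carries across seconds, minutes, and hours are handled by the conversion back to a timestamp rather than by hand.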

Catalog of Tarantino Curses

I have also used subtitler for simple, but slightly more interesting, analyses. Using a simplified list of curses, I was able to reproduce part of this article by Oliver Roeder of FiveThirtyEight, cataloguing all of the times someone swore in one of Quentin Tarantino’s movies.

srt <- list.files("data/srt", full.names = TRUE)

bad <- read_csv("data/bad_words.csv")
map <- read_csv("data/file_mapping.csv")

df <- map_df(srt, function(x) {
  
  message(x)
  
  read_srt(x) %>%
    mutate(file = str_extract(x, "[^/]*$"))
  
})

df %>%
  unnest_tokens(word, text) %>%
  filter(str_detect(word, paste(bad$word, collapse = "|"))) %>%
  left_join(map, by = "file") %>%
  mutate(
    min = floor(as_milliseconds(start) / 60000),
    film = factor(film, levels = map$film)
  ) %>%
  group_by(film, min) %>%
  summarize(count = n()) %>%
  ggplot(aes(min, count)) +
    geom_col() +
    facet_wrap(~ film, ncol = 1) +
    labs(x = "Minute", y = "Profanities") +
    scale_x_continuous(breaks = seq(0, 180, 60), limits = c(0, 180)) +
    scale_y_continuous(breaks = seq(0, 20, 10), limits = c(0, 20))

Figure: profanities per minute of runtime, one panel per film.
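One caveat about the filter step: collapsing the word list into a single pattern with "|" matches anywhere inside a token, so it catches variants of a word but can also produce false positives. The base R sketch below shows the difference between that loose pattern and an anchored one. The word list and tokens are made up for illustration and are not taken from bad_words.csv.

```r
# Illustrative word list and tokens (made up for this sketch)
bad_words <- c("damn", "hell")
tokens <- c("damned", "hello", "shell", "hell", "world")

# Same idea as the str_detect() filter: join the words into one
# alternation pattern, which matches anywhere inside a token
loose <- paste(bad_words, collapse = "|")
loose_hits <- tokens[grepl(loose, tokens)]

# Anchoring the pattern restricts matches to whole tokens only
strict <- paste0("^(", loose, ")$")
strict_hits <- tokens[grepl(strict, tokens)]

print(loose_hits)
print(strict_hits)
```

Which behavior is preferable depends on the word list: the loose pattern suits a short list of stems, while the anchored one suits a list that already spells out every variant.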

What’s next?

Hopefully you can find some use for subtitler too. If you are interested in contributing, feel free to open a pull request or submit an issue on GitHub. I don’t often collaborate with others on software, so all feedback on the project is welcome.