The goal of srt is to read SubRip text files as tabular data for easy analysis and manipulation.

Installation

You can install the development version of srt from GitHub with:

# install.packages("remotes")
remotes::install_github("kiernann/srt")

Example

The SubRip (.srt) format gives every subtitle four components, which map onto the columns of a data frame:

  1. A numeric counter identifying each sequential subtitle
  2. The time that the subtitle should appear followed by --> and the time it should disappear
  3. The subtitle text itself on one or more lines
  4. A blank line containing no text, indicating the end of this subtitle

library(srt)
library(tidyverse)
library(tidytext)
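# srt_example() returns the path to a bundled example file
srt <- srt_example()
# print the first three subtitles of the raw file
writeLines(readLines(srt, n = 11))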
#> 1
#> 00:01:25,210 --> 00:01:28,004
#> I owe everything to George Bailey.
#> 
#> 2
#> 00:01:28,422 --> 00:01:30,298
#> Help him, dear Father.
#> 
#> 3
#> 00:01:30,674 --> 00:01:33,718
#> Joseph, Jesus and Mary,

These subtitle files are parsed as data frames with separate columns.

(wonderful_life <- read_srt(path = srt, collapse = " "))
#> # A tibble: 2,268 x 4
#>        n start   end subtitle                           
#>    <int> <dbl> <dbl> <chr>                              
#>  1     1  85.2  88.0 I owe everything to George Bailey. 
#>  2     2  88.4  90.3 Help him, dear Father.             
#>  3     3  90.7  93.7 Joseph, Jesus and Mary,            
#>  4     4  93.8  96.4 help my friend Mr. Bailey.         
#>  5     5  96.9  99.5 Help my son George tonight.        
#>  6     6 100.  102.  He never thinks about himself, God.
#>  7     7 102.  104.  That's why he's in trouble.        
#>  8     8 104.  105.  George is a good guy.              
#>  9     9 106.  108.  Give him a break, God.             
#> 10    10 108.  110.  I love him, dear Lord.             
#> # … with 2,258 more rows
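
The start and end columns hold each timestamp as a number of seconds. As a rough illustration of that conversion, here is a minimal base R sketch. The helper `srt_seconds()` is hypothetical and not part of the package, which may parse timestamps differently.

```r
# Hypothetical helper (not part of srt): convert "HH:MM:SS,mmm" to seconds.
srt_seconds <- function(x) {
  # Split on ":" and "," to get hours, minutes, seconds, milliseconds.
  parts <- strsplit(x, "[:,]")
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] * 3600 + p[2] * 60 + p[3] + p[4] / 1000
  }, numeric(1))
}

# Timestamps taken from the first subtitle of the example file.
srt_seconds(c("00:01:25,210", "00:01:28,004"))
```

The two values should agree with the first row of the tibble above (about 85.2 and 88.0 seconds).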

This makes it easy to perform various text analyses on the subtitles.

wonderful_life %>% 
  unnest_tokens(word, subtitle) %>% 
  count(word, sort = TRUE) %>% 
  # drop common stop words; joining explicitly on the shared column
  anti_join(stop_words, by = "word")
#> # A tibble: 1,651 x 2
#>    word       n
#>    <chr>  <int>
#>  1 george   216
#>  2 mary      85
#>  3 bailey    74
#>  4 hey       56
#>  5 harry     53
#>  6 yeah      50
#>  7 gonna     45
#>  8 potter    45
#>  9 home      34
#> 10 money     34
#> # … with 1,641 more rows

Or uniformly manipulate the numeric time stamps:

wonderful_life <- srt_shift(wonderful_life, seconds = 9.99)
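
Conceptually, a uniform shift adds the same constant to both time columns. The sketch below illustrates that idea on a toy data frame built from the first two rows shown earlier. The helper `shift_times()` is hypothetical, and this is an assumption about the behaviour rather than the package's actual implementation.

```r
# Toy data frame mirroring the columns of the parsed subtitles.
subs <- data.frame(
  n = 1:2,
  start = c(85.210, 88.422),
  end = c(88.004, 90.298),
  subtitle = c("I owe everything to George Bailey.", "Help him, dear Father.")
)

# Hypothetical helper (not part of srt): move every subtitle by `seconds`.
shift_times <- function(x, seconds) {
  x$start <- x$start + seconds
  x$end <- x$end + seconds
  x
}

shifted <- shift_times(subs, seconds = 9.99)
shifted[, c("n", "start", "end")]
```

A negative value for `seconds` would move the subtitles earlier instead.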

The subtitle data frames can then be written back out as valid SubRip files.

tmp <- tempfile(fileext = ".srt")
write_srt(wonderful_life, tmp, wrap = FALSE)
# print the first three subtitles of the shifted file
writeLines(readLines(tmp, n = 11))
#> 1
#> 00:01:35,200 --> 00:01:37,994
#> I owe everything to George Bailey.
#> 
#> 2
#> 00:01:38,412 --> 00:01:40,288
#> Help him, dear Father.
#> 
#> 3
#> 00:01:40,664 --> 00:01:43,708
#> Joseph, Jesus and Mary,
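
Writing the file means turning the numeric seconds back into `HH:MM:SS,mmm` timestamps. A minimal base R sketch of that step follows. The helper `srt_timestamp()` is hypothetical and only illustrates the arithmetic; it is not the package's own code.

```r
# Hypothetical helper (not part of srt): format seconds as "HH:MM:SS,mmm".
srt_timestamp <- function(x) {
  # Work in whole milliseconds; rounding guards against floating point drift.
  ms <- round(x * 1000)
  h <- ms %/% 3600000
  m <- (ms %% 3600000) %/% 60000
  s <- (ms %% 60000) %/% 1000
  sprintf(
    "%02d:%02d:%02d,%03d",
    as.integer(h), as.integer(m), as.integer(s), as.integer(ms %% 1000)
  )
}

# Shifted start and end of the first subtitle (85.21 + 9.99 and 88.004 + 9.99).
paste(srt_timestamp(95.2), "-->", srt_timestamp(97.994))
```

The resulting line should match the first time range in the output above.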