r regex text-parsing srt

Parse text into table with R


I have some txt files that were originally srt files, the format in which subtitles are usually published. They typically follow this pattern:

Subtitle_number
Beginning_min --> Ending_min
Text

As an example, this might be the structure of an srt file:

1
00:00:00,100 --> 00:00:01,500
This is the first subtitle

2
00:00:01,700 --> 00:00:02,300
of the movie

Now, I have some "modified" srt files, which differ from normal ones in that the character's name comes right after the subtitle number. Here is an example:

1 Matt
00:00:00,100 --> 00:00:01,500
This is said by Matt

2 Lucas
00:00:01,700 --> 00:00:02,300
While this is said by Lucas

What I would like to do is parse these files to create a data.frame like the following:

+---------------------------------------------+
| CHARACTER    |  TEXT                        |
|--------------+------------------------------|
| Matt         |  This is said by Matt        | 
|--------------+------------------------------|
| Lucas        |  While this is said by Lucas |
+---------------------------------------------+

So, I do not want the subtitle number or the timestamps. I have been able to read the text with the readtext package, which gives something like this:

1 Matt\n00:00:00,100 --> 00:00:01,500\nThis is said by Matt.\n\n2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas

Note that the subtitle text itself may also contain \n, as well as any other (readable) character.

Here is where I am stuck. I guess I would have to use some kind of regex to extract all the names and then all the texts, but I have no clue how to do this.

Any help is highly appreciated!


Solution

  • Here is a step-by-step way to do this without regex. It's a bit sloppy, but it's meant to show the logic of how to approach a file like this. The end result is a data frame from which you can grab the info you want.

    txt <- "1 Matt
    00:00:00,100 --> 00:00:01,500
    This is said by Matt
    
    2 Lucas
    00:00:01,700 --> 00:00:02,300
    While this is said by Lucas\nand another line
    
    3
    00:00:01,700 --> 00:00:02,300
    While this is said by nobody"
    
    library(readr)
    library(tidyr)
    library(tibble)
    library(dplyr)
    library(purrr)
    
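    # One row per line of the input
    df <- tibble(txt = read_lines(txt))
    
    df %>% 
      rowid_to_column("row") %>% 
      # Blank lines separate subtitles, so a running count of them labels each block
      group_by(group = cumsum(txt == "")) %>% 
      filter(!(txt == "")) %>% 
      # Within a block: line 1 = number + name, line 2 = timing, lines 3+ = text
      mutate(field = pmin(row_number(), 3)) %>% 
      group_by(group, field) %>% 
      # Rejoin multi-line subtitle text into a single string
      summarize(txt = paste(txt, collapse = "\n"), .groups = "drop") %>% 
      pivot_wider(names_from = "field",
                  values_from = "txt") %>% 
      select(-group) %>% 
      set_names(c("Col1", "Col2", "Col3")) %>% 
      # Split "1 Matt" into number and name; a missing name becomes NA
      separate(Col1, c("Col1A", "Col1B"), extra = "merge", fill = "right")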
    

    And you get this data frame. You can name things whatever you want, of course.

    # A tibble: 3 x 4
      Col1A Col1B Col2                          Col3                                           
      <chr> <chr> <chr>                         <chr>                                          
    1 1     Matt  00:00:00,100 --> 00:00:01,500 "This is said by Matt"                         
    2 2     Lucas 00:00:01,700 --> 00:00:02,300 "While this is said by Lucas\nand another line"
    3 3     NA    00:00:01,700 --> 00:00:02,300 "While this is said by nobody"
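
To get from here to the two-column table asked for in the question, one option is to keep only the name and text columns and rename them. This is just a sketch: `parsed` is a hypothetical stand-in for the pipeline result, rebuilt by hand from the printed output above so the snippet runs on its own.

```r
library(dplyr)

# Hypothetical stand-in for the pipeline result (values copied from the printed tibble)
parsed <- data.frame(
  Col1A = c("1", "2", "3"),
  Col1B = c("Matt", "Lucas", NA),
  Col2  = c("00:00:00,100 --> 00:00:01,500",
            "00:00:01,700 --> 00:00:02,300",
            "00:00:01,700 --> 00:00:02,300"),
  Col3  = c("This is said by Matt",
            "While this is said by Lucas\nand another line",
            "While this is said by nobody"),
  stringsAsFactors = FALSE
)

# Keep and rename only the columns of interest; number and timing are dropped
result <- parsed %>%
  select(CHARACTER = Col1B, TEXT = Col3)

print(result)
```

Subtitles with no name keep `NA` in `CHARACTER`; whether to drop those rows depends on what you need.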
    

    EDIT

    Here is a more streamlined way using a bit of tidyverse.

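    library(tidyr)
    library(dplyr)
    
    tibble(txt = txt) %>% 
      # One row per subtitle block (blocks are separated by a blank line)
      separate_rows(txt, sep = "\\n\\n") %>% 
      # A = number + name, B = timing, C = text (extra lines stay merged in C)
      separate(txt, c("A", "B", "C"), sep = "\n", extra = "merge") %>% 
      # Split A into number (A1) and name (B2); a missing name becomes NA
      separate(A, c("A1", "B2"), extra = "merge", fill = "right")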
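
If you would rather avoid extra packages, the same idea can be sketched in base R with a small amount of regex. The function name `parse_srt_speakers` is made up for illustration. It assumes that blocks are separated by a blank line, that the first line of each block is the number plus an optional name, and that the second line is the timing. Text that itself contains a blank line would be split incorrectly.

```r
# Hypothetical helper: returns a data.frame with CHARACTER and TEXT columns
parse_srt_speakers <- function(txt) {
  # Normalise Windows line endings, then split into blocks on blank lines
  txt <- gsub("\r\n", "\n", txt, fixed = TRUE)
  blocks <- strsplit(trimws(txt), "\n[ \t]*\n")[[1]]

  rows <- lapply(blocks, function(block) {
    lines <- strsplit(block, "\n", fixed = TRUE)[[1]]
    # Line 1 is "<number> <name>": strip the leading number to get the name
    name <- trimws(sub("^\\s*\\d+", "", lines[1]))
    data.frame(
      CHARACTER = if (nzchar(name)) name else NA_character_,
      # Lines 3+ are the subtitle text; line 2 (the timing) is discarded
      TEXT = paste(lines[-(1:2)], collapse = "\n"),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

txt <- paste0(
  "1 Matt\n00:00:00,100 --> 00:00:01,500\nThis is said by Matt\n\n",
  "2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas\nand another line\n\n",
  "3\n00:00:01,700 --> 00:00:02,300\nWhile this is said by nobody"
)

res <- parse_srt_speakers(txt)
print(res)
```

Splitting each block into lines first keeps the regex down to one simple pattern (the leading number), which is easier to adapt than a single expression that tries to match the whole block.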