I have some txt files that were originally srt files, the format in which subtitles are usually published.
They usually follow this pattern:
Subtitle_number
Beginning_min --> Ending_min
Text
As an example, this might be the structure of an srt file:
1
00:00:00,100 --> 00:00:01,500
This is the first subtitle

2
00:00:01,700 --> 00:00:02,300
of the movie
Now, I have some "modified" srt files, which differ from normal ones in that the name of the character appears right after the subtitle number. Here is an example:
1 Matt
00:00:00,100 --> 00:00:01,500
This is said by Matt

2 Lucas
00:00:01,700 --> 00:00:02,300
While this is said by Lucas
What I would like to do is parse these files in order to create a data.frame like the following:
+---------------------------------------------+
| CHARACTER    | TEXT                         |
|--------------+------------------------------|
| Matt         | This is said by Matt         |
|--------------+------------------------------|
| Lucas        | While this is said by Lucas  |
+---------------------------------------------+
So, I do not want the number or the timestamps of the subtitle.
I have been able to read the text with the readtext library, resulting in something like this:
1 Matt\n00:00:00,100 --> 00:00:01,500\nThis is said by Matt.\n\n2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas
Note that there might also be \n inside the texts, as well as any other (readable) character.
Here is where I am stuck. I guess I would have to use some kind of regex to extract all the names and then all the texts, but I have no clue how to do this.
Any help is highly appreciated!
Here is a step-by-step way to do this without regex. It's a bit sloppy, but it's meant to show the logic of how to approach a file like this. The end result is a data frame from which you can grab the info you want.
txt <- "1 Matt
00:00:00,100 --> 00:00:01,500
This is said by Matt

2 Lucas
00:00:01,700 --> 00:00:02,300
While this is said by Lucas\nand another line

3
00:00:01,700 --> 00:00:02,300
While this is said by nobody"
library(readr)
library(tidyr)
library(tibble)
library(dplyr)
library(purrr)
df <- tibble(txt = read_lines(txt))
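df %>%
  rowid_to_column("row") %>%
  # A blank line starts a new subtitle block
  group_by(group = cumsum(txt == "")) %>%
  filter(!(txt == "")) %>%
  # Field 1 = number/name, 2 = timestamps, 3 = every remaining text line
  mutate(field = pmin(row_number(), 3)) %>%
  group_by(group, field) %>%
  # Glue multi-line text back together within each block
  summarize(txt = paste(txt, collapse = "\n"), .groups = "drop") %>%
  pivot_wider(names_from = "field",
              values_from = "txt") %>%
  select(-group) %>%
  set_names(c("Col1", "Col2", "Col3")) %>%
  # Split "1 Matt" into number and name; a missing name becomes NA
  separate(Col1, c("Col1A", "Col1B"), extra = "merge", fill = "right")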
And you get this data frame. You can name things whatever you want, of course.
# A tibble: 3 x 4
  Col1A Col1B Col2                          Col3
  <chr> <chr> <chr>                         <chr>
1 1     Matt  00:00:00,100 --> 00:00:01,500 "This is said by Matt"
2 2     Lucas 00:00:01,700 --> 00:00:02,300 "While this is said by Lucas\nand another line"
3 3     NA    00:00:01,700 --> 00:00:02,300 "While this is said by nobody"
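To get from this result to the two-column layout asked for in the question, one option is to keep only the name and text columns and rename them. This is a minimal sketch: the `parsed` tibble is a hand-built stand-in for the data frame shown above (an assumption made so the snippet runs on its own), and the names CHARACTER and TEXT are taken from the table in the question.

```r
library(dplyr)
library(tibble)

# Hand-built stand-in for the data frame printed above (illustration only).
parsed <- tibble(
  Col1A = c("1", "2", "3"),
  Col1B = c("Matt", "Lucas", NA),
  Col2  = c("00:00:00,100 --> 00:00:01,500",
            "00:00:01,700 --> 00:00:02,300",
            "00:00:01,700 --> 00:00:02,300"),
  Col3  = c("This is said by Matt",
            "While this is said by Lucas\nand another line",
            "While this is said by nobody")
)

# Keep only the character name and the text, renaming both on the way.
result <- parsed %>%
  select(CHARACTER = Col1B, TEXT = Col3)

result
```

Subtitles without a character name keep NA in CHARACTER; adding `filter(!is.na(CHARACTER))` would drop them if they are not wanted.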
Here is a more streamlined way using a bit of tidyverse.
library(tidyr)
library(dplyr)
tibble(txt = txt) %>%
  separate_rows(txt, sep = "\\n\\n") %>%
  separate(txt, c("A", "B", "C"), sep = "\n", extra = "merge") %>%
  separate(A, c("A1", "B2"), extra = "merge", fill = "right")
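Since the question mentions regex, the same extraction could also be sketched in base R with a single pattern per block. This is not the approach used above, only an alternative under stated assumptions: blocks are separated by exactly one blank line, the text itself contains no blank lines, and each block is a number (optionally followed by a name), a timestamp line containing `-->`, and then the text. The function name `parse_srt_names` is hypothetical.

```r
# Hypothetical helper: parse a "modified" srt string into CHARACTER and TEXT.
# Assumes blocks are separated by a blank line and the text has no blank lines.
parse_srt_names <- function(txt) {
  # One element per subtitle block
  blocks <- strsplit(txt, "\n\n", fixed = TRUE)[[1]]
  blocks <- blocks[nzchar(trimws(blocks))]

  # (?s) lets "." match newlines so the last group can span several lines.
  # Group 1: optional name after the number; group 2: text after the timestamps.
  pattern <- "(?s)^\\s*\\d+[ \\t]*([^\\n]*)\\n[^\\n]*-->[^\\n]*\\n(.*)$"
  m <- regmatches(blocks, regexec(pattern, blocks, perl = TRUE))

  # Drop blocks that did not match the expected layout
  m <- m[lengths(m) == 3]

  name <- trimws(vapply(m, function(x) x[2], character(1)))
  text <- vapply(m, function(x) x[3], character(1))

  data.frame(
    CHARACTER = ifelse(nzchar(name), name, NA_character_),
    TEXT = text,
    stringsAsFactors = FALSE
  )
}

# Example input in the shape shown in the question
raw <- paste0(
  "1 Matt\n00:00:00,100 --> 00:00:01,500\nThis is said by Matt\n\n",
  "2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas\nand another line\n\n",
  "3\n00:00:01,700 --> 00:00:02,300\nWhile this is said by nobody"
)
parsed_regex <- parse_srt_names(raw)
parsed_regex
```

Blocks that do not fit the assumed layout are silently skipped here; a stricter version might warn about them instead.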