r string srt

How to combine rows, separated by returns, that start and end with specific characters?


I'm working with video transcript data. The data was automatically exported with a return mid-sentence. I'd like to combine the spoken lines into a single row. The data is formatted as such:

data <- data.frame(transcript = c("00:00:03.990 --> 00:00:05.270",
 "<v Bill>I'm here to take some notes. I've",
 "heard this will be interesting.</v>",
 "00:00:05.770 --> 00:00:07.370",
 "<v Charlie>I believe you'll be correct",
 "about that, Bill.</v>",
 "00:00:10.810 --> 00:00:11.170",
 "<v Bill>Awesome.</v>"))

Intended output:

intendedData <- data.frame(transcript = c("00:00:03.990 --> 00:00:05.270",
 "<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>",
 "00:00:05.770 --> 00:00:07.370",
 "<v Charlie>I believe you'll be correct about that, Bill.</v>",
 "00:00:10.810 --> 00:00:11.170",
 "<v Bill>Awesome.</v>"))

I've tried conditional statements for rows that start with <v and end with </v>, but that didn't work. Any ideas will be greatly appreciated. Thank you!


Solution

  • You could paste the transcript together as a single long string, then use regular expressions to extract the timestamps and speech. Personally, I would want to keep these as distinct variables, but if you want you can interleave them together to give the desired output:

    transcript <- c("00:00:03.990 --> 00:00:05.270",
                    "<v Bill>I'm here to take some notes. I've",
                    "heard this will be interesting.</v>",
                    "00:00:05.770 --> 00:00:07.370",
                    "<v Charlie>I believe you'll be correct",
                    "about that, Bill.</v>",
                    "00:00:10.810 --> 00:00:11.170",
                    "<v Bill>Awesome.</v>")
    
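    # Collapse all lines into one string so each speaker's text is contiguous
    transcript <- paste(transcript, collapse = " ")
    # Escape the dot so it matches only a literal decimal point
    timestamp_regex <- "\\d+:\\d+:\\d+\\.\\d+ --> \\d+:\\d+:\\d+\\.\\d+"
    # Lazy match from each opening <v ...> to the nearest closing </v>
    speech_regex <- "<v .*?</v>"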
    
    timestamps <- stringr::str_extract_all(transcript, timestamp_regex)[[1]]
    speech <- stringr::str_extract_all(transcript, speech_regex)[[1]]
    
    vctrs::vec_interleave(timestamps, speech)
    #> [1] "00:00:03.990 --> 00:00:05.270"                                                
    #> [2] "<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>"
    #> [3] "00:00:05.770 --> 00:00:07.370"                                                
    #> [4] "<v Charlie>I believe you'll be correct about that, Bill.</v>"                 
    #> [5] "00:00:10.810 --> 00:00:11.170"                                                
    #> [6] "<v Bill>Awesome.</v>"
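
If you would rather keep the timestamps and speech as distinct variables, as suggested above, here is a minimal sketch of one way to do it. It swaps stringr for base R's `regmatches()`/`gregexpr()` so it runs without extra packages. The data frame name `cues` and its column names (`start`, `end`, `speaker`, `text`) are my own choices for illustration, and it assumes every timestamp is followed by exactly one `<v ...>...</v>` block, as in the sample data.

```r
transcript <- c("00:00:03.990 --> 00:00:05.270",
                "<v Bill>I'm here to take some notes. I've",
                "heard this will be interesting.</v>",
                "00:00:05.770 --> 00:00:07.370",
                "<v Charlie>I believe you'll be correct",
                "about that, Bill.</v>",
                "00:00:10.810 --> 00:00:11.170",
                "<v Bill>Awesome.</v>")

# Collapse to one string so a speech block split across lines becomes contiguous
joined <- paste(transcript, collapse = " ")

timestamp_regex <- "\\d+:\\d+:\\d+\\.\\d+ --> \\d+:\\d+:\\d+\\.\\d+"
speech_regex <- "<v .*?</v>"

# Base R equivalents of the extract-all step
timestamps <- regmatches(joined, gregexpr(timestamp_regex, joined, perl = TRUE))[[1]]
speech <- regmatches(joined, gregexpr(speech_regex, joined, perl = TRUE))[[1]]

# Assumption: timestamps and speech pair up one-to-one, in order
stopifnot(length(timestamps) == length(speech))

# One row per cue, with the tag markup stripped from the spoken text
cues <- data.frame(
  start   = sub(" --> .*$", "", timestamps),
  end     = sub("^.* --> ", "", timestamps),
  speaker = sub("^<v ([^>]+)>.*$", "\\1", speech),
  text    = sub("^<v [^>]+>(.*)</v>$", "\\1", speech),
  stringsAsFactors = FALSE
)

print(cues)
```

With one row per cue, filtering by speaker or joining back to other data is simpler than working with the interleaved vector. If a cue can contain several `<v>` blocks, the one-to-one pairing check will fail and the grouping would need to be done differently.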