How to combine rows, separated by returns, that start and end with specific characters?

I'm working with video transcript data. The data was automatically exported with a return mid-sentence. I'd like to combine the spoken lines into a single row. The data is formatted as such:

data <- data.frame(transcript = c("00:00:03.990 --> 00:00:05.270",
 "<v Bill>I'm here to take some notes. I've",
 "heard this will be interesting.</v>",
 "00:00:05.770 --> 00:00:07.370",
 "<v Charlie>I believe you'll be correct",
 "about that, Bill.</v>",
 "00:00:10.810 --> 00:00:11.170",
 "<v Bill>Awesome.</v>"))

Intended output:

intendedData <- data.frame(transcript = c("00:00:03.990 --> 00:00:05.270",
 "<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>",
 "00:00:05.770 --> 00:00:07.370",
 "<v Charlie>I believe you'll be correct about that, Bill.</v>",
 "00:00:10.810 --> 00:00:11.170",
 "<v Bill>Awesome.</v>"))

I've tried conditional statements for rows that start with <v and end with </v>, but that didn't work. Any ideas will be greatly appreciated. Thank you!

Solution

You could paste the transcript together into a single long string, then use regular expressions to extract the timestamps and the speech. Personally, I would keep these as distinct variables, but you can interleave them to get the desired output:

transcript <- c("00:00:03.990 --> 00:00:05.270",
                "<v Bill>I'm here to take some notes. I've",
                "heard this will be interesting.</v>",
                "00:00:05.770 --> 00:00:07.370",
                "<v Charlie>I believe you'll be correct",
                "about that, Bill.</v>",
                "00:00:10.810 --> 00:00:11.170",
                "<v Bill>Awesome.</v>")

transcript <- paste(transcript, collapse = " ")
# Escape the dot so it matches a literal "." rather than any character
timestamp_regex <- "\\d+:\\d+:\\d+\\.\\d+ --> \\d+:\\d+:\\d+\\.\\d+"
speech_regex <- "<v .*?</v>"

timestamps <- stringr::str_extract_all(transcript, timestamp_regex)[[1]]
speech <- stringr::str_extract_all(transcript, speech_regex)[[1]]

vctrs::vec_interleave(timestamps, speech)
#> [1] "00:00:03.990 --> 00:00:05.270"                                                
#> [2] "<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>"
#> [3] "00:00:05.770 --> 00:00:07.370"                                                
#> [4] "<v Charlie>I believe you'll be correct about that, Bill.</v>"                 
#> [5] "00:00:10.810 --> 00:00:11.170"                                                
#> [6] "<v Bill>Awesome.</v>"
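
If you would rather keep the timestamps and speech as distinct variables, as suggested above, here is a minimal sketch in base R. The names `lines`, `joined` and `cues`, and the column names `start`, `end`, `speaker` and `text`, are my own choices for illustration, not anything from your data. It assumes every timestamp is followed by exactly one <v ...>...</v> block, which holds for the sample but may not hold for the full export:

```r
# Same sample lines as in the question, re-declared so this runs on its own.
lines <- c("00:00:03.990 --> 00:00:05.270",
           "<v Bill>I'm here to take some notes. I've",
           "heard this will be interesting.</v>",
           "00:00:05.770 --> 00:00:07.370",
           "<v Charlie>I believe you'll be correct",
           "about that, Bill.</v>",
           "00:00:10.810 --> 00:00:11.170",
           "<v Bill>Awesome.</v>")

# Collapse to one string so a speech block split over several lines is contiguous.
joined <- paste(lines, collapse = " ")

timestamp_regex <- "\\d+:\\d+:\\d+\\.\\d+ --> \\d+:\\d+:\\d+\\.\\d+"
speech_regex <- "<v .*?</v>"

# Base R equivalent of extracting all matches from a single string.
timestamps <- regmatches(joined, gregexpr(timestamp_regex, joined, perl = TRUE))[[1]]
speech <- regmatches(joined, gregexpr(speech_regex, joined, perl = TRUE))[[1]]

# Assumption: one speech block per timestamp. Stop early if that is not true.
stopifnot(length(timestamps) == length(speech))

# One row per cue, with the timestamp and the <v Speaker> tag split into columns.
cues <- data.frame(
  start   = sub(" --> .*$", "", timestamps),
  end     = sub("^.* --> ", "", timestamps),
  speaker = sub("^<v ([^>]*)>.*$", "\\1", speech),
  text    = sub("^<v [^>]*>(.*)</v>$", "\\1", speech)
)
cues
```

With one row per cue, filtering by speaker or joining to other data should be simpler than working with the interleaved vector. If a cue in your real data has no speech, or more than one voice tag, the length check will fail and the pairing logic would need to be adjusted.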