I'm trying to extract TV show names from a txt file using R.
I have loaded the file and assigned it to a variable called txt. Now I'm trying to use a regular expression to extract just the information I want.
The lines I want to extract look like this:
SHOW: Game of Thrones 7:00 PM EST
SHOW: The Outsider 3:00 PM EST
SHOW: Don't Be a Menace to South Central While Drinking Your Juice In The Hood 10:00 AM EST
and so on. There are about 320 shows and I want to extract all 320 of them.
So far, I've come up with this.
pattern <- "SHOW:\\s\\w*"
str_extract_all(txt, pattern)
However, it doesn't extract the entire title as I intended (e.g., it extracts "SHOW: Game" instead of "SHOW: Game of Thrones"). If I only wanted that one show, I could use "SHOW:\\s\\w*\\s\\w*\\s\\w*"
to match the word count, but I want to extract all shows in txt, including those with longer and shorter names.
How should I write the regular expression to get the intended result?
You could get the value without using lookarounds by matching SHOW:
and capturing the title in group 1, matching as little as possible until the first time of day followed by AM or PM.
\bSHOW:\s+(.*?)\s+\d{1,2}:\d{1,2}\s+[AP]M\b
Explanation
\bSHOW:\s+
A word boundary, then match SHOW: and 1+ whitespace chars
(.*?)
Capture group 1, match as little as possible (non-greedy)
\s+\d{1,2}:\d{1,2}
Match 1+ whitespace chars, 1-2 digits, a colon, and 1-2 digits
\s+[AP]M\b
Match 1+ whitespace chars followed by either AM or PM and a word boundary
Example
library(stringr)
txt <- c(
  "SHOW: Game of Thrones 7:00 PM EST",
  "SHOW: The Outsider 3:00 PM EST",
  "SHOW: Don't Be a Menace to South Central While Drinking Your Juice In The Hood 10:00 AM EST"
)
pattern <- "\\bSHOW:\\s+(.*?)\\s+\\d{1,2}:\\d{1,2}\\s+[AP]M\\b"
# Column 1 of the result is the full match; column 2 is capture group 1 (the title)
str_match(txt, pattern)[,2]
Output
[1] "Game of Thrones"
[2] "The Outsider"
[3] "Don't Be a Menace to South Central While Drinking Your Juice In The Hood"
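The example above assumes txt is a character vector with one listing per element. If the file was instead read into a single long string, str_match() would return only the first show. A minimal sketch for that case, assuming the listings are separated by newlines (txt_single is a hypothetical name used only for illustration):

```r
library(stringr)

# Assumption: the whole file is one string, with one listing per line
txt_single <- "SHOW: Game of Thrones 7:00 PM EST\nSHOW: The Outsider 3:00 PM EST"

pattern <- "\\bSHOW:\\s+(.*?)\\s+\\d{1,2}:\\d{1,2}\\s+[AP]M\\b"

# str_match_all() returns a list with one matrix per input string;
# column 1 is the full match, column 2 is capture group 1 (the title)
titles <- str_match_all(txt_single, pattern)[[1]][, 2]
titles
```

With the real file, length(titles) should come out near 320 if every listing follows the same format.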
If you want to include SHOW: in the result, it can be part of the capturing group.
\b(SHOW:.*?)\s+\d{1,2}:\d{1,2}\s+[AP]M\b
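As a rough sketch of how that second pattern might be used from R, with the backslashes doubled for an R string literal (the sample vector below is illustrative and the variable names are assumptions):

```r
library(stringr)

# Illustrative sample lines in the format described in the question
txt <- c(
  "SHOW: Game of Thrones 7:00 PM EST",
  "SHOW: The Outsider 3:00 PM EST"
)

# The SHOW: prefix now sits inside capture group 1
pattern_with_prefix <- "\\b(SHOW:.*?)\\s+\\d{1,2}:\\d{1,2}\\s+[AP]M\\b"

# Column 2 of the str_match() result holds capture group 1
shows <- str_match(txt, pattern_with_prefix)[, 2]
shows
```

Lines that do not match the pattern come back as NA, which can be a quick way to spot listings with an unexpected format.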