I have a data.frame that contains a column named movies_name. This column contains data in this format: City of Lost Children, The (Cité des enfants perdus, La) (1995). I want to separate the year from the rest of the movie name without losing the text inside the brackets. To be more precise, I want to create a new column holding the year and another one for the movie name alone.
I tried this approach but now I cannot gather back the movie name:
thanks
Try the function extract from tidyr (part of the tidyverse):
library(tidyverse)
df %>%
  extract(movies_name,
          into = c("title", "year"),
          regex = "(\\D+)\\s\\((\\d+)\\)")
title year
1 City of Lost Children, The (Cité des enfants perdus, La) 1995
2 another film 2020
How the regex works:
(\\D+): first capture group, matching one or more characters that are not digits
\\s\\(: a whitespace character and an opening parenthesis (not captured)
(\\d+): second capture group, matching one or more digits
\\): closing parenthesis (not captured)
Data 1:
df <- data.frame(
  movies_name = c("City of Lost Children, The (Cité des enfants perdus, La) (1995)",
                  "another film (2020)")
)
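To see what each capture group picks up, the same pattern can be tried outside of extract. This is only a base-R sketch, not part of the tidyr solution; the names x, pattern, groups, titles and years are made up for illustration.

```r
# Base-R sketch: run the pattern and inspect the two capture groups.
x <- c("City of Lost Children, The (Cité des enfants perdus, La) (1995)",
       "another film (2020)")
pattern <- "(\\D+)\\s\\((\\d+)\\)"

# regexec() finds the match positions, regmatches() pulls out the substrings.
# Each list element holds: full match, group 1 (title), group 2 (year).
groups <- regmatches(x, regexec(pattern, x))

titles <- vapply(groups, function(g) g[2], character(1))
years  <- vapply(groups, function(g) g[3], character(1))

print(titles)
print(years)
```

The parentheses around the year are matched but sit outside both groups, which is why they do not show up in either result.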
EDIT:
Okay, following the comment, let's make this a little more complex by including a title that itself contains digits:
Data 2:
df <- data.frame(
  movies_name = c("City of Lost Children, The (Cité des enfants perdus, La) (1995)",
                  "another film (2020)",
                  "Under Siege 2: Dark Territory (1995)")
)
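A quick check of why the first pattern is not enough for this data: \\D+ cannot cross a digit, so on the new title the match can only begin after the "2". The sketch below uses base R rather than extract, and old_pattern, x and m are illustrative names.

```r
# Base-R sketch: apply the first pattern to a title that contains a digit.
old_pattern <- "(\\D+)\\s\\((\\d+)\\)"
x <- "Under Siege 2: Dark Territory (1995)"

# m holds: full match, group 1 (title), group 2 (year).
m <- regmatches(x, regexec(old_pattern, x))[[1]]

print(m[2])  # ": Dark Territory" (everything up to and including the "2" is lost)
print(m[3])  # "1995"
```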
Solution - actually easier than the previous one ;)
df %>%
  extract(movies_name,
          into = c("title", "year"),
          regex = "(.+)\\s\\((\\d+)\\)")
title year
1 City of Lost Children, The (Cité des enfants perdus, La) 1995
2 another film 2020
3 Under Siege 2: Dark Territory 1995
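This works because .+ is greedy: it grabs as much as it can and only gives back enough for the final " (digits)" to match, so earlier brackets and digits stay in the title. If you want to be a bit stricter, a possible variant (a sketch, under the assumption that the year is always four digits at the very end of the string) anchors the pattern and uses extract's convert argument to get the year as a number. The name res is just for illustration.

```r
library(tidyr)

df <- data.frame(
  movies_name = c("City of Lost Children, The (Cité des enfants perdus, La) (1995)",
                  "another film (2020)",
                  "Under Siege 2: Dark Territory (1995)")
)

# ^ and $ tie the pattern to the whole string; \\d{4} assumes a four-digit year.
# convert = TRUE lets tidyr turn the extracted year into an integer column.
res <- extract(df, movies_name,
               into = c("title", "year"),
               regex = "^(.+)\\s\\((\\d{4})\\)$",
               convert = TRUE)

print(res)
str(res$year)
```

Rows that do not fit the anchored pattern come back as NA in both new columns, which makes malformed entries easy to spot.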