r dataframe feature-engineering

Separating text in R


I have a data.frame with a column named movies_name. The values in this column look like this: City of Lost Children, The (Cité des enfants perdus, La) (1995). I want to separate the year from the rest of the movie name without losing the text inside the brackets. More precisely, I want to create one new column holding the year and another holding the movie name alone.

I tried this approach, but now I cannot get the movie name back:

My approach

thanks


Solution

  • Try the function extract from tidyr (part of the tidyverse):

    library(tidyverse)
    df %>%
      extract(movies_name,
              into = c("title", "year"),
              regex = "(\\D+)\\s\\((\\d+)\\)")

    #                                                      title year
    # 1 City of Lost Children, The (Cité des enfants perdus, La) 1995
    # 2                                             another film 2020
    

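    How the regex works:

      • (\\D+) is the first capture group: one or more non-digit characters. This is the title, including any brackets it contains.
      • \\s is a single whitespace character between the title and the year.
      • \\( is a literal opening bracket.
      • (\\d+) is the second capture group: one or more digits. This is the year.
      • \\) is a literal closing bracket.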
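
    To see the two capture groups in isolation, here is a minimal base-R sketch. It uses regexec() and regmatches() rather than tidyr, and the variable names (x, pattern, parts) are only illustrative. It assumes the same pattern as in the extract() call above.

    ```r
    # Example title in the same "Title (year)" format as in the question
    x <- "another film (2020)"

    # Same pattern as in the extract() call
    pattern <- "(\\D+)\\s\\((\\d+)\\)"

    # regexec() locates the match; regmatches() returns the full match
    # followed by one element per capture group
    parts <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

    full_match <- parts[1]  # the whole matched string
    title      <- parts[2]  # group 1: the non-digit run before the year
    year       <- parts[3]  # group 2: the digits inside the final brackets

    print(parts)
    ```

    The second and third elements are what extract() would place in the columns named in `into`.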

    Data 1:

    df <- data.frame(
      movies_name = c("City of Lost Children, The (Cité des enfants perdus, La) (1995)",
                      "another film (2020)")
    )
    

    EDIT:

    Okay, following the comment, let's make this a little more complex by including a title that itself contains digits:

    Data 2:

    df <- data.frame(
      movies_name = c("City of Lost Children, The (Cité des enfants perdus, La) (1995)",
                      "another film (2020)",
                      "Under Siege 2: Dark Territory (1995)")
    )
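
    Before changing the regex, it may help to see what the first pattern does with such a title. The sketch below applies it in base R (regexec() and regmatches(), not tidyr; the variable names are illustrative). Because \\D+ cannot match the digit in the title and the pattern is not anchored, the match can only start after that digit.

    ```r
    x <- "Under Siege 2: Dark Territory (1995)"

    # First pattern: \\D+ cannot cross the "2" in the title
    old_pattern <- "(\\D+)\\s\\((\\d+)\\)"

    old_parts <- regmatches(x, regexec(old_pattern, x, perl = TRUE))[[1]]

    # Group 1 starts only after the digit, so the beginning of the title is lost
    print(old_parts[2])  # ": Dark Territory"
    print(old_parts[3])  # "1995"
    ```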
    

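    Solution - actually easier than the previous one ;) Replace \\D+ with .+, which matches any characters, digits included. The rest of the pattern still requires a space followed by digits in brackets, so the year is still split off: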

    df %>%
      extract(movies_name,
              into = c("title", "year"),
              regex = "(.+)\\s\\((\\d+)\\)")

    #                                                      title year
    # 1 City of Lost Children, The (Cité des enfants perdus, La) 1995
    # 2                                             another film 2020
    # 3                            Under Siege 2: Dark Territory 1995
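
    If you would rather not depend on tidyr, the same idea can be sketched in base R with sub(). This is only one possible variant, not part of the solution above: the ^ and $ anchors and the as.integer() conversion are additions of mine, and the vector reuses two of the titles from Data 2.

    ```r
    movies_name <- c("another film (2020)",
                     "Under Siege 2: Dark Territory (1995)")

    # Anchored version of the second pattern: ^ and $ force the year to be
    # the bracketed digits at the very end of the string
    pattern <- "^(.+)\\s\\((\\d+)\\)$"

    result <- data.frame(
      # \\1 keeps only the first capture group (the title)
      title = sub(pattern, "\\1", movies_name, perl = TRUE),
      # \\2 keeps only the second capture group (the year), stored as integer
      year  = as.integer(sub(pattern, "\\2", movies_name, perl = TRUE)),
      stringsAsFactors = FALSE
    )

    print(result)
    ```

    With tidyr, a similar type conversion should be possible by passing convert = TRUE to extract(), which would otherwise leave year as a character column.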