rstringstringi

Split numbers and dates into separate columns


My data contains text strings with three important features, an id number separated by":" and a starting date and an end date. I need to get these tree numbers into three separate columns. I have tried different solutions, everything from unnest_tokens, grepl/grep, to separate, but can't seem to get it right, I may get one date, but I can't seem to get them in the correct order or into a data frame.

Input data:

input<- data.frame(
  id=c(1,2,3),
  value=c("a long title containing all sorts - off `characters` 2022:03 29.10.2021 
  21.02.2022",
  "but the strings always end with the same - document id, start date: and end date  2022:02 
  30.04.2020 18.02.2022",
  "so I need to split document id, start and end dates into separate columns 2000:01 
  07.10.2000 15.02.2021")
  )

Desired output:

output <-data.frame(
 id=c(1,2,3),
 value=c("a long title containing all sorts - off `characters`",
 "but the strings always end with the same - document id, start date: and end date",
 "so I need to split document id, start and end dates into separate columns"),
 docid=c("2022:03", "2022:02", "2000:01"),
 start=c("29.10.2021", "30.04.2020", "07.10.2000"),
 end=c("21.02.2022", "18.02.2022", "15.02.2021")
  )

Solution

  • This is most conveniently accomplished by extract: in its regex argument we exhaustively describe the strings we want to split into columns as a complex pattern in which the parts that need to go into the columns are wrapped into capture groups (...):

    library(tidyr)
    input %>%
      extract(value,
              into = c("value", "docid", "start", "end"),
              regex = "(.*)\\s(\\d{4}:\\d{2})\\s{1,}(.*)\\s{1,}(.*)")
      id                                                                             value   docid      start
    1  1                              a long title containing all sorts - off `characters` 2022:03 29.10.2021
    2  2 but the strings always end with the same - document id, start date: and end date  2022:02 30.04.2020
    3  3         so I need to split document id, start and end dates into separate columns 2000:01 07.10.2000
             end
    1 21.02.2022
    2 18.02.2022
    3 15.02.2021