My data contains text strings with three important features, an id number separated by":" and a starting date and an end date. I need to get these tree numbers into three separate columns. I have tried different solutions, everything from unnest_tokens, grepl/grep, to separate, but can't seem to get it right, I may get one date, but I can't seem to get them in the correct order or into a data frame.
Input data:
input<- data.frame(
id=c(1,2,3),
value=c("a long title containing all sorts - off `characters` 2022:03 29.10.2021
21.02.2022",
"but the strings always end with the same - document id, start date: and end date 2022:02
30.04.2020 18.02.2022",
"so I need to split document id, start and end dates into separate columns 2000:01
07.10.2000 15.02.2021")
)
Desired output:
output <-data.frame(
id=c(1,2,3),
value=c("a long title containing all sorts - off `characters`",
"but the strings always end with the same - document id, start date: and end date",
"so I need to split document id, start and end dates into separate columns"),
docid=c("2022:03", "2022:02", "2000:01"),
start=c("29.10.2021", "30.04.2020", "07.10.2000"),
end=c("21.02.2022", "18.02.2022", "15.02.2021")
)
This is most conveniently accomplished by extract
: in its regex
argument we exhaustively describe the strings we want to split into columns as a complex pattern in which the parts that need to go into the columns are wrapped into capture groups (...)
:
library(tidyr)
input %>%
extract(value,
into = c("value", "docid", "start", "end"),
regex = "(.*)\\s(\\d{4}:\\d{2})\\s{1,}(.*)\\s{1,}(.*)")
id value docid start
1 1 a long title containing all sorts - off `characters` 2022:03 29.10.2021
2 2 but the strings always end with the same - document id, start date: and end date 2022:02 30.04.2020
3 3 so I need to split document id, start and end dates into separate columns 2000:01 07.10.2000
end
1 21.02.2022
2 18.02.2022
3 15.02.2021