I have a column of inconsistently formatted strings that I am trying to split using tidyr::separate_wider_delim()
. These either have the pattern "text[space]text", or "text[space]text[comma][space]text" (see example below). In the former, I want to split at the space; in the latter I want to split at the comma/space combination (which is always going to be the second occurrence of a space).
Either of these situations is easy using separate_wider_delim()
. However, I can't figure out how to do both at the same time. Doing them in sequence also doesn't work, because the space is consistent between them.
library(tidyr)
df <- data.frame(txt = c("December 2005", "January 2008", "January 3, 2001", "December 18, 2004"))
df %>%
separate_wider_delim(txt, delim = ", ",
names = c("month", "year"),
too_few = "align_start")
What I'm trying to get to is:
month year
December 2005
January 2008
January 3 2001
December 18 2004
I can use regex to identify the location of the second space:
str_locate_all(txt, "\\s")[[1]][2, 2]
But I can't figure out how to use this to define the delimiter. I could probably just grab the last four characters pretty easily, but my real data is messier than this, so I would greatly prefer something that relies on the delimiter (if only one space, then the space; if multiple spaces, then the second space/the space following the comma).
Staying with delim-based splitting you can use a regex to define the delimeter as:
,?
: an optional comma\s+
: followed by the space(?= ... )
: before
\d{4}
: the 4-digit number (year)$
: that comes before the end of stringSee regex demo at regex101
library(tidyr)
library(stringr)
df <- data.frame(
txt = c(
"December 2005",
"January 2008",
"January 3, 2001",
"December 18, 2004"
)
)
df %>%
separate_wider_delim(
txt,
delim = stringr::regex(",?\\s+(?=\\d{4}$)"),
names = c("month", "year"),
too_few = "align_start"
)
# # A tibble: 4 × 2
# month year
# <chr> <chr>
# 1 December 2005
# 2 January 2008
# 3 January 3 2001
# 4 December 18 2004