r regex stringr

R / stringr: split string, but keep the delimiters in the output


I searched for a solution, but there does not seem to be a clear one for R.
I am trying to split a string on a pattern, say a space followed by a capital letter, using the stringr package.

library(stringr)
x <- "Foobar foobar, Foobar foobar"
str_split(x, " [:upper:]")

This returns:

[[1]]
[1] "Foobar foobar," "oobar foobar"  

However, I would like the output to keep the letter that was matched as part of the delimiter:

[[1]]
[1] "Foobar foobar," "Foobar foobar"

There is probably no out-of-the-box solution in stringr, such as back-referencing, so I would appreciate any help.


Solution

  • You can split on one or more whitespace characters that are followed by an uppercase letter:

    > str_split(x, "\\s+(?=[[:upper:]])")
    [[1]]
    [1] "Foobar foobar," "Foobar foobar" 
    

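    Here, \s+ matches one or more whitespace characters, and (?=[[:upper:]]) is a positive lookahead that requires an uppercase letter immediately after them without consuming it, so the letter stays at the start of the next piece.

    Note that \s matches various whitespace characters, not just plain spaces. Also, it is safer to use [[:upper:]] rather than [:upper:] if you plan to use the pattern with other regex engines (PCRE, for example).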
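
For comparison, the same lookahead idea can be sketched in base R, without stringr. This is only an illustration under the assumption that perl = TRUE is passed to strsplit() so that the PCRE engine, which supports lookaheads, is used. The variable names consumed and parts are made up for the example.

```r
# Base R sketch of the lookahead split (no stringr needed).
x <- "Foobar foobar, Foobar foobar"

# A consuming pattern: the uppercase letter is part of the match,
# so it is removed together with the space.
consumed <- strsplit(x, " [[:upper:]]")[[1]]
print(consumed)
# [1] "Foobar foobar," "oobar foobar"

# perl = TRUE switches strsplit() to PCRE, which supports lookaheads.
# \\s+ consumes the whitespace; (?=[[:upper:]]) only checks that an
# uppercase letter follows, so that letter is kept in the next piece.
parts <- strsplit(x, "\\s+(?=[[:upper:]])", perl = TRUE)[[1]]
print(parts)
# [1] "Foobar foobar," "Foobar foobar"
```

The only difference between the two calls is whether the uppercase letter sits inside the match or inside the lookahead, which is what decides whether it survives the split.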