Different behavior of base R gsub and stringr::str_replace_all?


I would expect gsub and stringr::str_replace_all to return the same result in the following, but only gsub returns the intended result. I am developing a lesson to demonstrate str_replace_all so I would like to know why it returns a different result here.

txt <- ".72   2.51\n2015**   2.45   2.30   2.00   1.44   1.20   1.54   1.84   1.56   1.94   1.47   0.86   1.01\n2016**   1.53   1.75   2.40   2.62   2.35   2.03   1.25   0.52   0.45   0.56   1.88   1.17\n2017**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n2018**   0.70   0"

gsub(".*2017|2018.*", "", txt)

stringr::str_replace_all(txt, ".*2017|2018.*", "")

gsub returns the intended output (everything before and including 2017, and after and including 2018, has been removed).

output of gsub (intended)

[1] "**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"

However, str_replace_all only removes 2017 and the line starting with 2018, and leaves the earlier lines in place, even though the same pattern is used for both.

output of str_replace_all (not intended)

[1] ".72   2.51\n2015**   2.45   2.30   2.00   1.44   1.20   1.54   1.84   1.56   1.94   1.47   0.86   1.01\n2016**   1.53   1.75   2.40   2.62   2.35   2.03   1.25   0.52   0.45   0.56   1.88   1.17\n**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"

Why is this the case?


Solution

  • Base R can use two regex engines. By default it uses TRE; setting perl = TRUE switches to PCRE (Perl-compatible regular expressions). The {stringr} package uses ICU (Java-like regular expressions).

    In your case, the problem is that the dot . does not match line breaks in PCRE and ICU by default, whereas it does match them in TRE:

    library(stringr)
    
    txt <- ".72   2.51\n2015**   2.45   2.30   2.00   1.44   1.20   1.54   1.84   1.56   1.94   1.47   0.86   1.01\n2016**   1.53   1.75   2.40   2.62   2.35   2.03   1.25   0.52   0.45   0.56   1.88   1.17\n2017**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n2018**   0.70   0"
    
    (base_tre <- gsub(".*2017|2018.*", "", txt))
    #> [1] "**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"
    (base_perl <- gsub(".*2017|2018.*", "", txt, perl = TRUE))
    #> [1] ".72   2.51\n2015**   2.45   2.30   2.00   1.44   1.20   1.54   1.84   1.56   1.94   1.47   0.86   1.01\n2016**   1.53   1.75   2.40   2.62   2.35   2.03   1.25   0.52   0.45   0.56   1.88   1.17\n**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"
    (string_r <- str_replace_all(txt, ".*2017|2018.*", ""))
    #> [1] ".72   2.51\n2015**   2.45   2.30   2.00   1.44   1.20   1.54   1.84   1.56   1.94   1.47   0.86   1.01\n2016**   1.53   1.75   2.40   2.62   2.35   2.03   1.25   0.52   0.45   0.56   1.88   1.17\n**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"
    
    identical(base_perl, string_r)
    #> [1] TRUE
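
    The difference is easier to see on a tiny input. The following is a minimal sketch, assuming a made-up two-line string x that is not part of the original data, and it asks each engine whether . can stand in for the line break:

    ```r
    library(stringr)

    # Hypothetical two-line string, used only for illustration
    x <- "a\nb"

    tre  <- grepl("a.b", x)               # base R default engine (TRE)
    pcre <- grepl("a.b", x, perl = TRUE)  # base R with PCRE
    icu  <- str_detect(x, "a.b")          # stringr (ICU)

    # Only TRE lets the dot match the newline between "a" and "b"
    c(tre = tre, pcre = pcre, icu = icu)
    ```

    Under these assumptions, only the TRE call should report a match.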
    

    We can use the (?s) modifier to change the behavior of PCRE and ICU so that . also matches line breaks. This produces the same output as base R with TRE:

    (base_perl <- gsub("(?s).*2017|2018(?s).*", "", txt, perl = TRUE))
    #> [1] "**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"
    
    (string_r <- str_replace_all(txt, "(?s).*2017|2018(?s).*", ""))
    #> [1] "**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"
    
    identical(base_perl, string_r)
    #> [1] TRUE
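
    In {stringr}, the same switch can also be set outside the pattern with the regex() helper and its dotall argument, which some readers may find easier to spot in a lesson. A minimal sketch, assuming a shortened, made-up input x in place of txt:

    ```r
    library(stringr)

    # Hypothetical shortened input, used only for illustration
    x <- "2016 a\n2017 b\n2018 c"

    # Inline modifier, as in the pattern above
    inline <- str_replace_all(x, "(?s).*2017|2018(?s).*", "")

    # Same pattern, with the dot-matches-newline behavior set through regex()
    wrapped <- str_replace_all(x, regex(".*2017|2018.*", dotall = TRUE), "")

    identical(inline, wrapped)
    ```

    Both calls should leave only the text between 2017 and 2018, including the trailing line break.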
    

    Finally, unlike TRE, PCRE and ICU support lookarounds, which offer another way to solve the problem by extracting the wanted line directly instead of removing everything around it:

    str_match(txt, "(?<=2017).*.(?=\\n2018)")
    #>      [,1]                                                                                    
    #> [1,] "**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50"
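
    Since str_match() returns a matrix, a plain character vector may be more convenient in practice. The sketch below assumes a shortened, made-up input x and a slightly simplified pattern (.+ instead of .*.), and shows the same lookaround idea with str_extract() and with base R using perl = TRUE:

    ```r
    library(stringr)

    # Hypothetical shortened input, used only for illustration
    x <- "2016 a\n2017 b\n2018 c"

    # Text preceded by "2017" and followed by a line break plus "2018"
    pattern <- "(?<=2017).+(?=\\n2018)"

    # ICU via stringr: returns a character vector rather than a matrix
    icu_hit <- str_extract(x, pattern)

    # PCRE via base R: regexpr() locates the match, regmatches() extracts it
    pcre_hit <- regmatches(x, regexpr(pattern, x, perl = TRUE))

    identical(icu_hit, pcre_hit)
    ```

    As with the str_match() call above, the extracted text does not include the trailing line break.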
    

    Created on 2021-08-10 by the reprex package (v0.3.0)