I would expect gsub
and stringr::str_replace_all
to return the same result in the following, but only gsub
returns the intended result. I am developing a lesson to demonstrate str_replace_all
so I would like to know why it returns a different result here.
txt <- ".72 2.51\n2015** 2.45 2.30 2.00 1.44 1.20 1.54 1.84 1.56 1.94 1.47 0.86 1.01\n2016** 1.53 1.75 2.40 2.62 2.35 2.03 1.25 0.52 0.45 0.56 1.88 1.17\n2017** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n2018** 0.70 0"
gsub(".*2017|2018.*", "", txt)
stringr::str_replace_all(txt, ".*2017|2018.*", "")
gsub
returns the intended output (everything before and including 2017
, and after and including 2018
, has been removed).
output of gsub (intended)
[1] "** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
However str_replace_all
only replaces the 2017
and 2018
but leaves the rest, even though the same pattern
is used for both.
output of str_replace_all (not intended)
[1] ".72 2.51\n2015** 2.45 2.30 2.00 1.44 1.20 1.54 1.84 1.56 1.94 1.47 0.86 1.01\n2016** 1.53 1.75 2.40 2.62 2.35 2.03 1.25 0.52 0.45 0.56 1.88 1.17\n** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
Why is this the case?
Base R relies on two regex libraries.
As default R uses TRE.
We can specify perl = TRUE
to use PCRE (perl like regular expressions).
The {stringr} package uses ICU (Java like regular expressions).
In your case the problem is that the dot .
doesn’t match line breaks in PCRE and ICU, while it does match line breaks in TRE:
library(stringr)
txt <- ".72 2.51\n2015** 2.45 2.30 2.00 1.44 1.20 1.54 1.84 1.56 1.94 1.47 0.86 1.01\n2016** 1.53 1.75 2.40 2.62 2.35 2.03 1.25 0.52 0.45 0.56 1.88 1.17\n2017** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n2018** 0.70 0"
(base_tre <- gsub(".*2017|2018.*", "", txt))
#> [1] "** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
(base_perl <- gsub(".*2017|2018.*", "", txt, perl = TRUE))
#> [1] ".72 2.51\n2015** 2.45 2.30 2.00 1.44 1.20 1.54 1.84 1.56 1.94 1.47 0.86 1.01\n2016** 1.53 1.75 2.40 2.62 2.35 2.03 1.25 0.52 0.45 0.56 1.88 1.17\n** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
(string_r <- str_replace_all(txt, ".*2017|2018.*", ""))
#> [1] ".72 2.51\n2015** 2.45 2.30 2.00 1.44 1.20 1.54 1.84 1.56 1.94 1.47 0.86 1.01\n2016** 1.53 1.75 2.40 2.62 2.35 2.03 1.25 0.52 0.45 0.56 1.88 1.17\n** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
identical(base_perl, string_r)
#> [1] TRUE
We can use modifiers
to change the behavior of PCRE and ICU regex so that line breaks are matched
by .
. This will produce the same output as with base R TRE:
(base_perl <- gsub("(?s).*2017|2018(?s).*", "", txt, perl = TRUE))
#> [1] "** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
(string_r <- str_replace_all(txt, "(?s).*2017|2018(?s).*", ""))
#> [1] "** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
identical(base_perl, string_r)
#> [1] TRUE
Finally, unlike TRE, PCRE and ICU allow us to use look arounds which are also an option to solve the problem
str_match(txt, "(?<=2017).*.(?=\\n2018)")
#> [,1]
#> [1,] "** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50"
Created on 2021-08-10 by the reprex package (v0.3.0)