Edit: I encountered this on R version 3.6.1. Apparently the issue does not exist in newer versions, where the two functions behave the same.
Consider this vector, where the first element is in the Latin-1 Supplement Unicode block, the second element is in the Latin Extended Additional Unicode block, and elements 3-6 are in the Latin Extended-D Unicode block (I see the same for the Latin Extended-E Unicode block). The regular expression used is ^[\\p{L} ]+$, which is supposed to match a string consisting only of spaces and letters from any language. I see that grepl and stri_detect_regex interpret \p{L} differently.
library(stringi)  # provides stri_detect_regex
v <- c("é", "Ḃ", "Ꞵ", "ꞵ", "Ꞷ", "ꞷ", "keepme", "remove$me", "remove.me")
v[grepl("^[\\p{L} ]+$", v, perl = TRUE)]
# [1] "é" "Ḃ" "keepme"
v[stri_detect_regex(v, "^[\\p{L} ]+$")]
# [1] "é" "Ḃ" "\ua7b4" "\ua7b5" "\ua7b6" "\ua7b7" "keepme"
Is there any documentation on why they behave differently on this expression?
This happens on older R versions: in R 3.6.1, base grepl does not recognize all Unicode blocks with the regex \p{L}. However, as @Oliver commented, it works as expected in later versions of R; he tested it in R 4.2.1. For me the question is answered. Thanks!
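For anyone who runs into the same thing, here is a minimal sketch for checking which regex engine versions a given R session uses. grepl with perl = TRUE goes through PCRE, while stringi goes through ICU, so the two carry separate Unicode property tables. That a version gap between those tables explains the difference above is my assumption, not something I have confirmed. The variable names are only for illustration.

```r
library(stringi)

# Version of the PCRE library that base R's perl = TRUE regex functions use
pcre_version <- extSoftVersion()[["PCRE"]]
print(pcre_version)

# ICU and Unicode versions that stringi was built against
print(stri_info()[c("ICU.version", "Unicode.version")])

# Re-test a single Latin Extended-D letter (U+A7B4) in both engines.
# The escape avoids any dependence on the source file encoding.
x <- "\ua7b4"
base_match <- grepl("^\\p{L}$", x, perl = TRUE)
stringi_match <- stri_detect_regex(x, "^\\p{L}$")
print(c(base = base_match, stringi = stringi_match))
```

If the two results differ, comparing the printed versions between an old and a new R installation should show whether the engines changed between them.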