r nlp qdap qdapregex

qdapRegex::rm_nchar_words returns different results when non-English letters are involved?


Please help me with the following confusion:

qdapRegex::rm_nchar_words("è ûé", "1,2")
[1] "è ûé"

qdapRegex::rm_nchar_words('k ku ppp d', "1,2")
[1] "ppp"

Why does the first call not return "" while the second one works as expected? What am I missing here? The only thing I can think of is that the string in the first call is built from non-English letters.

Any solution?



Solution

  • As mentioned by the author of the package:

    It uses \w to define letters which is defined as [A-Za-z0-9_]. You would need to write your own custom regex to handle the non-ascii letters

    UPDATE:

    On my Win 7 machine the output is as expected.

    One possible way to solve it is to pass the pattern "[\\pL_]", which matches any letter in any language (plus the underscore), instead of the default \w:

    rm_nchar_words("è ûé", "1,2", pattern = "[\\pL_]")
    

    Locale on Win machine:

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
    [4] LC_NUMERIC=C                           LC_TIME=English_United States.1252  
    

    I will keep investigating this and post updates to my answer.

    UPDATE 2:

    rm_nchar_words("è ûé", "1,2", pattern = "[\\pL_]")
    ""
    

    works on my Ubuntu 18.04.
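
For anyone who wants to see the idea without the package, here is a minimal base R sketch of the same approach. It is not how qdapRegex is implemented: `rm_short_words` is a hypothetical helper name, and the lookaround-based pattern is my own assumption about how to express "a whole word of 1 to 2 letters" with the `[\\p{L}_]` character class. Unlike `\w`, this class does not include digits.

```r
# Hypothetical helper (not part of qdapRegex): remove whole words whose length
# is between min_len and max_len, where a "word character" is any Unicode
# letter (\p{L}) or an underscore.
rm_short_words <- function(x, min_len = 1L, max_len = 2L) {
  x <- enc2utf8(x)                 # hand PCRE UTF-8 input so \p{L} applies
  word_char <- "[\\p{L}_]"
  # A run of min_len..max_len word characters that is neither preceded
  # nor followed by another word character.
  pattern <- sprintf("(?<!%s)%s{%d,%d}(?!%s)",
                     word_char, word_char, min_len, max_len, word_char)
  out <- gsub(pattern, "", x, perl = TRUE)
  # Collapse the leftover whitespace and trim both ends.
  trimws(gsub("\\s+", " ", out, perl = TRUE))
}

rm_short_words("\u00e8 \u00fb\u00e9")  # the question's "è ûé"; returns ""
rm_short_words("k ku ppp d")           # returns "ppp"
```

The `\u` escapes are used only so the example does not depend on the encoding of the script file.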