
Use an ASCII character in (g)sub


For my gsub call to pass R CMD check, I need to use only ASCII characters. In one place in my package I use a dash, which is a non-ASCII character, as follows:

sub("–", "to", x = "–")

which of course works.

However, I want to use the ASCII (or other) code in the substitution in order to avoid warnings from R CMD check, as follows:

stringi::stri_enc_toascii("–")
[1] "\032"
 
sub("\\032", "to", x = "–")

which does not work.

How can I match on a character in ASCII format?


Solution

  • This is an en dash:

    The en dash, en rule, or nut dash – is traditionally half the width of an em dash.

    A note on ASCII

    Here is an example straight from R's Quotes help page (?Quotes):

    ## Backslashes followed by up to three numbers are interpreted as
    ## octal notation for ASCII characters.
    "\110\145\154\154\157\40\127\157\162\154\144\41"
    # [1] "Hello World!"
    

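    However, the en dash is not an ASCII character at all: its code point in octal is 20023, which is more than three digits, so you cannot use this octal notation for it. That is also why stri_enc_toascii() returned "\032": it replaces every non-ASCII character with the ASCII substitute character (octal 032), so the result no longer identifies the dash.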
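
    To see where that number comes from, here is a small sketch using only base R. The variable names are made up for illustration, and the dash is written with the Unicode escape described under "Use Unicode" so that the snippet itself stays ASCII-only:

    ```r
    # "\u2013" is the en dash, written with an ASCII-only escape
    en_dash <- "\u2013"

    code_point <- utf8ToInt(en_dash)   # 8211
    format(as.octmode(code_point))     # "20023": five octal digits
    format(as.hexmode(code_point))     # "2013"

    # In UTF-8 the character is stored as three bytes, not one
    charToRaw(en_dash)
    # [1] e2 80 93
    ```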

    Use Unicode

    You can check its Unicode representation as follows:

    as.hexmode(utf8ToInt("–"))
    # [1] "2013"
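
    If you need to do this for more than one character, a small helper can build the escape text for you. This is only a sketch: to_unicode_escape is a hypothetical name, not a base R function, and it assumes a single string whose code points fit in four hex digits.

    ```r
    # Hypothetical helper: return the text of the \uXXXX escape for each
    # character of one string, ready to paste into source code.
    # Assumes code points up to FFFF (longer ones would need \U instead).
    to_unicode_escape <- function(x) {
      sprintf("\\u%04X", utf8ToInt(x))
    }

    to_unicode_escape("\u2013")
    # [1] "\\u2013"
    ```

    The doubled backslash is just how print() shows a literal backslash; cat() would show \u2013.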
    

    R takes its Unicode escape sequences in the following formats:

    ‘\unnnn’ Unicode character with given code (1-4 hex digits)

    ‘\Unnnnnnnn’ Unicode character with given code (1-8 hex digits)

    In this case you could use either, but I tend to use the upper case variant as I don't need to worry about the number of digits. Simply supply this to sub():

    sub("\U2013", "to", x = "–")
    # [1] "to"
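
    In package code you will probably want the input to be ASCII-only in the source as well. A minimal sketch of that, where en_dash, x and result are names invented for the example; it uses fixed = TRUE because the dash is a literal string rather than a regular expression:

    ```r
    en_dash <- "\u2013"
    x <- c(paste0("1990", en_dash, "1995"), "no dash here")

    result <- gsub(en_dash, " to ", x, fixed = TRUE)
    result
    # [1] "1990 to 1995" "no dash here"

    # Optional check with base R's tools package: this should print
    # nothing when every element of the result is plain ASCII.
    tools::showNonASCII(result)
    ```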