
Ruby string escape for supplementary plane Unicode characters


I know that I can escape a basic Unicode character in Ruby with the \uNNNN escape sequence. For example, for a smiling face U+263A (☺) I can use the string literal "\u263A".

How do I escape Unicode characters greater than U+FFFF that fall outside the basic multilingual plane, like a winking face: U+1F609 (😉)?

Using the surrogate pair form like in Java doesn't work; it results in an invalid string that contains the individual surrogate code points:

s = "\uD83D\uDE09" # => "\xED\xA0\xBD\xED\xB8\x89"
s.valid_encoding? # => false

Solution

  • You can use the escape sequence \u{XXXXXX}, where XXXXXX is between 1 and 6 hex digits:

    s = "\u{1F609}" # => "😉"
    

    The braces can also contain multiple runs separated by single spaces or tabs to encode multiple characters:

    s = "\u{41f 440 438 432 435 442 2c 20 43c 438 440}!" # => "Привет, мир!"
    

    You could also use byte escapes to write a literal that contains the UTF-8 bytes of the character, though that's less convenient, and the result is a UTF-8-encoded string only if the source file's encoding is UTF-8:

    # encoding: utf-8
    s = "\xF0\x9F\x98\x89" # => "😉"
    s.length # => 1
    

    # encoding: iso-8859-1
    s = "\xF0\x9F\x98\x89" # => "\xF0\x9F\x98\x89"
    s.length # => 4
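
To check the escapes above, here is a minimal sketch that builds the same winking face several ways and compares the results. It assumes the source file is UTF-8 (the default since Ruby 2.0). The helper `surrogate_pair_to_codepoint` is a hypothetical name introduced for illustration, not part of Ruby; it applies the standard UTF-16 arithmetic to turn a Java-style surrogate pair into a single code point.

```ruby
# The brace form yields one code point in a valid UTF-8 string.
wink = "\u{1F609}"
wink.valid_encoding?                      # => true
wink.length                               # => 1
wink.codepoints                           # => [128521], i.e. 0x1F609
wink.bytes.map { |b| format("%02X", b) }  # => ["F0", "9F", "98", "89"]

# Building the same character from an integer at run time.
from_pack = [0x1F609].pack("U")
from_chr  = 0x1F609.chr(Encoding::UTF_8)

# Hypothetical helper (not part of Ruby): combine a UTF-16 surrogate
# pair into the code point it represents.
def surrogate_pair_to_codepoint(high, low)
  0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
end

from_pair = surrogate_pair_to_codepoint(0xD83D, 0xDE09).chr(Encoding::UTF_8)

# A byte-escape literal takes the source file's encoding. force_encoding
# relabels the same bytes as UTF-8 without transcoding them; dup avoids
# mutating the literal itself.
from_bytes = "\xF0\x9F\x98\x89".dup.force_encoding(Encoding::UTF_8)
```

All four alternatives should compare equal to `wink`. The `force_encoding` call matters mainly when the file encoding is not UTF-8, as in the iso-8859-1 case above, where the literal would otherwise stay a four-character string.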