I know that I can escape a basic Unicode character in Ruby with the \uNNNN escape sequence.
For example, for a smiling face U+263A (☺) I can use the string literal "\u263A".
How do I escape Unicode characters greater than U+FFFF that fall outside the basic multilingual plane, like a winking face: U+1F609 (😉)?
Using the surrogate pair form like in Java doesn't work; it results in an invalid string that contains the individual surrogate code points:
s = "\uD83D\uDE09" # => "\xED\xA0\xBD\xED\xB8\x89"
s.valid_encoding? # => false
You can use the escape sequence \u{XXXXXX}, where XXXXXX is between 1 and 6 hex digits:
s = "\u{1F609}" # => "😉"
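If you are starting from a Java-style surrogate pair, the code point to put inside the braces can be computed with the standard UTF-16 decoding arithmetic. A minimal sketch; the helper name `surrogate_pair_to_codepoint` is made up for illustration and is not part of Ruby:

```ruby
# Hypothetical helper (not a Ruby built-in): combine a UTF-16 surrogate
# pair into the single code point it represents.
def surrogate_pair_to_codepoint(high, low)
  unless (0xD800..0xDBFF).cover?(high) && (0xDC00..0xDFFF).cover?(low)
    raise ArgumentError, "not a valid surrogate pair"
  end
  0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
end

cp = surrogate_pair_to_codepoint(0xD83D, 0xDE09)
cp.to_s(16)              # => "1f609"
cp.chr(Encoding::UTF_8)  # => "😉"
```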
The braces can also contain multiple runs separated by single spaces or tabs to encode multiple characters:
s = "\u{41f 440 438 432 435 442 2c 20 43c 438 440}!" # => "Привет, мир!"
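As a quick sanity check, this sketch inspects what the braced escape produces and compares it with two other standard ways of building a string from a numeric code point (`Integer#chr` with an explicit encoding, and `Array#pack` with the `U*` directive). The variable names are illustrative only:

```ruby
# A string literal containing a \u escape is UTF-8 encoded.
wink = "\u{1F609}"

wink.encoding         # => #<Encoding:UTF-8>
wink.length           # => 1 (one character)
wink.bytesize         # => 4 (four UTF-8 bytes)
wink.codepoints       # => [128521], i.e. 0x1F609
wink.valid_encoding?  # => true

# Equivalent ways to build the same string from the code point.
from_chr  = 0x1F609.chr(Encoding::UTF_8)
from_pack = [0x1F609].pack("U*")

wink == from_chr   # => true
wink == from_pack  # => true
```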
You could also use byte escapes to write a literal that contains the UTF-8 encoding of the character. That is less convenient, though, and the result is not necessarily a UTF-8-encoded string if the file's encoding differs:
# encoding: utf-8
s = "\xF0\x9F\x98\x89" # => "😉"
s.length # => 1
# encoding: iso-8859-1
s = "\xF0\x9F\x98\x89" # => "\xF0\x9F\x98\x89"
s.length # => 4
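If you do end up with the right bytes in a string that is tagged with some other encoding, one option is to relabel the bytes as UTF-8 with `String#force_encoding`. A minimal sketch, using `String#b` to simulate a string whose bytes are not tagged as UTF-8 (the variable names are illustrative):

```ruby
# Simulate a byte-escaped literal that did not end up tagged as UTF-8.
# String#b returns a copy with ASCII-8BIT (binary) encoding.
raw = "\xF0\x9F\x98\x89".b
raw.length            # => 4 (each byte counts as one character)

# force_encoding relabels the same bytes without converting them;
# dup first so the original string is left untouched.
utf8 = raw.dup.force_encoding(Encoding::UTF_8)
utf8.length           # => 1
utf8.valid_encoding?  # => true
utf8 == "\u{1F609}"   # => true
```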