
How to make Golang's rune-to-UTF-8 result the same as JS String.fromCharCode?


Go:

package main

import "fmt"

func main() {
    var int32s = []int32{
        8, 253, 80, 56, 30, 220, 217, 42, 235, 33, 211, 23, 231, 216, 234, 26,
    }

    fmt.Println("word: ", string(int32s))
}

JS:

let int32s = [8, 253, 80, 56, 30, 220, 217, 42, 235, 33, 211, 23, 231, 216, 234, 26]
let str = String.fromCharCode.apply(null, int32s);
console.log("word: " + str)

The two results above are not the same for some empty (non-printable) characters.
Is there any way to modify the Go code so it generates the same result as the JS one?


Solution

  • To cite the docs on String.fromCharCode:

    The static String.fromCharCode() method returns a string created from the specified sequence of UTF-16 code units.

    So each number in your int32s array is interpreted as a 16-bit integer providing a UTF-16 code unit, so that the whole sequence is interpreted as a series of code units forming a UTF-16-encoded string.
    I'd stress the last point: judging from the name of the variable, int32s, whoever authored the JS code appears to have an incorrect idea of what is happening there.

    Now back to the Go counterpart. Go does not have built-in support for UTF-16 encoding; its strings are normally encoded in UTF-8 (though they are not required to be, but let's not digress), and Go also provides the rune data type, which is an alias for int32. A rune represents a Unicode code point, that is, a number which is able to identify a complete Unicode character. (I'll get back to this fact and its relation to the JS code in a moment.)
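
    To illustrate the difference between code points and UTF-16 code units, here is a small self-contained sketch (the code point U+1F600 is my own pick for illustration, not part of the question's data):

    package main
    
    import (
        "fmt"
        "unicode/utf16"
    )
    
    func main() {
        // U+1F600 does not fit into 16 bits, so UTF-16 encodes it as a
        // surrogate pair of two code units, while in Go it is a single
        // rune (code point).
        r := rune(0x1F600)
        units := utf16.Encode([]rune{r})
        fmt.Printf("rune: %U, UTF-16 code units: %X\n", r, units)
        // Output: rune: U+1F600, UTF-16 code units: [D83D DE00]
    }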

    Now, what's wrong with your string(int32s) is that it interprets your slice of int32s the same way as it would a []rune (remember that rune is an alias for int32), so it takes each number in the slice to represent a single Unicode character and produces a string of them. (This string is internally encoded as UTF-8, but that fact is not really relevant to the problem.)

    In other words, the difference is this: the JS code interprets each number as a UTF-16 code unit, while the Go conversion interprets each number as a complete Unicode code point.
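
    For example, this tiny sketch (with made-up values, not the question's) shows the code-point interpretation at work:

    package main
    
    import "fmt"
    
    func main() {
        // Each int32 is taken to be one Unicode code point, so three
        // numbers produce a three-character string.
        fmt.Println(string([]int32{72, 101, 106})) // prints "Hej"
    }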

    The Go standard library provides a package to deal with the UTF-16 encoding: unicode/utf16, and we can use it to do what the JS code does: decode a UTF-16-encoded sequence of code units into Unicode code points, which we can then convert to a Go string:

    package main
    
    import (
        "fmt"
        "unicode/utf16"
    )
    
    func main() {
        var uint16s = []uint16{
            8, 253, 80, 56, 30, 220, 217, 42, 235, 33, 211, 23, 231, 216, 234, 26,
        }
    
        runes := utf16.Decode(uint16s)
    
        fmt.Println("word: ", string(runes))
    }
    

    Playground.

    (Note that I've changed the type of the slice to []uint16 and renamed it accordingly. Also, I've decoded the source slice into an explicitly named variable; this is done for clarity, to highlight what's happening.)

    This code produces the same gibberish as the JS code does in the Firefox console.

    Update on the

    2 results above are not the same for some empty characters.

    bit which I did not touch.

    The problem, as I understand it, is that your Go code prints something like
    ýP8ÜÙ*ë!ÓçØê
    while the JS code prints
    �ýP8�ÜÙ*ë!Ó�çØê�
    right?

    The problem here is in the different ways fmt.Println and console.log render the resulting string.

    Let me first state that your Go code happens to work correctly even without the proper decoding I've suggested, because all the integers in the slice are UTF-16 code units in the "basic" range (below the surrogates), so the "dumb" conversion works and produces the same string as the JS code does.
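
    You can verify that claim with a quick check; this sketch re-encodes the question's values as uint16 and compares the two conversions:

    package main
    
    import (
        "fmt"
        "unicode/utf16"
    )
    
    func main() {
        var int32s = []int32{
            8, 253, 80, 56, 30, 220, 217, 42, 235, 33, 211, 23, 231, 216, 234, 26,
        }
    
        // Every value is below 0xD800 (the start of the surrogate range),
        // so each UTF-16 code unit is its own code point and the "dumb"
        // conversion agrees with real UTF-16 decoding.
        uint16s := make([]uint16, len(int32s))
        for i, v := range int32s {
            uint16s[i] = uint16(v)
        }
        fmt.Println(string(int32s) == string(utf16.Decode(uint16s))) // true
    }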
    To see both strings "as is", you could do this:

    1. For Go, use fmt.Printf with the %q verb to see "special" Unicode (and ASCII) characters "escaped" using the Go rules in the printout:

      fmt.Println("%q\n", string(int32s))
      produces
      "\býP8\x1eÜÙ*ë!Ó\x17çØê\x1a"

      Notice these '\b', '\x1e' and other escapes:

      • '\b' is ASCII BS (backspace) control character, code 0x08 — see http://man-ascii.com/.
      • '\x1e' is a byte with the code 0x1E, which is ASCII RS (record separator).
      • …and so on.

      As you can see, these are control characters, which are not printable.

    2. For JS, print the value of the resulting string without using console.log—just save its value in a variable then enter its name at the console and hit Enter—to have its value printed "as is":

      > let int32s = [8, 253, 80, 56, 30, 220, 217, 42, 235, 33, 211, 23, 231, 216, 234, 26]
      > str = String.fromCharCode.apply(null, int32s);
      > str
      "\u0008ýP8\u001eÜÙ*ë!Ó\u0017çØê\u001a"
      

      Note that the string contains the "\uXXXX" escapes. They define Unicode code points (BTW Go supports the same syntax), and these escapes define the same code points as can be seen in the Go example:

      • "\u0008" is a character with code 8, or 0x08.
      • "\u001e" is a character with code 0x1E.
      • …and so on.

    As you can see, the strings produced are the same; the only difference is that Go's string is encoded in UTF-8, and because of this, peering into its contents with fmt.Printf and %q looks at the encoded bytes, which is why Go prints the escapes using the "minimal" encoding. We could use the escaping from the JS example as well: you can check that running
    fmt.Println("\býP8\x1eÜÙ*ë!Ó\x17çØê\x1a" == "\u0008ýP8\u001eÜÙ*ë!Ó\u0017çØê\u001a")
    prints true.

    So, as you can see by now, console.log replaces each non-printable character with the special Unicode code point U+FFFD, called the Unicode replacement character, which is usually rendered as a black rhombus with a white question mark in it.
    Go's fmt.Println does not do that: it merely sends the bytes to the output "as is".
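
    If you want the Go program to print the same rhombus-decorated string the console shows, one option is to substitute U+FFFD for the non-printable runes yourself before printing. This is my own sketch, assuming unicode.IsPrint is a close enough match for what the console considers printable:

    package main
    
    import (
        "fmt"
        "strings"
        "unicode"
        "unicode/utf16"
    )
    
    func main() {
        var uint16s = []uint16{
            8, 253, 80, 56, 30, 220, 217, 42, 235, 33, 211, 23, 231, 216, 234, 26,
        }
        s := string(utf16.Decode(uint16s))
    
        // Swap every non-printable rune for U+FFFD to imitate the way
        // the browser console renders control characters.
        shown := strings.Map(func(r rune) rune {
            if !unicode.IsPrint(r) {
                return '\uFFFD'
            }
            return r
        }, s)
        fmt.Println("word: ", shown)
    }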

    Hope this explains the observed difference.