Encoding.Unicode.GetBytes(string) returns the UTF-16 byte representation of a string, but since strings are encoded as UTF-16 in C#, is it equivalent to getting the raw bytes of the string using MemoryMarshal.Cast<char, byte>(str.AsSpan())?
Looking at the implementation of Encoding.Unicode.GetBytes, it does look more complicated, so what am I missing?
They are mostly, but not quite, equivalent.
.NET strings are merely a sequence of UTF-16 code units, regardless of whether they form valid Unicode strings. In particular, this means that unpaired surrogates produce different behaviour. For example, the test string あ Hello!\uDDDD produces different bytes with the two approaches (the non-ASCII Japanese character didn't end up causing a difference, but I always like to check for that):
Encoding.Unicode.GetBytes:
66, 48, 32, 0, 72, 0, 101, 0, 108, 0, 108, 0, 111, 0, 33, 0, 253, 255
MemoryMarshal.Cast:
66, 48, 32, 0, 72, 0, 101, 0, 108, 0, 108, 0, 111, 0, 33, 0, 221, 221
While .NET strings can be invalid Unicode, Encoding.Unicode.GetBytes deals only with valid Unicode, and as such it replaces the unpaired surrogate with the Unicode replacement character �, U+FFFD, which is the trailing 253, 255 above. (It can be configured to throw an exception instead, but off the top of my head I don't think it can be easily configured to mimic the MemoryMarshal.Cast behaviour.)
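As a rough illustration of the "throw instead" option, here is a minimal sketch that assumes the three-argument UnicodeEncoding constructor with throwOnInvalidBytes set to true. The variable names (strict, threw, ok) are just for this example.

```csharp
using System;
using System.Text;

// Little-endian UTF-16, no BOM, and throw on invalid input
// instead of substituting U+FFFD.
var strict = new UnicodeEncoding(bigEndian: false, byteOrderMark: false, throwOnInvalidBytes: true);

bool threw = false;
try
{
    // The string ends with an unpaired low surrogate.
    strict.GetBytes("あ Hello!\uDDDD");
}
catch (ArgumentException ex) // EncoderFallbackException derives from ArgumentException
{
    threw = true;
    Console.WriteLine($"Rejected: {ex.GetType().Name}");
}

// A well-formed string still encodes normally with the strict encoder.
byte[] ok = strict.GetBytes("あ");
Console.WriteLine(string.Join(", ", ok)); // 66, 48
```

This only changes how invalid input is reported; it does not make the encoder pass the unpaired surrogate through unchanged.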
Additionally, there may be system-dependent behaviour with MemoryMarshal.Cast. Chars in .NET are defined to be UTF-16 code units, but the documentation does not specify endianness. Encoding.Unicode always emits little-endian bytes, whereas MemoryMarshal.Cast only reinterprets the in-memory representation, so on a big-endian architecture each pair of bytes may well come out swapped (though from my quick searching, .NET implementations on big-endian are rare). Additionally, while I'm not too experienced in memory management code, it is possible that MemoryMarshal.Cast could fail on some platforms per the documentation:
This method is supported only on platforms that support misaligned memory access or when the memory block is aligned by other means.
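To make the endianness point concrete, here is a small sketch that picks the encoding matching the machine's byte order via BitConverter.IsLittleEndian and compares it with the reinterpreted bytes. It assumes chars are stored in native byte order, and I have only been able to try it on little-endian hardware. The names raw, native and same are just for this example.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// A well-formed string, so the encoder's replacement behaviour does not interfere.
string s = "あ Hello!";

// Reinterpret the string's chars as bytes, in whatever order the machine stores them.
ReadOnlySpan<byte> raw = MemoryMarshal.Cast<char, byte>(s.AsSpan());

// Pick the UTF-16 flavour that matches the machine's byte order.
Encoding native = BitConverter.IsLittleEndian ? Encoding.Unicode : Encoding.BigEndianUnicode;

bool same = raw.SequenceEqual(native.GetBytes(s));
Console.WriteLine($"Little-endian: {BitConverter.IsLittleEndian}, raw bytes match: {same}");
```

On a little-endian machine this compares against Encoding.Unicode and prints a match. On a big-endian one I would expect the match to be against Encoding.BigEndianUnicode instead, but that part is an assumption.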
Here's the test code for the unpaired surrogate string:
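using System;
using System.Runtime.InteropServices;
using System.Text;

// The string ends with an unpaired low surrogate (U+DDDD).
string s = "あ Hello!\uDDDD";
// Encodes to little-endian UTF-16, replacing invalid sequences with U+FFFD by default.
byte[] arr1 = Encoding.Unicode.GetBytes(s);
// Reinterprets the string's chars as bytes without any validation.
ReadOnlySpan<byte> arr2 = MemoryMarshal.Cast<char, byte>(s.AsSpan());
Console.WriteLine("Encoding.Unicode.GetBytes: ");
foreach (var x in arr1)
{
    Console.Write(x);
    Console.Write(", ");
}
Console.WriteLine();
Console.WriteLine("MemoryMarshal.Cast: ");
foreach (var x in arr2)
{
    Console.Write(x);
    Console.Write(", ");
}
Console.WriteLine();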