string rust utf-8 byte ascii

The from_utf8 Rust function cannot read some ASCII strings (invalid utf-8 sequence of 1 bytes)


I am trying to convert a vector of ASCII bytes into a Rust string. I found the std::str::from_utf8() function, which should be able to handle all ASCII strings. For some reason it cannot read the copyright symbol, as shown in this code sample:

let buf = vec![0xA9, 0x41, 0x52, 0x54]; //©ART
println!(
    "{}",
    match std::str::from_utf8(&buf) {
        Ok(x) => x,
        Err(x) => {
            println!("ERROR: {}", x);
            "failed"
        }
    }
);
// > ERROR: invalid utf-8 sequence of 1 bytes from index 0

According to https://www.ascii-code.com/CP1252/169, 0xA9 is a valid ASCII character, and according to https://www.compart.com/en/unicode/U+00A9 it is also a valid UTF-8 character.

I also tried String::from_utf8_lossy(), but that gave me �ART as a result, which is not what the string should be.

Am I missing something here, or is this a bug in the way Rust handles ASCII?


Solution

  • 0xA9 is not ASCII; ASCII is only a 7-bit encoding and this value has the 8th bit set.

    It can be interpreted as extended ASCII, which means you need to know the character set in advance to interpret it as "©". Your link shows that it is "©" in the Windows-1252 character set, but in the Code page 437 character set 0xA9 is "⌐". And there are many other character sets.

    Since 0xA9 is not ASCII, it is not valid UTF-8 either, at least not on its own. In UTF-8, a set 8th bit marks a byte as part of a multi-byte sequence. More importantly, the bit pattern of 0xA9 (10101001) starts with 10xxxxxx, which marks a continuation byte, i.e. one from the middle or end of a multi-byte sequence (see UTF-8 on Wikipedia). Any UTF-8 decoder that encounters it without a preceding lead byte will reject it. The UTF-8 encoding of "©" (U+00A9) is the two bytes 0xC2 0xA9.

    If you want to decode bytes from an extended ASCII character set into a Rust string, you need a decoder for that specific character set. A crate like encoding_rs, which supports Windows-1252 among other encodings, can do that.
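
To make the points above concrete, here is a minimal std-only sketch. It is not the encoding_rs approach: it assumes the input is ISO-8859-1 (Latin-1), where every byte maps to the Unicode code point of the same value. Windows-1252 agrees with Latin-1 for 0xA0 to 0xFF (so 0xA9 is "©" in both) but differs in 0x80 to 0x9F, so this would not be a complete Windows-1252 decoder. The names `ByteKind`, `classify` and `decode_latin1` are made up for illustration.

```rust
/// How a single byte can appear in a UTF-8 stream.
#[derive(Debug, PartialEq)]
enum ByteKind {
    Ascii,        // 0xxxxxxx: a complete one-byte character
    Continuation, // 10xxxxxx: only valid after a lead byte
    Lead,         // 110xxxxx, 1110xxxx, 11110xxx: starts a multi-byte sequence
    Invalid,      // 0xC0, 0xC1 and 0xF5..=0xFF never occur in valid UTF-8
}

fn classify(byte: u8) -> ByteKind {
    match byte {
        0x00..=0x7F => ByteKind::Ascii,
        0x80..=0xBF => ByteKind::Continuation,
        0xC2..=0xF4 => ByteKind::Lead,
        _ => ByteKind::Invalid,
    }
}

/// Decode bytes under the ASSUMPTION that they are ISO-8859-1 (Latin-1):
/// each byte becomes the Unicode code point with the same numeric value.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

fn main() {
    let buf = vec![0xA9, 0x41, 0x52, 0x54];

    // 0xA9 = 10101001 is a continuation byte, so it cannot start a sequence.
    println!("{:?}", classify(buf[0])); // Continuation
    assert!(std::str::from_utf8(&buf).is_err());

    // Interpreted as Latin-1, the same bytes decode fine.
    let s = decode_latin1(&buf);
    println!("{}", s); // ©ART

    // The resulting Rust string stores "©" as the two UTF-8 bytes C2 A9.
    println!("{:X?}", s.as_bytes()); // [C2, A9, 41, 52, 54]
}
```

If the data really is Windows-1252 and may contain bytes in the 0x80 to 0x9F range (for example 0x80 for "€"), a dedicated decoder such as the one in encoding_rs would be the safer choice.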