I am trying to convert a vector of ASCII bytes into a Rust string. I found the std::str::from_utf8() function, which should be able to handle all ASCII strings. For some reason it cannot read the copyright symbol, as shown in this code sample:
let buf = vec![0xA9, 0x41, 0x52, 0x54]; //©ART
println!(
    "{}",
    match std::str::from_utf8(&buf) {
        Ok(x) => x,
        Err(x) => {
            println!("ERROR: {}", x);
            "failed"
        }
    }
);
// > ERROR: invalid utf-8 sequence of 1 bytes from index 0
According to https://www.ascii-code.com/CP1252/169, 0xA9 is a valid ASCII character, and according to https://www.compart.com/en/unicode/U+00A9 it is also a valid UTF-8 character.
I also tried String::from_utf8_lossy(), but that gave me �ART as a result, which is not what the string should be.
Am I missing something here, or is this a bug in the way Rust handles ASCII?
0xA9 is not ASCII; ASCII is only a 7-bit encoding and this value has the 8th bit set.
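You can check this with the standard library itself; a minimal illustration using u8::is_ascii:

let byte: u8 = 0xA9;
// ASCII only covers 0x00..=0x7F; 0xA9 = 0b1010_1001 has the high bit set.
assert!(!byte.is_ascii());
assert!(0x41u8.is_ascii()); // 'A', by contrast, is plain ASCII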
It can be interpreted as extended ASCII, which means you need to know the character set in advance to interpret it as "©". Your link shows that it is "©" in the Windows-1252 character set, but another link shows that 0xA9 is "⌐" in the Code page 437 character set. And there are many other character sets.
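To make that concrete, here is a small sketch; the Windows-1252 and Code page 437 readings are the ones mentioned above, and only the Windows-1252/Latin-1 case happens to line up numerically with Unicode:

let byte = 0xA9u8;
// Windows-1252 (and Latin-1) map bytes 0xA0..=0xFF to the Unicode code
// points with the same numeric value, so that interpretation is a plain cast:
assert_eq!(byte as char, '©');
// Under Code page 437 the very same byte means '⌐' (reversed not sign);
// that mapping has no arithmetic relationship to Unicode and needs a lookup table.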
Since 0xA9 is not ASCII, it is not UTF-8 either, at least not on its own. The 8th bit being set indicates it is part of a multi-byte sequence, and more importantly the bit representation of 0xA9 starts with 10xxxxxx, which marks it as a continuation byte in the middle of a multi-byte sequence (see UTF-8 on Wikipedia). So any UTF-8 decoder encountering it without a preceding multi-byte start byte will reject it.
If you want to decode an extended ASCII character set into a Rust string, you need to decode it differently. A crate such as encoding_rs can do that.
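For example, a sketch assuming the encoding_rs crate (e.g. encoding_rs = "0.8" in Cargo.toml), decoding the bytes as Windows-1252:

let buf = vec![0xA9, 0x41, 0x52, 0x54];
// decode() returns the decoded text, the encoding actually used, and a flag
// indicating whether any malformed sequences were replaced.
let (decoded, _encoding_used, had_errors) = encoding_rs::WINDOWS_1252.decode(&buf);
assert!(!had_errors);
assert_eq!(decoded, "©ART");

For this particular byte range you could also map each byte straight to a char (as in the earlier sketch), but a decoder crate covers the whole character set and handles malformed input for you.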