utf-8 character-encoding iso-8859-1 codepoint

Are there examples of ISO 8859-1 text files which are valid, but different in UTF-8?


I know that UTF-8 supports far more characters than Latin-1 (even with its extensions). But are there files that are valid in both encodings, yet decode to different characters? In other words, can the content change depending on which encoding you assume the file uses?

I also know that a big chunk of Latin-1 maps 1:1 to the same part of UTF-8. The question is: which code points change their value when interpreted in the other encoding (not invalid, just different)?


Solution

  • Latin-1 is a single-byte encoding (meaning 1 character = 1 byte) that uses all possible byte values, so any byte maps to something in Latin-1. Literally any file is therefore "valid" in Latin-1: you can interpret any file as Latin-1 and you'll get… something… as a result.

    So yes: take any valid UTF-8 file and interpret it as Latin-1. It's valid in both UTF-8 and Latin-1. The first 128 characters are the same in both encodings, because both are based on ASCII; but if your UTF-8 file uses any non-ASCII characters, those will be interpreted as gibberish (yet valid) Latin-1.

    bytes            encoding  text
    e6bc a2e5 ad97   UTF-8     漢字
    e6bc a2e5 ad97   Latin-1   æ¼¢å­  👈 valid but nonsensical
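If you want to check this yourself, here is a minimal Python sketch of the example above. It assumes Python's built-in `latin-1` codec, which maps every byte 0x00–0xFF to the code point with the same number; the variable names are only for illustration. Note that the Latin-1 reading actually has six characters: the last two (a soft hyphen and a C1 control character) are usually invisible, which is why only four show up in the table.

```python
# Every possible byte value decodes under Latin-1, so decoding can never fail.
all_bytes = bytes(range(256))
assert len(all_bytes.decode("latin-1")) == 256

# The same six bytes, read under two different assumptions.
raw = bytes.fromhex("e6bca2e5ad97")
as_utf8 = raw.decode("utf-8")      # two characters
as_latin1 = raw.decode("latin-1")  # six characters, one per byte

print(as_utf8)                                  # 漢字
print([f"U+{ord(c):04X}" for c in as_latin1])
# ['U+00E6', 'U+00BC', 'U+00A2', 'U+00E5', 'U+00AD', 'U+0097']

# Pure ASCII content is the one case where both readings agree.
ascii_raw = b"plain ASCII"
assert ascii_raw.decode("utf-8") == ascii_raw.decode("latin-1")
```

So the bytes that "change value" are exactly the ones at 0x80 and above: under UTF-8 they only ever occur as parts of multi-byte sequences, while under Latin-1 each one is a character of its own.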