htmlcharacter-encodingcyrillic

HTML5 Encoding & Cyrillic


Something that made me curious - supposedly the default character encoding in HTML5 is UTF-8. However if I have a plain simple HTML file with an HTML5 doctype like the code below, I get:

"hello" in Russian: "ЗдраÑтвуйте"

In Chrome 33+, Safari 6, IE11, etc.

<!DOCTYPE html>

<html>

<head></head>

<body>
    <p>"hello" in Russian is "здраствуйте"</p>
</body>

</html>

What gives? Shouldn't the browser utilize the UTF-8 unicode standard and display the text correctly? I'm using Coda which is set to save html files with UTF-8 encoding by default so that's not the problem.


Solution

  • The text data in the example is UTF-8 encoded text misinterpreted as window-1252 encoded. The reason is that the encoding has not been specified and browsers are forced to make a guess. To fix this, specify the encoding; see the W3C page Character encodings. Two simple ways that work independently of server settings, as long as the server does not send wrong encoding information in HTTP headers:

    1) Save the file as UTF-8 with BOM (there is probably an option for this in your authoring program.

    2) Add the following tag into the head part:

    <meta charset=utf-8>
    

    There is no single default encoding specified for HTML5. On the contrary, browsers are expected to make guesses when no encoding has been declared. This is a fairly complex process, described in 8.2.2.2 Determining the character encoding.