Many resources exist describing character encoding best practices and bit sequences, but without an accurate map of the content’s journey, I’m struggling to understand and apply them.
But my mental model is missing so many steps!
I’ve included a diagram to illustrate. Purple is the server; red is the browser; green is the OS (Windows XP in the diagram, but could be anything).
I think an important piece your mental model may be missing is the distinction between bytes and characters. At different steps and different levels, text is either treated as opaque, meaningless bytes, or the computer is aware of the text as characters.
When the computer treats text as characters, it still stores them as bytes in memory, but that is an implementation detail, and the exact in-memory representation may differ between programs. The important part is that the computer is aware that "漢字" is "漢字", and can produce a byte representation of these characters in any encoding that supports them, at any time.
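To make the bytes-versus-characters distinction concrete, here is a minimal sketch in Python, chosen only because its str type is character-aware in the sense described above. The variable names are illustrative and not part of the original discussion.

```python
# The same two characters, held as text (Python's str is character-aware).
text = "漢字"

# Three different byte representations of that same text.
utf8_bytes = text.encode("utf-8")
sjis_bytes = text.encode("shift_jis")
utf16_bytes = text.encode("utf-16-le")

# Print each representation as hex to show that the bytes differ.
for name, raw in [("UTF-8", utf8_bytes), ("Shift_JIS", sjis_bytes), ("UTF-16LE", utf16_bytes)]:
    print(name, raw.hex(" "), f"({len(raw)} bytes)")

# Decoding each byte sequence with its matching encoding recovers identical text.
assert utf8_bytes.decode("utf-8") == text
assert sjis_bytes.decode("shift_jis") == text
assert utf16_bytes.decode("utf-16-le") == text
```

The bytes differ from encoding to encoding, yet the text is the same in all three cases. A program that only ever sees the bytes cannot tell which characters they stand for unless it also knows the encoding.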
The browser is character aware. With anything happening inside it, the browser is treating text as text. When it gets any files from the server, it looks at the HTTP headers or other fallback indicators to figure out what encoding that file is in, decodes it from that encoding, and treats all text as known, specific characters henceforth.
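The decoding step can be sketched roughly as follows. This is a simplification under stated assumptions: real browsers also consult byte order marks, meta tags, and other heuristics, and the fallback of windows-1252 is only an assumed default for illustration. The function names detect_charset and decode_body are hypothetical.

```python
from email.message import Message


def detect_charset(content_type_header, fallback="windows-1252"):
    """Extract the charset parameter from a Content-Type header value.

    The fallback is an assumption for this sketch, not what every browser uses.
    """
    msg = Message()
    msg["Content-Type"] = content_type_header
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or fallback


def decode_body(body, content_type_header):
    """Turn the raw response bytes into text, as a browser does once on arrival."""
    return body.decode(detect_charset(content_type_header))


# Bytes as they might arrive from a server that declares Shift_JIS.
raw_body = "漢字".encode("shift_jis")
page_text = decode_body(raw_body, "text/html; charset=Shift_JIS")
print(page_text)
```

After this point the program works with page_text as characters, and the encoding the bytes arrived in no longer matters.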
When you enter text into a form, the OS takes care of the underlying details: it receives key codes from the keyboard, maps them through the chosen keyboard layout, perhaps involves an IME for text transformation (e.g. to enter 日本語), and provides the browser with characters.
When it comes time to send those characters to the server, the browser determines what encoding needs to be used, based on various factors like the form's accept-charset attribute or fallbacks like the site's determined encoding. It then represents the text as bytes in that encoding. At this point, characters that the target encoding cannot represent may be substituted by HTML numeric character references (for example, &#28450; for 漢). It may then apply another transport encoding like URL-percent encoding to those bytes. This then gets sent to the server.
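The submission steps above can be approximated in a few lines. This is a sketch, not a faithful reimplementation of any browser: encode_form_value is a hypothetical helper, and real form encoding has extra rules (for instance, spaces become + in application/x-www-form-urlencoded bodies) that are left out here.

```python
from urllib.parse import quote_from_bytes


def encode_form_value(text, form_encoding):
    """Encode text to bytes in the form's encoding, then percent-encode the bytes.

    Characters the encoding cannot represent become numeric character
    references such as &#28450; via the xmlcharrefreplace error handler.
    """
    raw = text.encode(form_encoding, errors="xmlcharrefreplace")
    # safe="" percent-encodes everything except unreserved characters.
    return quote_from_bytes(raw, safe="")


# UTF-8 can represent both characters directly.
print(encode_form_value("漢字", "utf-8"))

# ISO-8859-1 can represent é but not 漢, so 漢 is replaced by a reference first.
print(encode_form_value("é漢", "iso-8859-1"))
```

Note that in the second case the server receives the literal characters of the reference, percent-encoded, and has no way to tell them apart from a user who typed that reference by hand. That is one practical reason to use an encoding that can represent everything.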
PHP doesn't by default do anything with encodings. It is not text-aware and treats all data as mere meaningless bytes. So you have to make sure in your code that you know what encoding any received text is in and treat it accordingly. PHP will decode URL-percent encoding for populating $_GET and $_POST, but these variables will just contain the transport-decoded bytes, not text.
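The difference between transport-decoded bytes and text can be shown with a short sketch. It is written in Python rather than PHP because Python keeps bytes and text as separate types, which makes the two steps visible. The helper decode_param and its default of UTF-8 are assumptions for illustration.

```python
from urllib.parse import unquote_to_bytes


def decode_param(percent_encoded, assumed_encoding="utf-8"):
    """Undo percent-encoding (transport), then decode bytes to text (encoding).

    Raises UnicodeDecodeError if the bytes are not valid in assumed_encoding.
    """
    raw = unquote_to_bytes(percent_encoded)  # step 1: still just bytes
    return raw.decode(assumed_encoding)      # step 2: now characters


# Step 1 alone yields bytes, comparable to what a byte-oriented server language holds.
raw = unquote_to_bytes("%E6%BC%A2%E5%AD%97")
print(type(raw).__name__, raw.hex(" "))

# Step 2 only works if the assumed encoding matches what the client used.
print(decode_param("%E6%BC%A2%E5%AD%97"))

# The same text sent as Shift_JIS is not valid UTF-8, so the assumption fails loudly.
try:
    decode_param("%8A%BF%8E%9A")
except UnicodeDecodeError:
    print("not valid UTF-8; the client used a different encoding")
```

Validating received bytes against the expected encoding at the boundary, as the try block does, is one way to catch a mismatch early instead of storing garbled data.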
Whatever you output from PHP will be output as is. What that is depends on where it came from. Anything that comes from (source code) files on disk depends on how it was saved in the text editor. Anything coming from a database depends on how you established the database connection; databases are generally text-aware and will provide you the text in the encoding you request, which you can configure. It's usually best to ensure everything is in UTF-8 all the way.
PHP and/or the web server should output headers that correctly denote the encoding of the content you're outputting, so the browser can determine it reliably.
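As a rough illustration of why the declared encoding must match the actual bytes, the sketch below builds a response whose header and body agree, then shows what a client would see if it decoded the same bytes under a different assumption. build_response is a hypothetical helper, not an API of any particular server.

```python
def build_response(text, charset="utf-8"):
    """Encode the body and declare the same charset in the Content-Type header."""
    body = text.encode(charset)
    headers = {
        "Content-Type": f"text/html; charset={charset}",
        "Content-Length": str(len(body)),  # length in bytes, not characters
    }
    return headers, body


headers, body = build_response("漢字")
print(headers["Content-Type"])

# A client that trusts the header gets the original characters back.
print(body.decode("utf-8"))

# A client that guesses a different encoding gets garbled text from the same bytes.
print(body.decode("latin-1"))
```

The bytes on the wire are identical in both cases; only the interpretation differs, which is why the header matters.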