php unicode normalization web-standards unicode-normalization

Normalizing Unicode according to the W3C in PHP

While validating my website's HTML code in the W3C validator I got the following warning:

Line 157, Column 220: Text run is not in Unicode Normalization Form C.

…i͈̭̋ͥ̂̿̄̋̆ͣv̜̺̋̽͛̉͐̀͌̚e͖̼̱ͣ̓ͫ͆̍̄̍͘-̩̬̰̮̯͇̯͆̌ͨ́͌ṁ̸͖̹͎̱̙̱͟͡i̷̡͌͂͏̘̭̥̯̟n̏͐͌̑̄̃͘͞…

I'm developing it in PHP 5.3.x, so I can use the Normalizer class.

So, in order to fix this, should I use Normalizer::normalize($output) when displaying any input made by a user (e.g. a comment) or should I use Normalizer::normalize($input) for any user input before storing it in the database?

tl;dr: should I use Unicode normalization before storing user input in the database or just when it's displayed?

Solution

It is up to you to decide, on the basis of the purpose and nature of your application, whether you apply normalization upon reading user input, or storing it to a database, or when writing it, or at all. To summarize the long thread mentioned in the comments to the question, also available in the official list archive at http://validator.w3.org/feedback.html

The warning message comes from the experimental “HTML5 validation” (which is really a linter, applying subjective rules in addition to some formal tests).
The message is not based on any requirement in HTML5 drafts but on opinions on what might cause problems in some software.
The opinions originally made “HTML5 validation” issue an error message, now a warning.

It is certainly possible, though uncommon, to get unnormalized data as user input. This does not depend on normalization carried out by browsers (they don’t do such things, though they conceivably might in the future) but on input methods and habits. For example, methods of typing the letter ü (u umlaut, or u with diaeresis) tend to produce the character in precomposed form, as normalized. People can produce it as unnormalized, in decomposed form, as letter u followed by combining diaeresis, but they usually have no reason to do so, and most people wouldn’t even know how to do that.

If you do string comparisons in your software, they may or may not (depending on comparison routines used) treat e.g. a precomposed ü as equal to the decomposed presentation. Simple implementations treat them as different, as they are definitely distinct at the simple character level (Unicode code points).

One reason to normalize at some point, in the writing phase at the latest, is that precomposed characters generally get displayed more reliably. To present a normalized ü, a program just has to pick up a glyph from a font. To present a decomposed ü, a program must either recognize it as canonically equivalent to the normalized ü or write the letter u with a diaeresis symbol properly placed above it, with due attention to the graphic properties of the glyph for u, and many programs fail in this.

On the other hand, in the rare cases where unnormalized data is received as user input, the user may well have a reason to have produced it. He may have the idea that normalized ü and unnormalized ü are distinct and need to be treated as such.