phpunicodenormalizationweb-standardsunicode-normalization

Normalizing Unicode according to the W3C in PHP


While validating my website's HTML code in the W3C validator I got the following warning:

Line 157, Column 220: Text run is not in Unicode Normalization Form C.

…i͈̭̋ͥ̂̿̄̋̆ͣv̜̺̋̽͛̉͐̀͌̚e͖̼̱ͣ̓ͫ͆̍̄̍͘-̩̬̰̮̯͇̯͆̌ͨ́͌ṁ̸͖̹͎̱̙̱͟͡i̷̡͌͂͏̘̭̥̯̟n̏͐͌̑̄̃͘͞…

I'm developing it in PHP 5.3.x, so I can use the Normalizer class.

So, in order to fix this, should I use Normalizer::normalize($output) when displaying any input made by a user (e.g. a comment) or should I use Normalizer::normalize($input) for any user input before storing it in the database?

tl;dr: should I use Unicode normalization before storing user input in the database or just when it's displayed?


Solution

  • It is up to you to decide, on the basis of the purpose and nature of your application, whether you apply normalization upon reading user input, or storing it to a database, or when writing it, or at all. To summarize the long thread mentioned in the comments to the question, also available in the official list archive at http://validator.w3.org/feedback.html

    It is certainly possible, though uncommon, to get unnormalized data as user input. This does not depend on normalization carried out by browsers (they don’t do such things, though they conceivably might in the future) but on input methods and habits. For example, methods of typing the letter ü (u umlaut, or u with diaeresis) tend to produce the character in precomposed form, as normalized. People can produce it as unnormalized, in decomposed form, as letter u followed by combining diaeresis, but they usually have no reason to do so, and most people wouldn’t even know how to do that.

    If you do string comparisons in your software, they may or may not (depending on comparison routines used) treat e.g. a precomposed ü as equal to the decomposed presentation. Simple implementations treat them as different, as they are definitely distinct at the simple character level (Unicode code points).

    One reason to normalize at some point, in the writing phase at the latest, is that precomposed characters generally get displayed more reliably. To present a normalized ü, a program just has to pick up a glyph from a font. To present a decomposed ü, a program must either recognize it as canonically equivalent to the normalized ü or write the letter u with a diaeresis symbol properly placed above it, with due attention to the graphic properties of the glyph for u, and many programs fail in this.

    On the other hand, in the rare cases where unnormalized data is received as user input, the user may well have a reason to have produced it. He may have the idea that normalized ü and unnormalized ü are distinct and need to be treated as such.