
Why are certain characters prohibited in the HTML5 spec?


According to the HTML5 spec (in the text just after the table of character reference replacements), numeric character references to the following code points are prohibited:

Otherwise, return a character token for the Unicode character whose code point is that number. Additionally, if the number is in the range 0x0001 to 0x0008, 0x000D to 0x001F, 0x007F to 0x009F, 0xFDD0 to 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF, then this is a parse error.

What was the reasoning or motivation behind this exclusion?


Solution

  • They're code points that cause interoperability problems, either with XML/XHTML documents or with existing HTML parsers. Since none of them has any obvious valid use, they should be avoided.

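    Most of these are invalid or discouraged in XML 1.0. The control characters U+0000–U+0008, U+000B, and U+000E–U+001F are not allowed at all, nor are U+FFFE and U+FFFF; the other noncharacters (U+FDD0–U+FDEF, and U+nFFFE and U+nFFFF at the end of each supplementary plane) are explicitly discouraged. Character references in the range 0x80–0x9F produce different results in XML and HTML parsers because of the substitutions in the immediately preceding table: an HTML parser maps them to Windows-1252 characters, while an XML parser returns the C1 control code points themselves. There are also many non-browser HTML parsers that do not implement this weird historical quirk.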
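
To make the quoted list and the 0x80–0x9F discrepancy concrete, here is a minimal Python sketch. The function name `is_flagged_code_point` is made up for illustration and only encodes the ranges quoted in the question, so it is not a full implementation of the spec's character reference handling. The second half uses the standard library's `html.unescape` and `xml.etree.ElementTree` as stand-ins for "an HTML parser" and "an XML parser".

```python
import html
import xml.etree.ElementTree as ET


def is_flagged_code_point(cp: int) -> bool:
    """Return True if `cp` is in the parse-error set quoted in the question.

    Hypothetical helper: it mirrors only the quoted sentence, nothing else
    from the spec (for example, it does not special-case 0 or surrogates).
    """
    if 0x0001 <= cp <= 0x0008 or 0x000D <= cp <= 0x001F:
        return True  # C0 controls, minus tab, LF, and FF
    if cp == 0x000B:
        return True  # vertical tab, listed individually
    if 0x007F <= cp <= 0x009F:
        return True  # DEL and the C1 controls
    if 0xFDD0 <= cp <= 0xFDEF:
        return True  # noncharacter block in the BMP
    # Last two code points of every plane: U+FFFE, U+FFFF, ..., U+10FFFF
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


# The same reference, decoded under HTML rules and under XML rules.
as_html = html.unescape("&#x80;")               # Windows-1252 substitution
as_xml = ET.fromstring("<a>&#x80;</a>").text    # literal U+0080

print(is_flagged_code_point(0x80))   # True
print(repr(as_html), repr(as_xml))   # '€' '\x80'
```

The two decoded values differ even though the input is identical, which is the interoperability problem the answer describes.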