I built a parser for HTML, but I worked under the assumption that tags would only ever take one of two forms:
<foo> </foo>
<foo/>
Obviously that is wrong. Tags such as base, meta, and link need neither a closing tag nor a solidus.
I kind of wish this wasn't the case, because I have found things like this in a script:
for(var d=b.length,e=b[a];a<d>>1;)
Oh look, the mythical <d> tag.
So I need to make myself a whitelist of tags to ignore. Is there a comprehensive list of tags that do not require a solidus or closing tag? If not, I'll have to rewrite my parser.
Thanks
You can extract a list from the WHATWG HTML Living Standard. Or, if you prefer, the W3C's HTML 5 Specification or the subsequent draft. According to Wikipedia, the conflict has somewhat recently been resolved in favour of WHATWG, so you probably want to go with the first one.
In any case, pay particular attention to the subheading "Tag omission in text/html" in each element description. But you need to read the document carefully to understand the ins and outs of HTML parsing.
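As a starting point for the whitelist you asked about, here is a minimal Python sketch. The void element list is the one I believe the Living Standard currently gives (check it against the spec, since it has changed over time). The raw_text_end helper is my own hypothetical name, not something from the spec: it shows why a script body such as a<d>>1 should never reach your tag matcher, because inside script and style the only thing that ends the content is the matching end tag. It deliberately ignores the extra escaping rules the spec defines for script content.

```python
import re

# Void elements: a start tag only, never an end tag, solidus optional.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})

# Raw text elements: their content is not markup, so "<d>" inside
# a script is plain text, not a tag.
RAW_TEXT_ELEMENTS = frozenset({"script", "style"})


def is_void(tag_name: str) -> bool:
    """True if the element never takes an end tag."""
    return tag_name.lower() in VOID_ELEMENTS


def raw_text_end(html: str, start: int, tag_name: str) -> int:
    """Index of the '</tag' that ends a raw text element.

    Simplified assumption: the content ends at the first '</tag'
    followed by whitespace, '/' or '>', matched case-insensitively.
    Returns len(html) if no end tag is found.
    """
    pattern = re.compile(
        r"</" + re.escape(tag_name) + r"(?=[\s/>])", re.IGNORECASE
    )
    match = pattern.search(html, start)
    return match.start() if match else len(html)
```

With this, after reading a script start tag you would jump straight to raw_text_end(...) instead of scanning the body for tags.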
Note: It's not just that end tags can be omitted. There are also elements whose open tag can be omitted. (The classic example is <tbody>, which is hardly ever physically present in an HTML document, but there are lots of others. <head>, for example.) The mere fact that an element's open tag has been omitted does not force the omission of the element's close tag, although it's pretty commonly the case. So you can't do it with just a list of omittable tags; you need to take element containment rules into account, too.
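To illustrate why containment matters for omitted end tags, here is a rough Python sketch that works on an already-tokenized stream and inserts the end tags a few start tags imply. The IMPLIED_CLOSE table and the function name are my own illustration and cover only a handful of cases; the real rules in the spec are far more detailed, and this does nothing about omitted start tags such as <tbody>. It also assumes void elements have been filtered out beforehand.

```python
# For each start tag, the open elements it implicitly closes when one
# of them is the innermost open element. Tiny, incomplete subset.
IMPLIED_CLOSE = {
    "li": {"li"},
    "dt": {"dt", "dd"},
    "dd": {"dt", "dd"},
    "p": {"p"},
    "div": {"p"},
    "ul": {"p"},
    "option": {"option"},
    "td": {"td", "th"},
    "th": {"td", "th"},
    "tr": {"td", "th", "tr"},
}


def insert_implied_end_tags(tokens):
    """Take (kind, value) tokens, where kind is 'start', 'end' or
    'text', and return the stream with omitted end tags made explicit."""
    stack = []  # names of currently open elements
    out = []
    for kind, name in tokens:
        if kind == "start":
            # Close whatever this start tag implicitly ends.
            closes = IMPLIED_CLOSE.get(name, ())
            while stack and stack[-1] in closes:
                out.append(("end", stack.pop()))
            stack.append(name)
            out.append(("start", name))
        elif kind == "end":
            # Ignore stray end tags; otherwise close everything
            # nested inside the element being ended, then the element.
            if name in stack:
                while stack:
                    top = stack.pop()
                    out.append(("end", top))
                    if top == name:
                        break
        else:
            out.append((kind, name))
    # End of input closes anything still open.
    while stack:
        out.append(("end", stack.pop()))
    return out
```

Only the innermost open element is checked, so a new li inside a nested ul does not close the outer li. That is the containment aspect a flat list of omittable tags cannot capture.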
Also, although the full parsing algorithm is surprisingly complicated even for valid documents, the standard algorithm and real-world HTML parsers are more complicated still, because they try to deal gracefully with web pages that don't conform to the standard.