html parsing xml-parsing html-parsing solidus

When parsing HTML, do I need to watch out for closing tags on self-closing elements that do not need a solidus?

There are certain tags in HTML which can be self closing without a solidus. For example:

<link rel="shortcut icon" href="//www.google.com/favicon.ico">

is valid. As such, neither of these is needed:

<link rel="shortcut icon" href="//www.google.com/favicon.ico"/>

<link rel="shortcut icon" href="//www.google.com/favicon.ico">foo</link>

With these designated tags that do not need the solidus, suppose I come across:

<link rel="shortcut icon" href="//www.google.com/favicon.ico">

Can I assume that a corresponding </link> is not present, or will I need to parse the rest of the document and determine that for myself?

Solution

I understand that the HTML specification is a pretty intimidating document. But I think it would help you to at least read the overview about elements, following any links which seem relevant.

In particular, you will see there that <link> is a void element, about which that section says:

Void elements only have a start tag; end tags must not be specified for void elements.

So your second example, in which the text foo appears to be the content of the element, is actually deceptive. The element is already closed before the text is encountered, and so the text is content of the parent element (if that is possible). The explicit closing tag is an error, and should be ignored.
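To make that concrete, here is a minimal sketch built on Python's standard `html.parser.HTMLParser`. It is not a compliant HTML parser, just an illustration of the rule above: a void element is never pushed onto the stack of open elements, so any text that follows it belongs to the parent, and a stray end tag for a void element is dropped. The names `VOID_ELEMENTS`, `TextOwnerParser` and `find_text_owners` are made up for this example, and the wrapping `<div>` is an assumption added so that the text has a parent.

```python
from html.parser import HTMLParser

# Void elements as listed in the HTML standard at the time of writing.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


class TextOwnerParser(HTMLParser):
    """Records which open element each run of text belongs to."""

    def __init__(self):
        super().__init__()
        self.open_elements = []      # stack of currently open tags
        self.owners = []             # (parent tag or None, text) pairs
        self.ignored_end_tags = []   # stray end tags for void elements

    def handle_starttag(self, tag, attrs):
        # A void element is complete as soon as its start tag is seen,
        # so it never goes onto the stack.
        if tag not in VOID_ELEMENTS:
            self.open_elements.append(tag)

    def handle_startendtag(self, tag, attrs):
        # In HTML the trailing solidus changes nothing: treat <x/> as <x>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            # Error in the input; note it and carry on.
            self.ignored_end_tags.append(tag)
            return
        if tag in self.open_elements:
            # Pop up to and including the matching start tag.
            while self.open_elements.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            parent = self.open_elements[-1] if self.open_elements else None
            self.owners.append((parent, text))


def find_text_owners(html):
    """Return (text owners, ignored void end tags) for an HTML fragment."""
    parser = TextOwnerParser()
    parser.feed(html)
    parser.close()
    return parser.owners, parser.ignored_end_tags
```

Feeding it the second example from the question, wrapped in a `<div>`:

```python
html = '<div><link rel="shortcut icon" href="//www.google.com/favicon.ico">foo</link></div>'
print(find_text_owners(html))  # ([('div', 'foo')], ['link'])
```

The text `foo` is attributed to the `<div>`, not to the `<link>`, and the `</link>` is recorded as ignored. So on seeing a void start tag you do not need to scan ahead for a matching end tag.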

Although void elements don't require self-closing in HTML5, they did need to be self-closed in XHTML, so it is common to see the <…/> syntax.

Note: (The following was written when I was under the impression that a precise HTML parser was desired. But I'll leave it in place, even though it might seem a bit aggressive, because I think it does have some general advice for people who are writing, or attempting to write, HTML parsers.)

I'm aware that the referenced standard is a massive document, sometimes confusing, and always complex. That makes writing an HTML parser a challenge. But there are no short cuts. If you want to write a compliant parser, you must read the standard. If you don't care so much about compliance, you shouldn't ask what constructs are compliant (but then you forfeit the right to complain about content creators who produce non-compliant HTML).

Open source parsers exist, also as libraries, so there is no obvious need to write a new one. On the other hand, nothing will teach you more about the task than writing a parser, and I respect anyone with the commitment to do so. I don't think it's a project I would take on at this point. If you want to do so, start by reading the standard. Also, consider joining the relevant mailing lists or at least following some of the discussions. And best of luck!

P.S.: Another useful resource is the Mozilla Developer Network (MDN) documentation, linked from the WHATWG document. See, for example, its chapter on the <link> element, particularly the technical specifications section.