.net html-parsing htmltidy

Managed (.NET) library with HTML Tidy-like functionality?


Is there an HTML cleaner for .NET that can parse HTML and (for instance) convert it to a more machine-friendly format such as XHTML?

I've tried the HTML Agility Pack, but that fails to correctly parse even fairly simple examples.

To give an example of HTML that should be parsed correctly:

<html><title>test</title>
<body>
    <ul><li>TestElem1
        <li>TestElem2
        <li>TestElem3 List:
            <ul><li>Nested1
                <li>Nested2</li>
                <li>Nested3
            </ul>
        <li>TestElem4
    </ul>
    <p>paragraph 1
    <p>paragraph 2
    <p>paragraph 3
</body></html>

The </li> end tag is optional (see the HTML specification), and so is the </p> end tag. In other words, the above sample should be parsed as:

<html><title>test</title>
<body>
    <ul><li>TestElem1</li>
        <li>TestElem2</li>
        <li>TestElem3 List:
            <ul><li>Nested1</li>
                <li>Nested2</li>
                <li>Nested3</li>
            </ul></li>
        <li>TestElem4</li>
    </ul>
    <p>paragraph 1</p>
    <p>paragraph 2</p>
    <p>paragraph 3</p>
</body></html>
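
To make the rule concrete, here is a rough sketch of the "implied end tag" logic, written in Python only because its standard library ships a tokenizer; it is not a .NET solution and not a replacement for a real HTML parser. The names ImpliedEndTagCloser, close_implied_tags and CLOSES_P are made up for this illustration, and the rule set is a deliberately simplified subset of the specification (only <li> and <p> are handled, and comments and doctypes are dropped).

```python
from html.parser import HTMLParser

# Assumption: a simplified subset of the start tags that implicitly close an open <p>.
CLOSES_P = {"p", "ul", "ol", "li", "div", "table", "pre", "blockquote", "hr",
            "h1", "h2", "h3", "h4", "h5", "h6"}
# Void elements never get an end tag, so they are never pushed on the stack.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class ImpliedEndTagCloser(HTMLParser):
    """Re-emits the markup, inserting the </li> and </p> tags that were omitted."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []    # output fragments
        self.stack = []  # names of currently open elements

    def _close_top(self):
        self.out.append(f"</{self.stack.pop()}>")

    def handle_starttag(self, tag, attrs):
        # An open <p> ends as soon as a block-level element starts.
        if self.stack and self.stack[-1] == "p" and tag in CLOSES_P:
            self._close_top()
        # A new <li> ends the previous <li> of the same list.
        if tag == "li" and self.stack and self.stack[-1] == "li":
            self._close_top()
        self.out.append(self.get_starttag_text())
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: drop it
        # Close everything still open inside the element being ended.
        while self.stack:
            top = self.stack[-1]
            self._close_top()
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def close_implied_tags(html):
    parser = ImpliedEndTagCloser()
    parser.feed(html)
    parser.close()
    while parser.stack:  # close whatever is still open at end of input
        parser._close_top()
    return "".join(parser.out)


if __name__ == "__main__":
    print(close_implied_tags("<ul><li>A<li>B</ul><p>x<p>y"))
    # prints: <ul><li>A</li><li>B</li></ul><p>x</p><p>y</p>
```

Because each end tag is inserted just before the token that triggers it, the whitespace of the output differs from the hand-written sample above, but the element nesting is the same.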

Since the aim is to use the library on various machines, falling back to native code (such as a wrapper around HTML Tidy) is a big disadvantage: it adds deployment hassle, sacrifices platform independence, and is impossible in sandboxed scenarios.

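Any suggestions? To recap, I'm looking for a fully managed .NET library that parses HTML like the sample above correctly and can convert it to a cleaner format such as XHTML, without depending on native code.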


Solution

  • Try TidyManaged.