I have a large block of programmatically generated HTML. I ran it through Tidy (version r938) with the following Java code:
StringReader inStr = new StringReader(htmlInput);
StringWriter outStr = new StringWriter();
Tidy tidy = new Tidy();
tidy.setXHTML(true);
tidy.parseDOM(inStr, outStr);
I get the following output:
InputStream: Document content looks like HTML 4.01 Transitional
247 warnings, 3 errors were found!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
Trouble is, Tidy doesn't tell me what 3 errors it found.
I'm fibbing here a little. The output above actually follows a long list of all 247 warnings (mostly trimming out empty div elements). I can suppress those with tidy.setShowWarnings(false); either way, I see no error report, so I can't figure out what I need to fix. 300 KB of HTML is too much for me to eyeball.
I've tried numerous approaches to finding the error. I can't run it through validate.w3.org, sadly, as the HTML file is on a proprietary network. The most informative approach was to open it in IntelliJ IDEA; this revealed a dozen or so duplicate div IDs, which I fixed. Errors still occurred.
I've looked around for other mentions of this problem. While I find plenty of hits on things like "How can I get the error/warning messages out of the parsed HTML using JTidy?", they all appear to be asking for different things, or assume conditions that simply aren't holding for me. I'm getting warnings just fine, for example; it's the errors I need, and they're not being reported, even if I call setShowErrors(100) or something.
Am I going to have to dive into Tidy's source code and debug it, starting where it reports errors? Or is there something much simpler I could do?
Here's what I ended up doing to track down the errors:
org.w3c.tidy.Report.error() increments lexer.errors; error() is called from many places in the lexer.
lexbuf is a byte array, so your IDE might not show it as text. It might also be large. You probably want to look at the index the lexer had reached within lexbuf. If you have to, take that section of the byte array and cross-reference it with an ASCII table to get the text.
This was much more involved than it probably should have been. I suspect Report.error() was being called inappropriately.
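Rather than reading the bytes off against an ASCII table, the section of lexbuf around the lexer's position can be decoded directly. This is a minimal sketch, not JTidy code: the class LexbufPeek and its method around() are hypothetical names, and it assumes the buffer holds UTF-8 (of which plain ASCII is a subset).

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical debugging helper; not part of JTidy. */
final class LexbufPeek {

    /**
     * Decodes the bytes within `radius` positions of `index` as text.
     * Assumes the buffer is UTF-8 encoded and that `radius` is non-negative.
     * An out-of-range index is clamped to the buffer bounds.
     */
    static String around(byte[] buf, int index, int radius) {
        int clamped = Math.max(0, Math.min(buf.length, index));
        int start = Math.max(0, clamped - radius);
        int end = Math.min(buf.length, clamped + radius);
        return new String(buf, start, end - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for the lexer's buffer; in a real session this would be lexbuf.
        byte[] lexbuf = "<script>if (a < b) { run(); }</script>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(around(lexbuf, 14, 6));
    }
}
```

The same one-line String constructor call can usually be pasted into a debugger's expression evaluator while stopped at the breakpoint, which avoids copying the array out at all.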
In my case, error() was called with the constant BAD_CDATA_CONTENT. This constant is used only by Report.warning(). error() doesn't know what to do with it, and just exits silently with no message at all. If I change the call in Lexer.getCDATA() from error() to warning(), I get the exact line and column of my error. (I also get what appears to be reasonably well-formed XHTML, instead of an empty document.)
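The silent exit described above can be reproduced in miniature. The sketch below is a simplified stand-in, not JTidy's actual Report class: ReportSketch, KNOWN_ERROR, the numeric code values, and the message wording are all made up for illustration. It shows how a switch with no default branch counts an error without saying anything, and how a default branch would make the unhandled code visible.

```java
/** Simplified stand-in for a message reporter. This is NOT JTidy's Report class. */
final class ReportSketch {
    // Hypothetical codes; the numeric values are invented for this example.
    static final short KNOWN_ERROR = 1;
    static final short BAD_CDATA_CONTENT = 2;

    int errors = 0;

    /** The failure mode: the counter goes up, but an unknown code yields no message. */
    String errorWithoutDefault(short code, int line, int column) {
        errors++;
        switch (code) {
            case KNOWN_ERROR:
                return position(line, column) + "Error: known problem";
        }
        return null; // silent exit: nothing to print
    }

    /** Same logic, with a default branch that reports the unhandled code. */
    String errorWithDefault(short code, int line, int column) {
        errors++;
        switch (code) {
            case KNOWN_ERROR:
                return position(line, column) + "Error: known problem";
            default:
                return position(line, column) + "Error: unhandled error code " + code;
        }
    }

    private static String position(int line, int column) {
        return "line " + line + " column " + column + " - ";
    }

    public static void main(String[] args) {
        ReportSketch report = new ReportSketch();
        System.out.println(report.errorWithoutDefault(BAD_CDATA_CONTENT, 120, 7));
        System.out.println(report.errorWithDefault(BAD_CDATA_CONTENT, 120, 7));
        System.out.println(report.errors + " errors counted");
    }
}
```

In both variants the error count increases, which matches the symptom of a summary that reports errors while the listing shows none.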
I'd submit a ticket to the JTidy project with some suggestions, but SourceForge isn't letting me log in for some reason. So, here:
script element; shouldn't have hurt anything. I asked another question about it, just in case.)
Report.error() should have a default case that reports an unhandled error code if it gets one.
Hope this helps anyone else having what I'm guessing is a rather esoteric problem.