
How to fix non-compliant HTML so Expat will parse it (htmltidy not working)


I'm trying to scrape information from http://www.nfl.com/scores (in particular, to find out when a game is over so my computer can stop recording it). I can download the HTML easily enough, and the page makes this claim about standards compliance:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

But

  1. An attempt to parse it with Expat produces the error "not well-formed (invalid token)".

  2. The W3C's online validation service reports 399 errors and 121 warnings.

  3. I ran HTML Tidy (the command is just called tidy) on my Linux system with the -xml option, but tidy reports 56 warnings and 117 errors and is unable to produce a well-formed XML file. The messages look like this:

    line 409 column 122 - Warning: unescaped & or unknown entity "&role"
    ...
    line 409 column 172 - Warning: unescaped & or unknown entity "&tabSeq"
    ...
    line 1208 column 65 - Error: unexpected </td> in <br>
    line 1209 column 57 - Error: unexpected </tr> in <br>
    line 1210 column 49 - Error: unexpected </table> in <br>
    

    But when I check the input, the "unknown entities" appear to be part of a properly quoted URL, so I don't know if a double quote is missing somewhere or what.

I know that there is something out there that can parse this stuff because both Firefox and w3m display something reasonable. What tool will fix the non-compliant HTML so that I can parse it with Expat?


Solution

  • There's a Flash-based, auto-updating scoreboard at the top of nfl.com. Monitoring its network traffic shows that it fetches:

    http://www.nfl.com/liveupdate/scorestrip/ss.xml

    That will probably be a bit easier to parse than the HTML scoreboard.
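
As a rough sketch of what consuming that feed with Expat might look like, the Python below uses the standard xml.parsers.expat bindings to collect the attributes of every element with a given name. The layout of ss.xml is not described here, so the element name "g", the attribute "q", and the value "F" for a finished game are purely hypothetical placeholders, as is the sample document; check the real feed and adjust them.

```python
import urllib.request
import xml.parsers.expat

FEED_URL = "http://www.nfl.com/liveupdate/scorestrip/ss.xml"


def collect_elements(xml_bytes, tag):
    """Parse xml_bytes with Expat and return the attributes of each <tag>."""
    found = []

    def start_element(name, attrs):
        # Expat calls this once per opening tag; attrs is a dict.
        if name == tag:
            found.append(dict(attrs))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xml_bytes, True)  # True marks this as the final chunk
    return found


def fetch_feed(url=FEED_URL):
    """Download the raw feed bytes (not called in the demo below)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


# Hypothetical sample: the element and attribute names are assumptions,
# not the documented structure of the real feed.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<ss><gms><g h="NE" v="NYJ" q="F"/><g h="GB" v="CHI" q="3"/></gms></ss>"""

games = collect_elements(SAMPLE, "g")
finished = [g for g in games if g.get("q") == "F"]
print(finished)
```

To run it against the live feed, pass fetch_feed() to collect_elements in place of SAMPLE. If the feed ever turns out not to be well-formed, Parse raises xml.parsers.expat.ExpatError, which is the same "not well-formed" failure the HTML page triggers.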