I'm trying to scrape information from http://www.nfl.com/scores (in particular, find out when a game is over so my computer can stop recording it). I can download the HTML easily enough, and it makes this claim about compliance with standards:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
But
An attempt to parse it with Expat produces the error not well-formed (invalid token)
.
The W3C's online validation service reports 399 Errors and 121 warnings.
I tried to run HTML tidy (just called tidy
) on my Linux system with the -xml
option, but tidy reports 56 warnings and 117 errors and is unable to recover a good XML file. The errors look like this:
line 409 column 122 - Warning: unescaped & or unknown entity "&role"
...
line 409 column 172 - Warning: unescaped & or unknown entity "&tabSeq"
...
line 1208 column 65 - Error: unexpected </td> in <br>
line 1209 column 57 - Error: unexpected </tr> in <br>
line 1210 column 49 - Error: unexpected </table> in <br>
But when I check the input, the "unknown entities" appear to be part of a properly quoted URL, so I don't know if a double quote is missing somewhere or what.
I know that there is something out there that can parse this stuff because both Firefox and w3m display something reasonable. What tool will fix the non-compliant HTML so that I can parse it with Expat?
There's a Flash-based auto-updating scoreboard thing at the top of nfl.com. Some monitoring of its network traffic finds:
http://www.nfl.com/liveupdate/scorestrip/ss.xml
That will probably be a bit easier to parse than the HTML scoreboard.