I use Prolog's sgml library to extract information from a web page. I load the whole page with this call:
load_structure('file.html', List, [dialect(sgml), shorttag(false), max_errors(-1)])
The system loads the page, but I get some warnings, for instance:
WARNING:SGML2PL(sgml): inserted omitted end-tag for "img"
WARNING:SGML2PL(sgml): inserted omitted end-tag for "br"
WARNING:SGML2PL(sgml): entity "amp" does not exist
How can I eliminate these warnings?
I use this syntax:
get_html_file(FileOrStream, P) :-
    dtd(html, DTD),
    load_structure(FileOrStream, [P],
                   [ dtd(DTD),
                     dialect(sgml),
                     shorttag(false),
                     syntax_errors(quiet),
                     max_errors(-1)
                   ]).
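The option syntax_errors(quiet) should do. Passing the HTML DTD via dtd(DTD) should help as well: it declares img and br as empty elements and defines entities such as amp, which is what these particular warnings complain about.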
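As a shorter variant, here is a minimal sketch assuming a reasonably recent SWI-Prolog, where load_html/3 from library(sgml) should add the HTML DTD, max_errors(-1) and syntax_errors(quiet) as default options. The predicate names get_html_dom/2 and image_sources/2 are made up for illustration, and the xpath query is just one example of extracting something from the result.

```prolog
:- use_module(library(sgml)).
:- use_module(library(xpath)).

% Sketch (hypothetical name): load_html/3 is expected to supply the HTML DTD,
% max_errors(-1) and syntax_errors(quiet) itself, so no options are passed.
% Like get_html_file/2 above, this assumes a single root element.
get_html_dom(Source, Root) :-
    load_html(Source, [Root], []).

% Hypothetical helper: collect the src attribute of every img element.
image_sources(Source, Sources) :-
    get_html_dom(Source, Root),
    findall(Src, xpath(Root, //img(@src), Src), Sources).
```

Calling image_sources('file.html', Sources) should then bind Sources to the list of image URLs without printing the warnings. If a page still produces messages, passing syntax_errors(quiet) explicitly in the option list is the fallback.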
I recall having a hard time parsing old pages with errors. Error handling can get complicated; a more tolerant tool such as TagSoup could help in getting the work done.