html prolog sgml

Sgml returns some warnings


I use Prolog's sgml library to extract information from a web page. I load the whole page with this call:

    load_structure('file.html', List, [dialect(sgml), shorttag(false), max_errors(-1)])

The system loads the page, but I get some warnings, for instance:

    WARNING:SGML2PL(sgml): inserted omitted end-tag for "img"
    WARNING:SGML2PL(sgml): inserted omitted end-tag for "br"
    WARNING:SGML2PL(sgml): entity "amp" does not exist

How can I eliminate these warnings?


Solution

  • I use this predicate:

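    :- use_module(library(sgml)).

    % Parse an HTML file or stream against the HTML DTD, suppressing syntax warnings.
    get_html_file(FileOrStream, P) :-
            dtd(html, DTD),
            load_structure(FileOrStream, [P],
                           [ dtd(DTD),
                             dialect(sgml),
                             shorttag(false),
                             syntax_errors(quiet),
                             max_errors(-1)
                           ]).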
    

    The option syntax_errors(quiet) should do it.

    I recall having a hard time parsing old pages with errors. Error handling can be complicated; a more tolerant tool such as TagSoup could help in getting the work done.
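
    As a rough sketch of how the quiet parse might be used to actually extract something, the following collects the link targets of a page. The predicate name page_links/2 is hypothetical and not part of the sgml library; it assumes SWI-Prolog with library(sgml) and library(xpath) available, and that the page's anchors carry lowercase href attributes after parsing.

    ```prolog
    :- use_module(library(sgml)).
    :- use_module(library(xpath)).

    % page_links(+File, -Links)
    % Hypothetical helper: parse File with the HTML DTD while suppressing
    % syntax warnings, then collect the href of every <a> element found.
    page_links(File, Links) :-
        dtd(html, DTD),
        load_structure(File, DOM,
                       [ dtd(DTD),
                         dialect(sgml),
                         shorttag(false),
                         syntax_errors(quiet),   % silences the SGML2PL warnings
                         max_errors(-1)          % never abort on error count
                       ]),
        findall(Href, xpath(DOM, //a(@href), Href), Links).
    ```

    It could be called as ?- page_links('file.html', Links). Here DOM is left as an unconstrained list rather than [P], so pages that yield more than one top-level node still load.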