qt html-parsing qxmlstreamreader

Parse HTML with the QXmlStreamReader in Qt 5.8 with MSVC 2015


I am trying to extract some data from a web page in Qt. Since QtWebKit is unmaintained, I would like to use QXmlStreamReader, but I get error messages for some web pages.

For example, http://www.google.com gives XML Parse Error "Opening and ending tag mismatch." for this response:

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="http://www.google.de/?gfe_rd=cr&amp;ei=toP_WMrVKoHKXuvxnsAO">here</A>.
</BODY></HTML>

Before the error occurs, the reader reports the start elements HTML, HEAD, meta and TITLE.

Other error messages on valid html pages:

Here is my code:

webpage = new QXmlStreamReader(data);

//emit got_webpage(&QString(data));

QStringList test;

while (!webpage->atEnd() && !webpage->hasError())
{
    QXmlStreamReader::TokenType token = webpage->readNext();

    if (token == QXmlStreamReader::StartDocument)
        continue;

    if (token == QXmlStreamReader::StartElement)
    {
        test << webpage->name().toString();
        /*if (webpage->name() == "H1")
        {
            emit got_webpage(webpage)
        }*/
    }
}

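// Keep the joined string in a named variable: taking the address of the
// temporary returned by join() only compiles as an MSVC extension.
QString joined = test.join("\n");
emit got_webpage(&joined);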

if (webpage->hasError())
{
    // TODO: Error handling...
    qDebug() << "XML Parse Error " << webpage->errorString();
}

webpage->clear();
delete webpage;

Solution

  • As the name suggests, QXmlStreamReader is meant for parsing XML. HTML is not based on XML and does not have to be well-formed XML, so QXmlStreamReader cannot parse it reliably. In the example above, <meta> is never closed, which is valid HTML but not valid XML, so the reader reports a tag mismatch as soon as it reaches </HEAD>.

    That said, if you can convert the HTML into XHTML, you will be able to parse it with QXmlStreamReader. However, Qt has no built-in method of performing this conversion. It is possible to convert arbitrary HTML to XHTML with third-party libraries such as tidylib.
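To see why an XML parser rejects the Google response while accepting an XHTML equivalent, here is a minimal sketch in plain C++ (no Qt) of the start/end tag bookkeeping that XML well-formedness requires. It is not how QXmlStreamReader is implemented. The function name firstTagMismatch is hypothetical, and the scanner is deliberately simplified: it assumes no comments, CDATA, DOCTYPE, or '>' inside attribute values.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Returns an empty string if every start tag is closed in the right order,
// otherwise a short description of the first violation.
// Hypothetical helper for illustration only; names are compared
// case-sensitively, as XML requires.
std::string firstTagMismatch(const std::string& markup)
{
    std::vector<std::string> open;  // stack of currently open element names
    std::size_t pos = 0;

    while ((pos = markup.find('<', pos)) != std::string::npos) {
        const std::size_t end = markup.find('>', pos);
        if (end == std::string::npos)
            return "unterminated tag";

        // Text between '<' and '>', e.g. "meta charset=...", "/HEAD" or "br/".
        const std::string body = markup.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (body.empty())
            return "empty tag";

        const bool isEndTag = body.front() == '/';
        const bool isSelfClosing = body.back() == '/';

        // The element name runs up to the first whitespace or '/'.
        const std::size_t nameStart = isEndTag ? 1 : 0;
        std::size_t nameEnd = nameStart;
        while (nameEnd < body.size()
               && !std::isspace(static_cast<unsigned char>(body[nameEnd]))
               && body[nameEnd] != '/')
            ++nameEnd;
        const std::string name = body.substr(nameStart, nameEnd - nameStart);

        if (isEndTag) {
            if (open.empty())
                return "</" + name + "> has no matching start tag";
            if (open.back() != name)
                return "</" + name + "> found while <" + open.back()
                       + "> is still open";
            open.pop_back();
        } else if (!isSelfClosing) {
            open.push_back(name);  // must be closed later by </name>
        }
    }
    return open.empty() ? std::string()
                        : "<" + open.back() + "> is never closed";
}
```

Called on the response from the question, the function returns "</HEAD> found while <meta> is still open", which matches the point where QXmlStreamReader stops. If the same markup is written XHTML-style with a self-closing <meta ... />, the function returns an empty string. Producing that well-formed form is the job of a converter such as tidylib.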