
Parsing HTML using QWebElement, how to extract an image?


I am struggling to use QWebElement. As an exercise, I would like to capture the "Google" logo from the page http://www.google.com. The image is in <div id="hplogo" ...>, but I don't know how to extract it. How should I use the "doc" QWebElement in the following code? ("CSS selector" is obscure jargon to me.) Thank you.

from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebView
from PyQt4.QtCore import QUrl

app = QApplication([])
view = QWebView()
view.load(QUrl("http://google.com"))
view.show()
doc = view.page().currentFrame().documentElement()   # run this after 'loadFinished'

Solution

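  • To get the URL of the "Google" logo, look the element up once the page has finished loading (for example in a slot connected to view.loadFinished). The string "div#hplogo" is a CSS selector meaning "the div element whose id is hplogo":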

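    from PyQt4.QtCore import QRegExp

    elem = doc.findFirst("div#hplogo")
    qstring = elem.attribute('style')   # the style string holds the url(...) part
    regexp = QRegExp(r"^(.*:)?url\((.*)\)")   # raw string keeps the backslashes intact
    if regexp.indexIn(qstring) > -1:
        imageURL = regexp.capturedTexts()[-1]   # last capture group: the text inside url(...)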
    

    This gives imageURL = "/images/srpr/logo1w.png", a path relative to http://www.google.com. A regexp is needed here because the URL is only one part of the style attribute's string value. To download the image and show it on a label, do:

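    from PyQt4.QtCore import QEventLoop, QUrl
    from PyQt4.QtGui import QImage, QLabel, QPixmap
    from PyQt4.QtNetwork import QNetworkRequest

    request = QNetworkRequest(QUrl("http://www.google.com/images/srpr/logo1w.png"))
    reply = view.page().networkAccessManager().get(request)
    # The download is asynchronous: wait for 'finished' before reading the data.
    loop = QEventLoop()
    reply.finished.connect(loop.quit)
    loop.exec_()
    byte_array = reply.readAll()
    image = QImage()
    image.loadFromData(byte_array)
    label = QLabel()
    label.setPixmap(QPixmap.fromImage(image))
    label.show()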
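
    The regexp step yields a relative path, while the download step needs an absolute URL. The gap between the two can be sketched without Qt, using only the Python standard library. This is a minimal illustration, not the code above: the helper name extract_background_url and the sample style string are made up for the example, and it assumes Python 3 (on Python 2, urljoin lives in the urlparse module).

```python
import re
from urllib.parse import urljoin  # Python 3 location of urljoin


def extract_background_url(style):
    """Return the target of the first url(...) in a CSS style string, or None."""
    match = re.search(r"url\(([^)]*)\)", style)
    if match is None:
        return None
    # Drop surrounding whitespace and optional quotes around the URL.
    return match.group(1).strip().strip("'\"")


# Hypothetical sample of what elem.attribute('style') might contain.
sample_style = "height:95px;background:url(/images/srpr/logo1w.png) no-repeat"

relative_url = extract_background_url(sample_style)
# Resolve the relative path against the address of the page it came from.
absolute_url = urljoin("http://www.google.com/", relative_url)
print(absolute_url)
```

    The resulting string can be wrapped in QUrl(...) and passed to QNetworkRequest instead of a hard-coded address.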