python iframe pyqt pyqt4 qwebkit

How to get the HTML DOM of a webpage and its frames


I would like to get the DOM of a web page after its JavaScript has executed. I would also like to get the content of all the iframes in the page, similar to what Google Chrome's Inspect Element feature shows.

This is my code:

import sys
from PyQt4 import QtGui, QtCore, QtWebKit

class Sp():
  def save(self):
    print ("call")
    data = self.webView.page().currentFrame().documentElement().toInnerXml()
    print(data.encode('utf-8'))
    print ('finished')
  def main(self):
    self.webView = QtWebKit.QWebView()
    self.webView.load(QtCore.QUrl("http://www.w3schools.com/tags/tryit.asp?filename=tryhtml_iframe_scrolling"))
    QtCore.QObject.connect(self.webView,QtCore.SIGNAL("loadFinished(bool)"),self.save)

app = QtGui.QApplication(sys.argv)
s = Sp()
s.main()
sys.exit(app.exec_())

This gives me the HTML of the page, but not the HTML inside the iframes. Is there any way to get the HTML of the iframes?


Solution

  • This is a very hard problem to solve in general.

    The main difficulty is that there is no way to know in advance how many frames a page has. On top of that, each child frame may have its own set of frames, the number of which is also unknown. In theory, there could be an infinite number of nested frames, so that the page never finishes loading (which is hardly an exaggeration for sites that carry a lot of ads).

    Anyway, below is a version of your script which gets the QWebFrame object of each frame as it loads, and shows how you can access some of the things you are interested in. As you will see from the output, there are a lot of "junk" frames inserted by ads and the like, which you will somehow need to filter out.

    import sys, signal
    from PyQt4 import QtGui, QtCore, QtWebKit
    
    class Sp():
        def save(self, ok, frame=None):
            # Called with frame=None for the main frame, or with the
            # child QWebFrame that has just finished loading.
            if frame is None:
                print('main-frame')
                frame = self.webView.page().mainFrame()
            else:
                print('child-frame')
            print('URL: %s' % frame.baseUrl().toString())
            print('METADATA: %s' % frame.metaData())
            print('TAG: %s' % frame.documentElement().tagName())
            print('')
    
        def handleFrameCreated(self, frame):
            # Give each newly created frame its own loadFinished handler.
            frame.loadFinished.connect(lambda: self.save(True, frame=frame))
    
        def main(self):
            self.webView = QtWebKit.QWebView()
            # Connect the signals before loading, so that no frame is missed.
            self.webView.page().frameCreated.connect(self.handleFrameCreated)
            self.webView.page().mainFrame().loadFinished.connect(self.save)
            self.webView.load(QtCore.QUrl("http://www.w3schools.com/tags/tryit.asp?filename=tryhtml_iframe_scrolling"))
    
    # Restore the default SIGINT handler so that Ctrl+C can stop the event loop.
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    print('Press Ctrl+C to quit\n')
    app = QtGui.QApplication(sys.argv)
    s = Sp()
    s.main()
    sys.exit(app.exec_())
    

    NB: it is important that you connect to the loadFinished signal of the main frame rather than that of the web-view. If you connect to the latter, your slot will be called multiple times whenever the page contains more than one frame.
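
Since each child frame may contain further frames, one way to gather the HTML of everything that has loaded so far is to walk the frame tree recursively. The following is only a rough sketch: `collect_frame_html` and its `keep` parameter are hypothetical names, not part of PyQt. It assumes only that each frame object provides `childFrames()`, `toHtml()` and `baseUrl()`, as `QWebFrame` does. The optional `keep` predicate is one possible way to drop "junk" frames, and rejecting a frame also skips everything nested inside it.

```python
def collect_frame_html(frame, keep=None, depth=0):
    """Return a list of (depth, url, html) tuples for `frame` and all of
    its descendants, listing each parent before its children.

    `keep` is an optional (hypothetical) predicate that takes a frame and
    returns False for frames that should be skipped. A skipped frame's
    descendants are skipped as well.
    """
    if keep is not None and not keep(frame):
        return []
    # Record this frame first, then descend into its child frames.
    results = [(depth, frame.baseUrl().toString(), frame.toHtml())]
    for child in frame.childFrames():
        results.extend(collect_frame_html(child, keep, depth + 1))
    return results
```

In the script above, this could be called from `save` with `self.webView.page().mainFrame()` as the starting frame. Bear in mind that it only sees frames which exist at the moment of the call, and a frame that has not finished loading may return incomplete HTML, so the general difficulty described above still applies.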