python-3.x selenium selenium-webdriver browsermob-proxy

Get requests body using selenium and proxies


I want to get the body of a specific subrequest using Selenium behind a proxy.

I'm currently using Python + Selenium + chromedriver. With logging I can get each subrequest's headers but not the body. My logging settings:

caps['loggingPrefs'] = {'performance': 'ALL', 'browser': 'ALL'}

caps['perfLoggingPrefs'] = {"enableNetwork": True, "enablePage": True, "enableTimeline": True}
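
For context, here is a minimal sketch of reading headers back out of the performance log. It assumes the entries have the shape returned by `driver.get_log('performance')`: a list of dicts whose `message` field is a JSON string wrapping one DevTools event. The function name `response_headers_from_perf_log` and the sample entry are hypothetical, written by hand for illustration.

```python
import json


def response_headers_from_perf_log(entries):
    """Map each response URL to its headers.

    `entries` is assumed to look like driver.get_log('performance') output:
    each item's 'message' is a JSON string wrapping one DevTools event.
    """
    headers_by_url = {}
    for entry in entries:
        event = json.loads(entry["message"])["message"]
        if event.get("method") != "Network.responseReceived":
            continue  # only response events carry response headers
        response = event["params"]["response"]
        headers_by_url[response["url"]] = response.get("headers", {})
    return headers_by_url


# Hand-written sample entry standing in for real log output (hypothetical data).
sample = [{
    "level": "INFO",
    "message": json.dumps({
        "message": {
            "method": "Network.responseReceived",
            "params": {
                "requestId": "1000.1",
                "response": {
                    "url": "https://example.com/api",
                    "status": 200,
                    "headers": {"Content-Type": "text/html; charset=utf-8"},
                },
            },
        },
        "webview": "hypothetical-id",
    }),
}]
print(response_headers_from_perf_log(sample))
```

The `Network.responseReceived` event describes the response (URL, status, headers) but does not include the body, which matches what I see: headers yes, body no.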

I know there are several ways to produce a HAR with Selenium, for example with har-export-trigger:

window.foo = HAR.triggerExport().then(harLog => { return(harLog); }); return window.foo;

Unfortunately, I don't see the response body in the returned data.

So the question is: how can I get the body of a specific network response to a request made while a page loads in Selenium, while also using proxies?

UPD: Actually, with har-export-trigger I do get response bodies, but not all of them. The response body I need is JSON, its MIME type is 'text/html; charset=utf-8', and it is missing from the HAR file I generate, so I still have no solution.

UPD2: After further investigation, I realized that the response body is missing even in my desktop Firefox when the har-export-trigger add-on is turned on, so this approach may be a dead end (issue on GitHub).

UPD3: This bug appears only with the latest version of har-export-trigger. With version 0.6.0 everything works fine.

So, for future googlers: you can use har-export-trigger v0.6.0 or the approach from the accepted answer.


Solution

  • I have just finished implementing a Selenium HAR script with the tools you mention in the question. The HARs from both har-export-trigger and BrowserMob were verified with Google's HAR Analyzer.

    A class using Selenium, geckodriver and har-export-trigger:

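    import json
    import os
    
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    from selenium.webdriver.firefox.options import Options as FirefoxOptions
    
    import my_logger  # my own logging helper module, not shown here
    
    
    class MyWebDriver(object):
        # an inner class implementing a custom wait condition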
        class PageIsLoaded(object):
            def __call__(self, driver):
                state = driver.execute_script('return document.readyState;')
                MyWebDriver._LOGGER.debug("checking document state: " + state)
                return state == "complete"
    
        _FIREFOX_DRIVER = "geckodriver"
        # load HAR_EXPORT_TRIGGER extension
        _HAR_TRIGGER_EXT_PATH = os.path.abspath(
            "har_export_trigger-0.6.1-an+fx_orig.xpi")
        _PROFILE = webdriver.FirefoxProfile()
        _PROFILE.set_preference("devtools.toolbox.selectedTool", "netmonitor")
        _CAP = DesiredCapabilities().FIREFOX
        _OPTIONS = FirefoxOptions()
        # add runtime argument to run with devtools opened
        _OPTIONS.add_argument("-devtools")
        _LOGGER = my_logger.get_custom_logger(os.path.basename(__file__))
    
        def __init__(self, log_body=False):
            self.browser = None
            self.log_body = log_body
    
        # return the webdriver instance
        def get_instance(self):
            if self.browser is None:
                self.browser = webdriver.Firefox(capabilities=
                                                 MyWebDriver._CAP,
                                                 executable_path=
                                                 MyWebDriver._FIREFOX_DRIVER,
                                                 firefox_options=
                                                 MyWebDriver._OPTIONS,
                                                 firefox_profile=
                                                 MyWebDriver._PROFILE)
                self.browser.install_addon(MyWebDriver._HAR_TRIGGER_EXT_PATH,
                                           temporary=True)
                MyWebDriver._LOGGER.info("Web Driver initialized.")
            return self.browser
    
        def get_har(self):
            # JSON.stringify has to be called to return as a string
            har_harvest = "myString = HAR.triggerExport().then(" \
                          "harLog => {return JSON.stringify(harLog);});" \
                          "return myString;"
            har_dict = dict()
            har_dict['log'] = json.loads(self.browser.execute_script(har_harvest))
            # remove content body
            if self.log_body is False:
                for entry in har_dict['log']['entries']:
                    temp_dict = entry['response']['content']
                    try:
                        temp_dict.pop("text")
                    except KeyError:
                        pass
            return har_dict
    
        def quit(self):
            self.browser.quit()
            MyWebDriver._LOGGER.warning("Web Driver closed.")
    

    A subclass that adds a BrowserMob proxy, for reference:

    from browsermobproxy import Server
    
    
    class MyWebDriverWithProxy(MyWebDriver):
    
        _PROXY_EXECUTABLE = os.path.join(os.getcwd(), "venv", "lib",
                                         "browsermob-proxy-2.1.4", "bin",
                                         "browsermob-proxy")
    
        def __init__(self, url, log_body=False):
            super().__init__(log_body=log_body)
            self.server = Server(MyWebDriverWithProxy._PROXY_EXECUTABLE)
            self.server.start()
            self.proxy = self.server.create_proxy()
            self.proxy.new_har(url,
                               options={'captureHeaders': True,
                                        'captureContent': self.log_body})
            super()._LOGGER.info("BrowserMob server started")
            super()._PROFILE.set_proxy(self.proxy.selenium_proxy())
    
        def get_har(self):
            return self.proxy.har
    
        def quit(self):
            self.browser.quit()
            self.proxy.close()
            # also stop the BrowserMob server process started in __init__
            self.server.stop()
            MyWebDriver._LOGGER.info("BrowserMob server and Web Driver closed.")
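
To pull one specific response body out of the HAR that either `get_har()` returns, something along these lines may work. It is a minimal sketch, not part of the classes above: `find_response_bodies`, `url_fragment` and the sample HAR are hypothetical names and data. It assumes the standard HAR layout (`log.entries[].response.content.text`, with an optional `encoding` of `base64`) and that bodies were captured (`log_body=True`). It matches on URL rather than MIME type, since the question notes that the JSON body is served as `text/html`.

```python
import base64


def find_response_bodies(har, url_fragment):
    """Return decoded bodies of HAR entries whose request URL contains url_fragment."""
    bodies = []
    for entry in har["log"]["entries"]:
        if url_fragment not in entry["request"]["url"]:
            continue
        content = entry["response"].get("content", {})
        text = content.get("text")
        if text is None:
            continue  # body was not captured for this entry
        if content.get("encoding") == "base64":
            # HAR marks encoded bodies with encoding == "base64"
            text = base64.b64decode(text).decode("utf-8", errors="replace")
        bodies.append(text)
    return bodies


# Hand-written HAR fragment for illustration (hypothetical data).
sample_har = {"log": {"entries": [
    {"request": {"url": "https://example.com/api/items"},
     "response": {"content": {"mimeType": "text/html; charset=utf-8",
                              "text": '{"id": 1}'}}},
    {"request": {"url": "https://example.com/api/blob"},
     "response": {"content": {"mimeType": "application/json",
                              "encoding": "base64",
                              "text": base64.b64encode(b'{"ok": true}').decode("ascii")}}},
    {"request": {"url": "https://example.com/index.html"},
     "response": {"content": {"mimeType": "text/html"}}},
]}}

print(find_response_bodies(sample_har, "/api/"))
```

Entries without a `text` field are skipped rather than treated as errors, because a HAR can list a response without its body, as described in the question.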