scrapy, scrapy-splash, splash-js-render

Splash return embedded response


I am looking to return an embedded response from a website. The site makes it very difficult to reach this response without JavaScript, so I am hoping to use Splash. I am not interested in the rendered HTML, only in one embedded response. Below is a screenshot of the exact response that I am looking to get back from Splash.

[screenshot of the embedded response]

This response returns a JSON object for the site to render. I would like the raw JSON from this response. How do I do this in Lua?


Solution

  • Turns out this is a bit tricky. The following is the kludge I have found to do this:

    Splash call with a Lua script, made from Scrapy:

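    # Requires: from scrapy_splash import SplashRequest (at the top of the spider module)
    # Lua script run by Splash; it returns the HAR log of the whole page load.
    scriptBusinessUnits = """
        function main(splash, args)
            -- Record request and response bodies in the HAR log
            splash.request_body_enabled = true
            splash.response_body_enabled = true
            assert(splash:go(args.url))
            assert(splash:wait(18))
            -- Click the element so the page fetches the embedded response
            splash:runjs('document.getElementById("RESP_INQA_WK_BUSINESS_UNIT$prompt").click();')
            assert(splash:wait(20))
            return {
                har = splash:har(),
            }
        end
    """
    yield SplashRequest(
        url=self.start_urls[0],
        callback=self.parse,
        endpoint='execute',
        magic_response=True,
        meta={'handle_httpstatus_all': True},
        args={'lua_source': scriptBusinessUnits, 'timeout': 90, 'images': 0},
    )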
    

    This script works by returning the HAR file of the whole page load. The key is to set splash.request_body_enabled = true and splash.response_body_enabled = true; without them the HAR file does not contain the actual response content.

    The HAR file is just a glorified JSON object with a different name... so:

    import base64
    import json

    def parse(self, response):
        # The Lua script returned {har = ...}, so the HAR log sits under 'har'
        harData = json.loads(response.text)
        responseData = harData['har']['log']['entries']
        ...
        # Splash appears to base64 encode large content fields,
        # you may have to decode the field to load it properly
        bisData = base64.b64decode(bisData['content']['text'])
    

    From there you can search the JSON object for the exact embedded response.

    I really don't think this is a very efficient method, but it works.
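The search through the HAR entries could look roughly like the sketch below. It is not the exact code from the spider above: `find_har_json` and its `url_fragment` parameter are hypothetical names, and the sketch assumes the entries follow the standard HAR layout (`request.url`, `response.content.text`) and that a base64-encoded body is flagged with the optional HAR field `encoding`. If Splash leaves that field out in your version, decode unconditionally instead.

```python
import base64
import json


def find_har_json(har, url_fragment):
    """Return the parsed JSON body of the first HAR entry whose request URL
    contains url_fragment, or None if no such entry has a captured body.

    `har` is the HAR object itself, i.e. the dict holding the 'log' key.
    """
    for entry in har["log"]["entries"]:
        if url_fragment not in entry["request"]["url"]:
            continue
        content = entry["response"].get("content", {})
        text = content.get("text")
        if not text:
            # Body was not recorded for this entry; keep looking
            continue
        # Assumption: base64 bodies are marked via the HAR 'encoding' field
        if content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8")
        return json.loads(text)
    return None
```

In the parse callback this might be called as find_har_json(json.loads(response.text)['har'], fragment), where fragment is whatever part of the URL uniquely identifies the embedded request on your site.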