
scrapy-splash returns its own headers and not the original headers from the site


I use scrapy-splash to build my spider. I need to maintain a session, so I use scrapy.downloadermiddlewares.cookies.CookiesMiddleware, which handles the Set-Cookie header. I know it does because I set COOKIES_DEBUG=True, which makes CookiesMiddleware print the Set-Cookie headers it receives.

The problem: when I add Splash to the picture, the Set-Cookie printouts disappear. The response headers I get are {'Date': ['Sun, 25 Sep 2016 12:09:55 GMT'], 'Content-Type': ['text/html; charset=utf-8'], 'Server': ['TwistedWeb/16.1.1']}, which come from the Splash rendering engine itself (it uses TwistedWeb) rather than from the site.

Is there any directive that tells Splash to also give me the original response headers?


Solution

  • To get the original response headers you can write a Splash Lua script; see the examples in the scrapy-splash README:

    Use a Lua script to get an HTML response with cookies, headers, body and method set to correct values; lua_source argument value is cached on Splash server and is not sent with each request (it requires Splash 2.1+):

    import scrapy
    from scrapy_splash import SplashRequest
    
    script = """
    function main(splash)
      splash:init_cookies(splash.args.cookies)
      assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
        http_method=splash.args.http_method,
        body=splash.args.body,
        })
      assert(splash:wait(0.5))
    
      local entries = splash:history()
      local last_response = entries[#entries].response
      return {
        url = splash:url(),
        headers = last_response.headers,
        http_status = last_response.status,
        cookies = splash:get_cookies(),
        html = splash:html(),
      }
    end
    """
    
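    class MySpider(scrapy.Spider):
        name = 'myspider'  # placeholder; Scrapy requires a spider name

        # ...

        def start_requests(self):
            # start_requests is one possible place for the request;
            # the README elides the enclosing method
            for url in self.start_urls:
                yield SplashRequest(url, self.parse_result,
                    endpoint='execute',
                    cache_args=['lua_source'],
                    args={'lua_source': script},
                    headers={'X-My-Header': 'value'},
                )

        def parse_result(self, response):
            # here response.body contains result HTML;
            # response.headers are filled with headers from last
            # web page loaded to Splash;
            # cookies from all responses and from JavaScript are collected
            # and put into Set-Cookie response header, so that Scrapy
            # can remember them.
            self.logger.info('Set-Cookie: %s', response.headers.getlist('Set-Cookie'))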
    

    scrapy-splash also provides built-in helpers for cookie handling; they are enabled in this example as soon as scrapy-splash is configured as described in the README.
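
For reference, the configuration mentioned above can be sketched roughly as follows. The middleware names and priorities are taken from the scrapy-splash README as I remember it, so check them against the version you have installed. The SPLASH_URL value is an assumption (a Splash instance on the local machine, default port), and SPLASH_COOKIES_DEBUG is, per the README, the Splash counterpart of COOKIES_DEBUG.

```python
# settings.py (a sketch, not a definitive configuration)

# Assumption: Splash is running locally on its default port.
SPLASH_URL = 'http://localhost:8050'

# SplashCookiesMiddleware must come before SplashMiddleware so that
# cookies are passed to Splash and read back from its responses.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Needed for cache_args (e.g. caching lua_source on the Splash server).
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware duplicate filter and HTTP cache storage.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Log cookies handled by SplashCookiesMiddleware while debugging.
SPLASH_COOKIES_DEBUG = True
```

With this in place, a SplashRequest to the execute endpoint whose Lua script returns headers and cookies (as in the example above) should give the callback the site's own headers instead of the TwistedWeb ones.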