python scrapy scrapy-splash splash-js-render

Getting a response body with scrapy splash



I'm working with Scrapy 1.6 and Splash 3.2. I have:

import scrapy
import random
from scrapy_splash import SplashRequest
from scrapy.utils.response import open_in_browser
from scrapy.linkextractors import LinkExtractor

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:48.0) Gecko/20100101 Firefox/48.0'

class MySpider(scrapy.Spider):
    start_urls = ["http://yahoo.com"]
    name = 'mytest'

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint='render.html',
                args={'wait': 2.5},
                headers={
                    'User-Agent': USER_AGENT,
                    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                },
            )

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # from scrapy.http.response.html import HtmlResponse
        # ht = HtmlResponse('jj')
        # ht.body.replace =response
        open_in_browser(response)
        return None

The problem is that when I try to open the response in the browser, it opens in Notepad instead.

I have been looking at https://splash.readthedocs.io/en/stable/scripting-response-object.html. How do I get at response.body so that I can open the response in a browser? I want to use the browser's dev tools on it to work out XPaths.


Solution

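  • open_in_browser() does not recognize responses from Splash as HTML responses. This is because Splash HTML response objects are subclasses of Scrapy’s TextResponse instead of HtmlResponse (for now). open_in_browser() only saves the body with a .html extension for an HtmlResponse; any other TextResponse is saved as a .txt file, which is why your system opens it in Notepad.

    For the time being, you could reimplement open_in_browser() in a way that works for your use case.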
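One way such a reimplementation might look is sketched below. It is only a sketch under a few assumptions: the helper name open_html_in_browser and its opener parameter are made up for illustration (they are not part of Scrapy or scrapy-splash), and it assumes the Splash response body is the rendered HTML as bytes, which is what the render.html endpoint returns. It uses only the standard library, so it takes the body and URL rather than the response object itself.

```python
import os
import tempfile
import webbrowser
from pathlib import Path


def open_html_in_browser(body: bytes, url: str, opener=webbrowser.open):
    """Save rendered HTML to a temporary .html file and open it in a browser.

    Hypothetical helper, not part of Scrapy or scrapy-splash.
    """
    # Add a <base> tag (if the page has none) so relative links and assets
    # resolve against the original URL instead of the local temp directory.
    if b"<base" not in body:
        base_tag = b'<head><base href="' + url.encode("utf-8") + b'">'
        body = body.replace(b"<head>", base_tag, 1)

    # The .html suffix is what lets the OS pick a browser instead of a text editor.
    fd, path = tempfile.mkstemp(suffix=".html")
    with os.fdopen(fd, "wb") as f:
        f.write(body)

    # Convert the absolute temp path to a file:// URI and hand it to the opener.
    return opener(Path(path).as_uri())
```

In the spider above, parse() would then call open_html_in_browser(response.body, response.url) in place of open_in_browser(response). The opener argument is only there so the function can be exercised without launching a real browser.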