
scrapy-splash active content selector works in shell but not with spider


I just started using scrapy-splash to retrieve the number of bookings from opentable.com. The following works fine in the shell:

$ scrapy shell 'http://localhost:8050/render.html?url=https://www.opentable.com/new-york-restaurant-listings&timeout=10&wait=0.5'    
...

In [1]: response.css('div.booking::text').extract()
Out[1]: 
['Booked 59 times today',
 'Booked 20 times today',
 'Booked 17 times today',
 'Booked 29 times today',
 'Booked 29 times today',
  ... 
]
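
The shell command works because it calls Splash's HTTP API directly: the target page and the rendering options are passed as query parameters to `render.html`. As a minimal sketch of how such a URL is put together, assuming Splash listens on `http://localhost:8050` as in the command above (`splash_render_url` is a hypothetical helper name, not part of Splash or scrapy-splash):

```python
from urllib.parse import urlencode


def splash_render_url(target, splash="http://localhost:8050", **args):
    """Build a render.html URL for the Splash HTTP API (hypothetical helper)."""
    # The target page goes in the `url` parameter; rendering options
    # such as `wait` and `timeout` follow as further query parameters.
    query = urlencode({"url": target, **args})
    return f"{splash}/render.html?{query}"


if __name__ == "__main__":
    print(splash_render_url(
        "https://www.opentable.com/new-york-restaurant-listings",
        timeout=10,
        wait=0.5,
    ))
```

Unlike the hand-written command, `urlencode` percent-encodes the target URL, so characters such as `:` and `/` cannot be confused with the query string of the Splash URL itself.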

However, this simple spider returns an empty list:

import scrapy
from scrapy_splash import SplashRequest


class TableSpider(scrapy.Spider):
    name = 'opentable'
    start_urls = ['https://www.opentable.com/new-york-restaurant-listings']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='render.html',
                                args={'wait': 1.5},
                                )

    def parse(self, response):
        yield {'bookings': response.css('div.booking::text').extract()}

when invoked with:

$ scrapy crawl opentable
...
DEBUG: Scraped from <200 https://www.opentable.com/new-york-restaurant-listings>
{'bookings': []}

I have already tried running Splash with private mode disabled:

docker run -it -p 8050:8050 scrapinghub/splash --disable-private-mode

and increasing the wait times, both without success.


Solution

  • I think the problem is in your middleware configuration: the scrapy-splash downloader middlewares have to be enabled so that requests go through Splash. First, add these settings:

    # settings.py
    
    # uncomment `DOWNLOADER_MIDDLEWARES` and add these entries to it
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    
    # URL of the Splash server
    SPLASH_URL = 'http://localhost:8050'
    
    # Splash-aware duplicate filter and HTTP cache storage
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
    

    Then start Splash in Docker:

    sudo docker run -it -p 8050:8050 scrapinghub/splash --disable-private-mode
    

    After following all of these steps, I get:

    scrapy crawl opentable
    
    ...
    
    2018-06-23 11:23:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.opentable.com/new-york-restaurant-listings via http://localhost:8050/render.html> (referer: None)
    2018-06-23 11:23:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.opentable.com/new-york-restaurant-listings>
    {'bookings': [
        'Booked 44 times today',
        'Booked 24 times today',
        ...
    ]}
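
To catch this kind of misconfiguration before crawling, the settings can be checked up front. The sketch below is only an illustration under stated assumptions: `REQUIRED_MIDDLEWARES` and `missing_splash_settings` are hypothetical names, not part of scrapy-splash, and the function takes a plain dict rather than a real Scrapy settings object. Without `SplashMiddleware`, a `SplashRequest` is most likely downloaded straight from the target site, so the response holds the unrendered HTML and the selector matches nothing.

```python
# Hypothetical helper; not part of scrapy or scrapy-splash.
REQUIRED_MIDDLEWARES = (
    "scrapy_splash.SplashCookiesMiddleware",
    "scrapy_splash.SplashMiddleware",
)


def missing_splash_settings(settings):
    """Return the Splash-related settings that are absent from `settings`."""
    missing = []
    # SplashMiddleware needs to know where the Splash server lives.
    if not settings.get("SPLASH_URL"):
        missing.append("SPLASH_URL")
    # Both downloader middlewares from the answer above must be enabled.
    middlewares = settings.get("DOWNLOADER_MIDDLEWARES") or {}
    for path in REQUIRED_MIDDLEWARES:
        if path not in middlewares:
            missing.append(path)
    return missing


if __name__ == "__main__":
    # An empty configuration is missing everything the answer adds.
    print(missing_splash_settings({}))
```

The crawl log gives the same hint: when the middlewares are active, the `Crawled (200)` line includes `via http://localhost:8050/render.html`, as in the output above.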