
Scrapy selector not working on Splash response

I'm trying to scrape some dynamic content using Scrapy. I have successfully set up Splash to work with it. However, the selectors in the following spider yield empty results:

# -*- coding: utf-8 -*- 

import scrapy
from scrapy.selector import Selector
from scrapy_splash import SplashRequest

class CartierSpider(scrapy.Spider):
  name = 'cartier'
  start_urls = ['']

  def start_requests(self):
    for url in self.start_urls:
      yield SplashRequest(url, self.parse, args={'wait': 0.5})

  def parse(self, response):
    yield {
      'title': response.xpath('//title').extract(),
      'link': response.url,
      'productID': Selector(text=response.body).xpath('//span[@itemprop="productID"]/text()').extract(),
      'model': Selector(text=response.body).xpath('//span[@itemprop="model"]/text()').extract(),
      'price': Selector(text=response.body).css('div.price-wrapper').xpath('.//span[@itemprop="price"]/text()').extract(),
    }

The selectors work just fine using the Scrapy shell, so I'm very confused about what is not working.

The only difference I can find between the two situations is that the string response.body is encoded differently: it's just gibberish if I try to print or decode it from within the parse function.

Any hint or reference would be greatly appreciated.


  • Your spider works fine for me with Scrapy 1.1 and Splash 2.1, with no modification to the code in your question, just using settings suggested in
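
The settings referred to above are not reproduced in this thread. As a rough sketch only, assuming they are the ones recommended in the scrapy-splash README, a project's settings.py might contain something like the following. The Splash address is taken from the crawl log in this answer; the middleware names and priorities are the README's usual recommendations, not something confirmed here.

```python
# settings.py (sketch; assumed to mirror the scrapy-splash README,
# not copied from this thread)

# Address of the running Splash instance. localhost:8050 matches the
# "via http://localhost:8050/render.html" line in the crawl log.
SPLASH_URL = 'http://localhost:8050'

# Splash middlewares run before HttpCompressionMiddleware, whose priority
# is moved to 810 so that Splash responses are processed correctly.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Avoids sending duplicate Splash arguments over and over.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
```

The README additionally suggests setting DUPEFILTER_CLASS to 'scrapy_splash.SplashAwareDupeFilter'; whether that was set for the run shown here is not known.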

    As others have mentioned, your parse function can be simplified by calling response.css() and response.xpath() directly, without rebuilding a Selector from the response.

    I tried with:

    import scrapy
    from scrapy_splash import SplashRequest


    class CartierSpider(scrapy.Spider):
      name = 'cartier'
      start_urls = ['']

      def start_requests(self):
        # Route every start URL through Splash so the page is rendered first.
        for url in self.start_urls:
          yield SplashRequest(url, self.parse, args={'wait': 0.5})

      def parse(self, response):
        # Query the response directly; no need to build a Selector by hand.
        yield {
          'title': response.xpath('//title/text()').extract_first(),
          'link': response.url,
          'productID': response.xpath('//span[@itemprop="productID"]/text()').extract_first(),
          'model': response.xpath('//span[@itemprop="model"]/text()').extract_first(),
          'price': response.css('div.price-wrapper').xpath('.//span[@itemprop="price"]/text()').extract_first(),
        }

    and got this:

    $ scrapy crawl cartier
    2016-06-08 17:16:08 [scrapy] INFO: Scrapy 1.1.0 started (bot: stack37701774)
    2016-06-08 17:16:08 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'stack37701774.spiders', 'SPIDER_MODULES': ['stack37701774.spiders'], 'BOT_NAME': 'stack37701774'}
    2016-06-08 17:16:08 [scrapy] INFO: Enabled downloader middlewares:
    2016-06-08 17:16:08 [scrapy] INFO: Enabled spider middlewares:
    2016-06-08 17:16:08 [scrapy] INFO: Enabled item pipelines:
    2016-06-08 17:16:08 [scrapy] INFO: Spider opened
    2016-06-08 17:16:08 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-06-08 17:16:08 [scrapy] DEBUG: Telnet console listening on
    2016-06-08 17:16:11 [scrapy] DEBUG: Crawled (200) <GET via http://localhost:8050/render.html> (referer: None)
    2016-06-08 17:16:11 [scrapy] DEBUG: Scraped from <200>
    {'model': u'Ballon Bleu de Cartier watch', 'productID': u'W69017Z4', 'link': '', 'price': None, 'title': u'CRW69017Z4 - Ballon Bleu de Cartier watch - 36 mm, steel, leather - Cartier'}
    2016-06-08 17:16:11 [scrapy] INFO: Closing spider (finished)
    2016-06-08 17:16:11 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 618,
     'downloader/request_count': 1,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 213006,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2016, 6, 8, 15, 16, 11, 201281),
     'item_scraped_count': 1,
     'log_count/DEBUG': 3,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'splash/render.html/request_count': 1,
     'splash/render.html/response_count/200': 1,
     'start_time': datetime.datetime(2016, 6, 8, 15, 16, 8, 545105)}
    2016-06-08 17:16:11 [scrapy] INFO: Spider closed (finished)
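
On the question's remark that response.body prints as gibberish inside parse: one possible cause, which this thread does not confirm, is a body that is still gzip-compressed when it reaches the spider, for example because of middleware ordering. The sketch below uses only the standard library (Python 3) to show how such a body could be recognised and decoded. The helper name readable_body is hypothetical and not part of Scrapy or scrapy-splash.

```python
import gzip

GZIP_MAGIC = b'\x1f\x8b'  # the first two bytes of every gzip stream


def readable_body(body):
    """Return body decompressed if it looks gzip-compressed, else unchanged.

    Hypothetical helper for debugging; not a Scrapy API.
    """
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body


# Stand-in for a page fragment like the one the spider targets.
html = b'<span itemprop="productID">W69017Z4</span>'
compressed = gzip.compress(html)

print(readable_body(compressed) == html)  # True: compressed body is restored
print(readable_body(html) == html)        # True: plain body passes through
```

If the body does turn out to start with those two bytes, that would point at configuration rather than at the selectors themselves.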