I'm new to scrapy and I can't seem to figure out why I'm having this problem when I run my code. I coded this from a simple tutorial and then added Splash. Splash is up and running.
This is the code:
livros.py
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from olx.items import OlxItem
from scrapy_splash import SplashRequest
class LivrosSpider(CrawlSpider):
name = 'livros'
allowed_domains = ['www.olx.pt']
start_urls = ['https://www.olx.pt/lazer/livros-revistas/historia/']
rules = (
Rule(LinkExtractor(allow=(), restrict_css=('.pageNextPrev',)),
callback="parse_item",
follow=True),)
def parse_item(self, response):
item_links = response.css('.large > .detailsLink::attr(href)').extract()
for a in item_links:
yield SplashRequest(a, callback=self.parse_detail_page)
def parse_detail_page(self, response):
title = response.css('h1::text').extract()[0].strip()
price = response.css('.pricelabel > strong::text').extract()[0]
item = OlxItem()
item['title'] = title
item['price'] = price
item['url'] = response.url
yield item
items.py
import scrapy
class OlxItem(scrapy.Item):
# define the fields for your item here like:
title = scrapy.Field()
price = scrapy.Field()
url = scrapy.Field()
settings.py
BOT_NAME = 'olx'
SPIDER_MODULES = ['olx.spiders']
NEWSPIDER_MODULE = 'olx.spiders'
FEED_URI = 'data/%(name)s/%(time)s.json'
FEED_FORMAT = 'json'
ROBOTSTXT_OBEY = True
#ScrapySplash settings
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = { 'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = { 'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
And below is the error that I keep getting on the terminal:
2018-05-15 07:47:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.olx.pt/lazer/livros-revistas/historia/?page=7> (referer: https://www.olx.pt/lazer/livros-revistas/historia/?page=6)
2018-05-15 07:47:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.olx.pt/lazer/livros-revistas/historia/?page=8> (referer: https://www.olx.pt/lazer/livros-revistas/historia/?page=7)
2018-05-15 07:47:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.olx.pt/lazer/livros-revistas/historia/?page=9> (referer: https://www.olx.pt/lazer/livros-revistas/historia/?page=8)
2018-05-15 07:47:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.olx.pt/lazer/livros-revistas/historia/?page=10> (referer: https://www.olx.pt/lazer/livros-revistas/historia/?page=9)
2018-05-15 07:47:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.olx.pt/lazer/livros-revistas/historia/?page=11> (referer: https://www.olx.pt/lazer/livros-revistas/historia/?page=10)
2018-05-15 07:47:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.olx.pt/lazer/livros-revistas/historia/?page=12> (referer: https://www.olx.pt/lazer/livros-revistas/historia/?page=11)
In the end the program was supposed to save the data onto a json file, but the file always come out blank. Can you help me figure out what am I missing?
Below change works for me .x-large
vs .large
:
def parse_item(self, response):
item_links = response.css('.x-large > .detailsLink::attr(href)').extract()
for a in item_links:
yield SplashRequest(a, callback=self.parse_detail_page)