I am receiving a 504 Gateway Timeout error while using Splash with Scrapy. I am learning Splash and was trying to crawl https://www.lazada.com.my/
Could you help me, please?
Splash is running in a Docker container on port 8050.
Spider file:
import scrapy
from scrapy_splash import SplashRequest


class LaptopSpider(scrapy.Spider):
    name = 'laptop'
    allowed_domains = ['www.lazada.com.my']

    def start_requests(self):
        url = 'https://www.lazada.com.my/shop-laptops/?spm=a2o4k.home.cate_2.2.75f82e7eO7Jbgl'
        yield SplashRequest(url=url)

    def parse(self, response):
        # Keep the rows as selectors; getall() would return plain strings,
        # which cannot be queried with xpath() again.
        all_rows = response.xpath("//div[@class='_17mcb']/div")
        for row in all_rows:
            title = row.xpath(".//div/div/div[2]/div[2]/a/text()").get()
            yield {
                'title': title,
            }
Settings:
BOT_NAME = 'lazada'
SPIDER_MODULES = ['lazada.spiders']
NEWSPIDER_MODULE = 'lazada.spiders'
ROBOTSTXT_OBEY = False
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
The URL you are trying to scrape takes a long time to load. Even in a browser, you will notice it keeps spinning for a while before the page fully loads.
Splash therefore times out before the page is fully loaded and returned, and Scrapy reports a 504.
You need to do two things.
First, increase the maximum allowed timeout when starting the Splash server, like below:
docker run -p 8050:8050 scrapinghub/splash --max-timeout 3600
Second, in the spider, pass a timeout value (in seconds) that is less than or equal to the server's --max-timeout:
yield SplashRequest(url=url, args={"timeout": 3000})
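Putting the two steps together, the spider's request might look like the sketch below. The specific numbers (a 90-second per-request timeout, a 5-second wait for JavaScript to settle) are illustrative assumptions, not tested values; the only hard rule is that the per-request timeout must not exceed the server's --max-timeout.

```python
# Sketch of the Splash arguments for the request; values are assumptions.
MAX_TIMEOUT = 3600  # must match the --max-timeout passed to docker run

splash_args = {
    "timeout": 90,  # per-request render budget in seconds; must be <= MAX_TIMEOUT
    "wait": 5,      # extra seconds to let the page's JavaScript finish loading
}

# Sanity check: Splash rejects requests whose timeout exceeds --max-timeout.
assert splash_args["timeout"] <= MAX_TIMEOUT

# Inside start_requests() you would then yield:
# yield SplashRequest(url=url, callback=self.parse, args=splash_args)
```

A shorter timeout with a "wait" is often enough for slow but eventually-loading pages; reserve very large timeouts for pages that genuinely never stop loading resources.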