python, web-scraping, lua, scrapy, scrapy-splash

Scrapy Splash ERROR: Gave up retrying 504 Gateway Time-out


I am receiving a 504 Gateway Time-out error while using Splash with Scrapy. I am learning Splash and was trying to crawl https://www.lazada.com.my/

Could you help me, please?

Splash is running in a Docker container on port 8050.

spider file

import scrapy
from scrapy_splash import SplashRequest

class LaptopSpider(scrapy.Spider):
    name = 'laptop'
    allowed_domains = ['www.lazada.com.my']

    def start_requests(self):
        url='https://www.lazada.com.my/shop-laptops/?spm=a2o4k.home.cate_2.2.75f82e7eO7Jbgl'
        yield SplashRequest(url=url)

    def parse(self, response):
        # Keep the rows as selectors (not strings) so further XPath
        # queries can be run on each one.
        all_rows = response.xpath("//div[@class='_17mcb']/div")
        for row in all_rows:
            title = row.xpath(".//div/div/div[2]/div[2]/a/text()").get()
            yield {
                'title': title
            }

settings

BOT_NAME = 'lazada'
SPIDER_MODULES = ['lazada.spiders']
NEWSPIDER_MODULE = 'lazada.spiders'
ROBOTSTXT_OBEY = False
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

scrapy crawl laptop


Solution

  • The URL you are trying to scrape takes a long time to load. Even if you open it in a browser, you will notice that it takes a while to fully load and stop spinning.

    Splash therefore times out before the page is fully loaded and returned.

    You need to do two things.

    First, increase the maximum timeout value when starting the Splash server, like below (a quick check at the end of this answer confirms the new limit took effect).

    docker run -p 8050:8050 scrapinghub/splash --max-timeout 3600
    

    Second, in the spider, provide a timeout value that is less than or equal to the max-timeout value of the Splash server, as in the fuller sketch below.

    yield SplashRequest(url=url, args={"timeout": 3000})
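
    Putting the two pieces together, the request in the spider might look like the sketch below. The wait value of 5 seconds is an assumption about how long the page needs to finish rendering; wait is a standard Splash render argument that pauses rendering so JavaScript content can load.

    import scrapy
    from scrapy_splash import SplashRequest

    class LaptopSpider(scrapy.Spider):
        name = 'laptop'
        allowed_domains = ['www.lazada.com.my']

        def start_requests(self):
            url = 'https://www.lazada.com.my/shop-laptops/'
            yield SplashRequest(
                url=url,
                callback=self.parse,
                args={
                    'timeout': 3000,  # must stay at or below the server's --max-timeout
                    'wait': 5,        # give the JavaScript-rendered listings time to load
                },
            )

        def parse(self, response):
            for row in response.xpath("//div[@class='_17mcb']/div"):
                yield {'title': row.xpath(".//div/div/div[2]/div[2]/a/text()").get()}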
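
    If you want to confirm that the higher server-side limit took effect, Splash's render.html endpoint can also be called directly; a request whose timeout exceeds the server's --max-timeout is rejected, while one within the limit is accepted. A rough check (the URL and values here are only illustrative):

    import requests

    resp = requests.get(
        'http://localhost:8050/render.html',
        params={
            'url': 'https://www.lazada.com.my/shop-laptops/',
            'timeout': 300,  # above the default 90s cap, so it only works after raising --max-timeout
            'wait': 5,       # extra seconds for JavaScript to finish
        },
    )
    print(resp.status_code)  # 200 on success; 400 if the timeout exceeds max-timeout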