python scrapy zyte

Scrapy spider works locally but returns a 403 error when running on Zyte


The spider is set up so that it reads the links to scrape, then makes a POST request, and finally parses the returned data.

The spider collects data when run locally, but when deployed to Zyte it fails with the error shown below.

```
# First request: store lookup for the given zip code.
yield scrapy.Request(
    url=STORE_URL.format(zip_code),
    headers=headers_1,
    meta={"item_id": item_id, "zip_code": zip_code},
    dont_filter=True,
    callback=self.parse_a,
)
```

```
# Second request: POST the JSON payload to the API.
yield scrapy.Request(
    url=API_URL,
    method="POST",
    headers=headers,
    body=json.dumps(payload(item_id, zip_code, store_id)),
    meta={"prod_code": item_id, "zip_code": zip_code},
    dont_filter=True,
    callback=self.parse,
)
```

```
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
```

```
2023-06-18 03:10:58 INFO    [scrapy.extensions.telnet] Telnet console listening on 0.0.0.0:6023
2023-06-18 03:10:58 INFO    [scrapy.spidermiddlewares.httperror] Ignoring response <403 https://www.homedepot.com/StoreSearchServices/v2/storesearch?address=30308&radius=50&pagesize=30>: HTTP status code is not handled or not allowed
2023-06-18 03:11:04 INFO    [scrapy.spidermiddlewares.httperror] Ignoring response <403 https://www.homedepot.com/StoreSearchServices/v2/storesearch?address=2125&radius=50&pagesize=30>: HTTP status code is not handled or not allowed
2023-06-18 03:11:11 INFO    [scrapy.spidermiddlewares.httperror] Ignoring response <403 https://www.homedepot.com/StoreSearchServices/v2/storesearch?address=60607&radius=50&pagesize=30>: HTTP status code is not handled or not allowed
```

Solution

  • Short Answer:

    Use a proxy. Check out Zyte Proxy Manager, ScrapingBee, or Bright Data.

    Long Answer:

    An HTTP 403 status code means access is forbidden. If your scraper works locally but not in the cloud, the most likely cause is that the target site is blocking Zyte's IP addresses.
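To confirm that the failures on Zyte really are blocks, it can help to let 403 responses reach the callback instead of having them dropped. The following is a minimal sketch, assuming the spider sets Scrapy's `handle_httpstatus_list = [403]` attribute so that the callback receives those responses. `BLOCK_STATUSES` and `describe_block` are hypothetical names, and treating 429 as a block signal is an assumption.

```python
from types import SimpleNamespace

# Status codes treated as "probably blocked" (an assumption; tune per site).
BLOCK_STATUSES = {403, 429}


def describe_block(response):
    """Return a log message if the response looks blocked, otherwise None.

    `response` only needs `.status` and `.url`, which Scrapy responses provide.
    """
    if response.status in BLOCK_STATUSES:
        return f"Blocked ({response.status}) on {response.url}"
    return None


# Stand-in for a Scrapy response, so the sketch runs without a crawl.
fake = SimpleNamespace(
    status=403,
    url="https://www.homedepot.com/StoreSearchServices/v2/storesearch"
        "?address=30308&radius=50&pagesize=30",
)
message = describe_block(fake)
print(message)
```

Inside a spider callback, the same helper could be called with the real `response`, logging the message with `self.logger.warning(...)` and returning early when it is not `None`.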

    Your local run works because you are using a residential IP from your Internet Service Provider. Residential IPs have a good reputation with most websites, so you can scrape from your own machine.

    Zyte, by contrast, uses datacenter IPs, which is where most scrapers run. Many websites flag IPs in this category as bots and block their access.

    The solution in your case is to access the website from behind a trustworthy IP.

    With a proxy in place, the target website does not know that the request actually originates from the Zyte scraper, because the IP it sees belongs to the proxy.
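As a rough illustration of routing a request through a proxy: Scrapy's built-in `HttpProxyMiddleware` reads the proxy address from the `proxy` key of the request's `meta`. The sketch below only builds that `meta` dict. The helper names `build_proxy_url` and `with_proxy`, the host `proxy.example.com`, the port, and the credentials are all placeholders, not values from any real provider.

```python
from urllib.parse import quote


def build_proxy_url(host, port, user=None, password=None, scheme="http"):
    """Build a proxy URL, percent-encoding the credentials if given."""
    auth = ""
    if user is not None:
        auth = quote(user, safe="")
        if password is not None:
            auth += ":" + quote(password, safe="")
        auth += "@"
    return f"{scheme}://{auth}{host}:{port}"


def with_proxy(meta, proxy_url):
    """Return a copy of a request meta dict with the 'proxy' key set."""
    updated = dict(meta)
    updated["proxy"] = proxy_url
    return updated


# Placeholder host and credentials; substitute your provider's values.
PROXY_URL = build_proxy_url("proxy.example.com", 8000, "user", "p@ss")
meta = with_proxy({"item_id": "123", "zip_code": "30308"}, PROXY_URL)
print(meta["proxy"])
```

In the spider from the question, the result would be passed as `meta=with_proxy({"item_id": item_id, "zip_code": zip_code}, PROXY_URL)` when yielding each `scrapy.Request`.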

    Many proxy providers offer trusted IPs, but you may have to try them one by one to see which is not blocked by your target website and which fits your budget.
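Trying candidates one by one can be scripted before touching the spider. This is only a sketch using the standard library: `fetch_status_via_proxy` and `first_unblocked_proxy` are hypothetical helper names, and treating a 200 status as "not blocked" is an assumption, since some sites return a 200 page containing a challenge.

```python
import urllib.error
import urllib.request


def fetch_status_via_proxy(url, proxy_url, timeout=15):
    """Fetch `url` through `proxy_url` and return the HTTP status code."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 403 when the proxy's IP is blocked


def first_unblocked_proxy(url, proxy_urls, fetch_status=fetch_status_via_proxy):
    """Return the first proxy that yields a 200 for `url`, or None."""
    for proxy_url in proxy_urls:
        try:
            status = fetch_status(url, proxy_url)
        except OSError:
            continue  # proxy unreachable or timed out; try the next one
        if status == 200:
            return proxy_url
    return None
```

The `fetch_status` parameter exists so the selection logic can be exercised with a stub instead of real network calls. In practice you would call `first_unblocked_proxy(target_url, candidate_proxy_urls)` with one URL per provider under evaluation.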

    You can check how to integrate a proxy in Scrapy in this question or, if you use Zyte's proxy manager, check this
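For Zyte's proxy manager, the integration is normally done in `settings.py` through the `scrapy-zyte-smartproxy` plugin. The fragment below is a sketch based on that plugin's documented settings as I understand them. The API key is a placeholder, and the setting names and middleware priority should be checked against the current plugin documentation, since they may have changed.

```python
# settings.py (fragment); assumes `pip install scrapy-zyte-smartproxy`.

# Enable the plugin's downloader middleware.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware": 610,
}

# Route requests through the proxy manager.
ZYTE_SMARTPROXY_ENABLED = True

# Placeholder; use the key from your own Zyte account.
ZYTE_SMARTPROXY_APIKEY = "<your-api-key>"
```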