web-scraping, bots

How to ensure that a bot/scraper does not get blocked


I coded a simple scraper whose job is to go to several different pages of a site, do some parsing, call some URLs that are otherwise called via AJAX, and store the data in a database.

The trouble is that sometimes my IP is blocked after my scraper executes. What steps can I take so that my IP does not get blocked? Are there any recommended practices? I have added a 5-second gap between requests, to almost no effect. The site is medium-big (I need to scrape several URLs) and my internet connection is slow, so the script runs for over an hour. Would being on a faster connection (like on a hosting service) help?

Basically, I want to code a well-behaved bot.

Lastly, I am not POSTing or spamming.

Edit: I think I'll break my script into 4-5 parts and run them at different times of the day.
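
As a rough illustration of that plan, here is a minimal sketch of how the URL list could be split into parts, with each run scraping only one part. The names (URLS, scrape_page, N_PARTS) are placeholders, not anything from the original script:

    # Hypothetical sketch: split the full URL list into N_PARTS chunks and
    # scrape only the chunk given on the command line, so each part can be
    # run at a different time of day (e.g. via separate cron entries).
    import sys
    import time

    import requests

    URLS = ["https://example.com/page/%d" % i for i in range(1, 101)]  # placeholder list
    N_PARTS = 5

    def scrape_page(html):
        """Placeholder for the site-specific parsing and database insert."""
        pass

    def main():
        part = int(sys.argv[1])       # 0 .. N_PARTS-1, chosen per run
        chunk = URLS[part::N_PARTS]   # every N_PARTS-th URL, starting at `part`
        for url in chunk:
            response = requests.get(url, timeout=30)
            scrape_page(response.text)
            time.sleep(5)             # keep the existing 5-second gap

    if __name__ == "__main__":
        main()

Each part could then be scheduled separately, for example with one cron entry per chunk index (0 through 4).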


Solution

  • Write your bot so that it is more polite, i.e. don't fetch everything sequentially, but add delays in strategic places (a sketch of this follows below).
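
Below is a minimal sketch of such a "polite" fetch loop, not the original scraper's code: it checks robots.txt before each request, identifies itself with a descriptive User-Agent, and sleeps a randomized interval between requests instead of hammering pages back to back. BASE, USER_AGENT, and the delay bounds are assumptions to adjust for the actual site:

    # Polite fetching sketch: respect robots.txt, identify the bot,
    # and add a randomized delay between requests.
    import random
    import time
    import urllib.robotparser

    import requests

    BASE = "https://example.com"                                # assumed target site
    USER_AGENT = "MyScraperBot/1.0 (contact: me@example.com)"   # identify yourself

    robots = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
    robots.read()

    session = requests.Session()
    session.headers["User-Agent"] = USER_AGENT

    def polite_get(url, min_delay=5.0, max_delay=15.0):
        """Fetch a URL only if robots.txt allows it, then pause a random interval."""
        if not robots.can_fetch(USER_AGENT, url):
            return None
        response = session.get(url, timeout=30)
        response.raise_for_status()
        time.sleep(random.uniform(min_delay, max_delay))  # vary the gap between requests
        return response.text

Randomizing the delay (rather than a fixed 5 seconds) makes the traffic look less machine-like, and a contact address in the User-Agent gives the site operator a way to reach you instead of simply blocking the IP.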