selenium, web-crawler, google-crawlers

How to crawl websites without getting blocked?


I crawl websites frequently, at a rate of hundreds of requests per hour.

  1. How can I make the crawler's behavior more like a human's?
  2. How can I avoid being flagged by bot-detection systems?

I am currently crawling the site with Selenium and Chrome.

Kindly suggest.


Solution

  • You will have to pause the script between loop iterations, so that requests are not sent back to back.

    import time
    N = 3          # example value: number of seconds to wait
    time.sleep(1)  # pause for 1 second
    time.sleep(N)  # or pause for N seconds
    

    So, it could hypothetically work like this.

    import time
    from string import ascii_lowercase
    
    import pandas as pd
    import requests
    
    alldata = []
    for c in ascii_lowercase:
        # Query the station lookup once per letter of the alphabet.
        response = requests.get('https://reservia.viarail.ca/GetStations.aspx?q=' + c)
        # Parse the JSON body and keep only the columns of interest.
        df = pd.DataFrame(response.json(), columns=['sc', 'sn', 'pv'])  # etc.
        alldata.append(df)
        time.sleep(3)  # wait 3 seconds before the next request
    

    Or, look for an API to grab data from the URL you are targeting. You didn't post an actual URL, so it's impossible to say for sure if an API is exposed or not.
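
A fixed pause is easy to spot as a pattern, so one variation on the sleep approach is to randomize the wait between requests. The sketch below is only an illustration: `jittered_delay` and `crawl` are hypothetical helper names, the `base` and `jitter` values are arbitrary, and there is no guarantee that this alone keeps a crawler from being blocked.

```python
import random
import time


def jittered_delay(base=3.0, jitter=2.0):
    """Return a delay in seconds drawn uniformly from [base, base + jitter]."""
    return base + random.uniform(0.0, jitter)


def crawl(urls, fetch, sleep=time.sleep, base=3.0, jitter=2.0):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    fetch and sleep are passed in so the loop can be used with any client
    (requests, Selenium, ...) and exercised without real waiting.
    """
    urls = list(urls)
    results = []
    for i, url in enumerate(urls):
        results.append(fetch(url))
        if i < len(urls) - 1:  # no need to wait after the last request
            sleep(jittered_delay(base, jitter))
    return results
```

With `requests`, this could be called as `crawl(urls, fetch=requests.get)`; with Selenium, `fetch` could be a small function that calls `driver.get(url)` and returns `driver.page_source`.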