python web-scraping proxy python-requests http-proxy

How to rotate proxies with Python requests


I'm trying to do some scraping, but I get blocked every 4 requests. I have tried changing proxies, but the error is the same. What should I do to rotate them properly?

Here is the code I'm using. First I get proxies from a free website. Then I make the request with the new proxy, but it doesn't work because I still get blocked.

from fake_useragent import UserAgent
import requests

def get_player(id,proxy):
    ua=UserAgent()
    headers = {'User-Agent':ua.random}

    url='https://www.transfermarkt.es/jadon-sancho/profil/spieler/'+str(id)

    try:
        print(proxy)
        r=requests.get(url,headers=headers,proxies=proxy)
    except:

....
code to manage the data
....

Getting proxies

from bs4 import BeautifulSoup

def get_proxies():
    ua=UserAgent()
    headers = {'User-Agent':ua.random}
    url='https://free-proxy-list.net/'

    r=requests.get(url,headers=headers)
    page = BeautifulSoup(r.text, 'html.parser')

    proxies=[]

    for proxy in page.find_all('tr'):
        i=ip=port=0

        # The first two cells of each row hold the IP address and the port.
        for data in proxy.find_all('td'):
            if i==0:
                ip=data.get_text()
            if i==1:
                port=data.get_text()
            i+=1

        if ip!=0 and port!=0:
            proxies+=[{'http':'http://'+ip+':'+port}]

    return proxies

Calling functions

proxies=get_proxies()
for i in range(1,100):
    player=get_player(i,proxies[i//4])

....
code to manage the data  
....

I know that the proxy scraping works because when I print them I see something like: {'http': 'http://88.12.48.61:42365'}. I would like to avoid getting blocked.
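
One thing worth checking in the code above: requests picks a proxy by the scheme of the target URL, and the proxy dictionaries here only define an 'http' key while the target URL is https://. In that case the request would go out directly from your own IP, which may be why changing proxies makes no difference. Below is a minimal sketch of a rotation that covers both schemes. The helper names and the second address are hypothetical, and it assumes the proxies support HTTPS tunnelling, which many free ones do not.

```python
from itertools import cycle


def build_proxy_dict(ip, port):
    """Return a requests-style proxies mapping covering both URL schemes."""
    address = f"http://{ip}:{port}"
    # requests selects the proxy by the scheme of the target URL, so an
    # https:// URL needs an "https" entry or the request bypasses the proxy.
    return {"http": address, "https": address}


def rotating_proxies(addresses):
    """Return an iterator that yields proxy mappings round-robin, forever."""
    return cycle(build_proxy_dict(ip, port) for ip, port in addresses)


# Hypothetical addresses, for illustration only.
pool = rotating_proxies([("88.12.48.61", "42365"), ("10.0.0.2", "8080")])
first = next(pool)
second = next(pool)
third = next(pool)  # wraps around to the first proxy again
```

Each request would then take the next mapping, for example requests.get(url, headers=headers, proxies=next(pool), timeout=10), so a dead proxy fails quickly and the following call moves on to a different one.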


Solution

  • I recently had this same issue, but using online proxy servers, as recommended in other answers, is risky (from a privacy standpoint), slow, or unreliable.

    Instead, you can use my requests-ip-rotator Python library to proxy traffic through AWS API Gateway, which gives you a new IP each time: pip install requests-ip-rotator

    This can be used as follows (for your site specifically):

    import requests
    from requests_ip_rotator import ApiGateway, EXTRA_REGIONS
    
    gateway = ApiGateway("https://www.transfermarkt.es")
    gateway.start()
    
    session = requests.Session()
    session.mount("https://www.transfermarkt.es", gateway)
    
    response = session.get("https://www.transfermarkt.es/jadon-sancho/profil/spieler/your_id")
    print(response.status_code)
    
    # Only run this line if you are no longer going to run the script, as it takes longer to boot up again next time.
    gateway.shutdown() 
    

    Combined with multithreading/multiprocessing, you'll be able to scrape the site in no time.

    The AWS free tier provides you with 1 million requests per region, so this option will be free for all reasonable scraping.
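
The multithreading suggestion above could be sketched roughly as follows, using the standard library's ThreadPoolExecutor. The function names and the worker count are assumptions for illustration, not part of requests-ip-rotator. The session argument is expected to be the requests.Session with the gateway already mounted, as in the snippet above.

```python
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "https://www.transfermarkt.es/jadon-sancho/profil/spieler/"


def fetch_status(session, player_id):
    """Fetch one player page and return (player_id, HTTP status code)."""
    response = session.get(BASE_URL + str(player_id))
    return player_id, response.status_code


def fetch_all(session, player_ids, max_workers=8):
    """Fetch many player pages concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as player_ids.
        return list(pool.map(lambda pid: fetch_status(session, pid), player_ids))
```

Calling fetch_all(session, range(1, 100)) would return a list of (id, status) pairs. Sharing one Session across threads usually works for simple GET requests, but requests does not document it as thread-safe, so creating one session per thread is the more cautious option.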