web-scraping, python-requests, cloudflare, http-status-code-403

Website always returning a <Response 403>


I'm trying to fetch HTML pages from a website (Cardmarket) using the requests library, but I always get a 403 response. Note that they added a "connection security check" page, which reappears every time I delete my cookies. I think they just added Cloudflare.

I read online that the 403 could be due to the headers being sent, and that it would be better to use proxies, but neither change helped.
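For reference, headers.yml maps a browser name to a set of request headers; a trimmed sketch of its structure (the header values here are placeholders, not my real file):

Chrome:
  User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
  Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
Firefox:
  User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/114.0"
  Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"

Here is the code I used: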

import requests
import pandas as pd
import yaml

with open("headers.yml") as f_headers:
    browser_headers = yaml.safe_load(f_headers)

# fetch a table of free proxies (assumed source: free-proxy-list.net, which matches the column names below)
response = requests.get("https://free-proxy-list.net/")
proxy_list = pd.read_html(response.text)[0]
proxy_list["url"] = "http://" + proxy_list["IP Address"] + ":" + proxy_list["Port"].astype(str)
print(proxy_list.head())

# keep only the proxies that support HTTPS
https_proxies = proxy_list[proxy_list["Https"] == "yes"]
print(https_proxies.count())

# test each proxy against httpbin.org/ip and keep the ones that respond
url = "https://httpbin.org/ip"
good_proxies = set()
headers = browser_headers["Chrome"]
for proxy_url in https_proxies["url"]:
    proxies = {
        "http": proxy_url,
        "https": proxy_url,
    }
    
    try:
        response = requests.get(url, headers=headers, proxies=proxies, timeout=2)
        response.raise_for_status()  # discard proxies that connect but return an error status
        good_proxies.add(proxy_url)
        print(f"Proxy {proxy_url} OK, added to good_proxy list")
    except Exception:
        pass


url = "https://www.cardmarket.com/en/Magic"
for browser, headers in browser_headers.items():
    print(f"\n\nUsing {browser} headers\n")
    for proxy_url in good_proxies:
        proxies = {
            "http": proxy_url,
            "https": proxy_url,
        }
        try:
            response = requests.get(url, headers=headers, proxies=proxies, timeout=2)
            print(response)  # always prints <Response [403]>
        except Exception:
            print(f"Proxy {proxy_url} failed, trying another one")

I'm new to scraping, so I don't understand why it keeps returning this response. Can someone explain why this happens and how to solve it?
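One way to confirm that the 403 really comes from Cloudflare is to inspect the response headers: Cloudflare-fronted sites normally report Server: cloudflare and attach a CF-RAY header to every response. A minimal check (this assumes a standard Cloudflare challenge, nothing Cardmarket-specific):

import requests

response = requests.get("https://www.cardmarket.com/en/Magic",
                        headers={"User-Agent": "Mozilla/5.0"})
print(response.status_code)             # 403
print(response.headers.get("Server"))   # "cloudflare" when Cloudflare is in front
print(response.headers.get("CF-RAY"))   # Cloudflare's per-request id header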


Solution

  • If anyone is struggling with this problem, I found a solution using undetected_chromedriver, which I found here. Here's the code I use:

    import undetected_chromedriver as uc
    import time

    text_file = open("output.txt", "w")

    options = uc.ChromeOptions()
    options.add_argument('--headless')
    driver = uc.Chrome(use_subprocess=True, options=options)
    # load the page; the Cloudflare check runs inside a real browser
    driver.get("https://www.cardmarket.com/en/Magic/Products/Singles/March-of-the-Machine/Ozolith-the-Shattered-Spire")
    driver.maximize_window()
    time.sleep(6)  # give the Cloudflare check time to finish
    html_code = driver.page_source  # get the HTML code of the page
    text_file.write(html_code)  # write the HTML code to the text file
    driver.save_screenshot("datacamp.png")
    driver.quit()  # quit() (rather than close()) also shuts down the chromedriver process

    text_file.close()
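Once the challenge has been passed and the page source saved, the HTML can be parsed like any other page. A minimal follow-up sketch with BeautifulSoup (just a sanity check; any real selectors would have to match Cardmarket's actual markup):

    from bs4 import BeautifulSoup

    with open("output.txt") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    # if the <title> is the product name rather than Cloudflare's
    # "Just a moment..." page, the check was passed successfully
    print(soup.title.string if soup.title else "no <title> found")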