I'm trying to fetch HTML pages from a website (Cardmarket) using the requests library, but I always get a 403 response. I noticed that they added a "connection security check" page, which reappears every time I delete my cookies. I think they just added Cloudflare.
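You can check that Cloudflare is the one answering by looking at the response headers; Cloudflare usually (though not always) sets a Server: cloudflare and a CF-RAY header on its challenge pages, so take this as a quick diagnostic rather than a guarantee:

import requests

resp = requests.get("https://www.cardmarket.com/en/Magic")
print(resp.status_code)              # 403 when the security check is triggered
print(resp.headers.get("Server"))    # typically "cloudflare" on challenge pages
print(resp.headers.get("CF-RAY"))    # Cloudflare request id, present when Cloudflare handled it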
I read online that it could be due to the headers being sent, and that it would be better to use proxies, but nothing changed. Here is the code I used (I use a YAML file containing different sets of headers):
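For reference, headers.yml maps a browser name to a set of request headers. A trimmed illustration of the structure the code expects (not my exact values):

Chrome:
  User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
  Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
  Accept-Language: "en-US,en;q=0.9"
Firefox:
  User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:112.0) Gecko/20100101 Firefox/112.0"
  Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
  Accept-Language: "en-US,en;q=0.5"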
import requests
import pandas as pd
import yaml

with open("headers.yml") as f_headers:
    browser_headers = yaml.safe_load(f_headers)

# Fetch a table of free proxies. (My original snippet used `response` before
# defining it; the table comes from a page like free-proxy-list.net, whose
# columns match the ones used below.)
response = requests.get("https://free-proxy-list.net/")
proxy_list = pd.read_html(response.text)[0]
proxy_list["url"] = "http://" + proxy_list["IP Address"] + ":" + proxy_list["Port"].astype(str)
print(proxy_list.head())

https_proxies = proxy_list[proxy_list["Https"] == "yes"]
print(https_proxies.count())

# Validate each proxy against httpbin before using it on the real site
url = "https://httpbin.org/ip"
good_proxies = set()
headers = browser_headers["Chrome"]
for proxy_url in https_proxies["url"]:
    proxies = {
        "http": proxy_url,
        "https": proxy_url,
    }
    try:
        response = requests.get(url, headers=headers, proxies=proxies, timeout=2)
        response.raise_for_status()  # only keep proxies that actually return 200
        good_proxies.add(proxy_url)
        print(f"Proxy {proxy_url} OK, added to good_proxy list")
    except Exception:
        pass
# Now try the real target with every validated proxy and each set of headers
url = "https://www.cardmarket.com/en/Magic"
for browser, headers in browser_headers.items():
    print(f"\n\nUsing {browser} headers\n")
    for proxy_url in good_proxies:
        proxies = {
            "http": proxy_url,
            "https": proxy_url,
        }
        try:
            response = requests.get(url, headers=headers, proxies=proxies, timeout=2)
            # the page is HTML, not JSON, so print the status instead of response.json()
            print(response.status_code)
        except Exception:
            print(f"Proxy {proxy_url} failed, trying another one")
I'm new to scraping, so I don't understand why it keeps returning this 403. Can someone explain why this happens and how to solve it?
If anyone is struggling with this problem, I found a solution using undetected_chromedriver, which I found here. Since Cloudflare's check runs JavaScript in a real browser, driving an actual Chrome instance passes it where plain requests cannot. Here's the code I use:
import undetected_chromedriver as uc
import time

options = uc.ChromeOptions()
options.add_argument('--headless')
driver = uc.Chrome(use_subprocess=True, options=options)
try:
    driver.get("https://www.cardmarket.com/en/Magic/Products/Singles/March-of-the-Machine/Ozolith-the-Shattered-Spire")
    driver.maximize_window()
    time.sleep(6)  # give the Cloudflare challenge time to complete
    html_code = driver.page_source  # get the HTML code of the page
    with open("output.txt", "w") as text_file:
        text_file.write(html_code)  # write the HTML code to the text file
    driver.save_screenshot("datacamp.png")
finally:
    driver.quit()  # quit() (not close()) so the Chrome process is cleaned up
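If you want to avoid the fixed time.sleep(6), an explicit wait works too, since undetected_chromedriver's driver is a regular Selenium driver underneath. A minimal sketch, where the h1 check is just a placeholder for whatever element tells you the real page (not the challenge) has loaded:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 20s for the page content instead of sleeping blindly
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.TAG_NAME, "h1"))  # hypothetical "page is ready" signal
)
html_code = driver.page_source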