Hello, can you help me? When trying to extract JSON from a webpage, it works with some URLs from the same page, but with others I get a 403 error. The URLs are:
My sample code:
import requests
import json
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
})

def extract_json_from_falabella(url):
    try:
        response = session.get(url)
        response.raise_for_status()  # raises an exception if the response is not successful (2xx code)
        soup = BeautifulSoup(response.content, 'html.parser')
        script_tag = soup.find('script', id='__NEXT_DATA__')
        if script_tag:
            json_text = script_tag.string.strip()
            data = json.loads(json_text)
            return data
        else:
            print("Script with id='__NEXT_DATA__' not found.")
            return None
    except requests.exceptions.HTTPError as http_err:
        print(f"HTTP error: {http_err}")
        return None
    except Exception as err:
        print(f"An error occurred: {err}")
        return None

url = "https://www.falabella.com/falabella-cl/category/cat7330051/Mujer?facetSelected=true&f.derived.variant.sellerId=FALABELLA%3A%3ASODIMAC&page=1"
data = extract_json_from_falabella(url)
if data:
    with open('falabella_data.json', 'w', encoding='utf-8') as json_file:
        json.dump(data, json_file, ensure_ascii=False, indent=4)
    print("Data saved to 'falabella_data.json'")
else:
    print("Could not extract the JSON data.")
Can you see the problem?
This is Cloudflare protection. I don't know why it is applied only on some paths and not others, but it is passive protection that uses TLS/JA3/HTTP2 fingerprinting to block bots and scrapers.
Fortunately, in this scenario it can be bypassed by impersonating a browser's fingerprints with curl_cffi, which has a requests-like API.
Since this site uses an API, we can retrieve the data directly in JSON format instead of extracting it from the HTML.
The code below retrieves the results for this page:
https://www.falabella.com/falabella-cl/category/cat7330051/Mujer?facetSelected=true&f.derived.variant.sellerId=FALABELLA&page=1
from curl_cffi import requests

def get_pid():
    url = 'https://www.falabella.com/s/geo/v2/districts/cl?politicalId=default'
    response = requests.get(url)
    data = response.json().get('data', {})
    return data.get('politicalId')

api_url = "https://www.falabella.com/s/browse/v1/listing/cl"

# pid does not seem to change/expire, so you can replace it with a string value
pid = get_pid()

params = {
    'f.derived.variant.sellerId': 'FALABELLA',
    'facetSelected': True,
    'page': 1,
    'categoryId': 'cat7330051',
    'categoryName': 'Mujer',
    'pid': pid,
}

response = requests.get(api_url, params=params, impersonate='chrome')
data = response.json()['data']
pagination = data['pagination']
results = data['results']
print(f'{len(results) = }')
Don't forget to install curl_cffi using pip:
pip install curl_cffi --upgrade
Note: I have removed 2 params (pgid and zones) that did not seem to do anything; if you notice any discrepancy between these results and the ones in the HTML (__NEXT_DATA__), you could try adding them back (copy them from devtools).