I made a Python program to scrape data from a couple of shopping sites. It was working fine on both until recently.
URL2 - https://www.continente.pt/produto/papa-infantil-farinha-lactea-6m-cerelac-2004388.html
I use the following simple code:
import requests
from bs4 import BeautifulSoup
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
# ... and then I have my code to parse stuff...
The problem is: on URL1 everything is fine, and if I print(soup) I get the page HTML as seen in the page source in a browser. On URL2, however, I get what seems to be script code (please see the attached image), and my parsing code then fails because it can't find the elements. If I open the page in a browser, it renders correctly and I can view the source code as expected.
I am obviously a newbie, but this looks like some kind of protection against scraping. Is there anything I can do?
Thanks!
The "script code" you're seeing is minified JavaScript. I assume it makes a request to a Continente server and then populates the page with the result, so the HTML that requests downloads does not yet contain the elements you are looking for. The easiest way around this is to drive a real browser with Selenium and ChromeDriver, which executes the JavaScript and populates the page for you, almost exactly as a normal browser would.
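from selenium import webdriver
from bs4 import BeautifulSoup
# Launch a Chrome window controlled by Selenium
driver = webdriver.Chrome()
driver.get("https://www.continente.pt/produto/papa-infantil-farinha-lactea-6m-cerelac-2004388.html")
# page_source is the DOM as the browser currently has it, not the raw download
soup = BeautifulSoup(driver.page_source, "html.parser")
# ... parse as before ...
driver.quit()  # close the browser when you are done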
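One caveat: driver.get() returns once the initial page load has finished, but the page's scripts may still be filling in content afterwards, so reading page_source straight away can occasionally miss the elements you want. Selenium ships its own explicit-wait helper (WebDriverWait) for this. The sketch below is not that API but a small dependency-free polling loop that illustrates the same idea. The function name wait_for_rendered_html and the marker argument are hypothetical; marker is assumed to be some substring that only appears once the JavaScript has populated the page.

```python
import time


def wait_for_rendered_html(driver, marker, timeout=10.0, poll_interval=0.5):
    """Return driver.page_source once it contains `marker`.

    `driver` is any object exposing a `page_source` attribute, such as a
    Selenium WebDriver. `marker` is an assumed substring (e.g. a class name)
    that is only present after the page's scripts have run.
    Raises TimeoutError if the marker never shows up within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source  # re-read each time: the DOM changes as scripts run
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found after {timeout} seconds")
        time.sleep(poll_interval)
```

After driver.get(...), you would call something like html = wait_for_rendered_html(driver, "some-class-name") and pass html to BeautifulSoup instead of driver.page_source. The marker string here is a placeholder; pick one by inspecting the rendered page in your browser.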