
Beautiful Soup returns script language instead of HTML


I made a Python program to scrape data from a couple of shopping sites. It was working fine on both until recently.

URL1 - https://www.auchan.pt/pt/alimentacao/alimentacao-bebe-e-crianca/papa-e-farinha-lactea/farinha-cerelac-lactea-500g/70511.html

URL2 - https://www.continente.pt/produto/papa-infantil-farinha-lactea-6m-cerelac-2004388.html

I use the following simple code:

import requests
from bs4 import BeautifulSoup

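# URL2 from above; URL1 is fetched the same way
url = "https://www.continente.pt/produto/papa-infantil-farinha-lactea-6m-cerelac-2004388.html"
response = requests.get(url)

soup = BeautifulSoup(response.content, "html.parser")

# ... and then I have my code to parse stuff...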

The problem: on URL1 everything works, and if I print(soup), I get the same HTML that the browser shows as the page source. On URL2, however, I get what seems to be script code (please see the attached image), and my parsing code then fails because it can't find the elements. If I open the page in a browser, it renders correctly and I can view the source as expected.

image

I am obviously a newbie, but this looks like some kind of protection against scraping. Is there anything I can do?

Thanks!


Solution

  • The "script language" you're seeing is minified JavaScript. I assume it requests the product data from Continente's servers and then populates the page. Because requests does not execute JavaScript, the easiest fix is to drive a real browser with Selenium and ChromeDriver: it runs the scripts and populates the page almost exactly as a normal browser would.

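    from selenium import webdriver
    from bs4 import BeautifulSoup

    # Launch a real Chrome instance so the page's JavaScript actually runs
    driver = webdriver.Chrome()
    try:
        driver.get("https://www.continente.pt/produto/papa-infantil-farinha-lactea-6m-cerelac-2004388.html")

        # page_source holds the DOM as it stands after the scripts have run
        soup = BeautifulSoup(driver.page_source, "html.parser")
    finally:
        # Always close the browser, even if loading or parsing fails
        driver.quit()

    # ...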
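
If you want to keep plain requests for the sites where it works and only start a browser when needed, a rough check along these lines might help. This is only a sketch using the standard library: the names `ScriptRatioParser` and `looks_like_js_stub` and the 0.8 threshold are my own assumptions, not anything Continente or Beautiful Soup defines, so tune it against the responses you actually get.

```python
from html.parser import HTMLParser


class ScriptRatioParser(HTMLParser):
    """Tally characters inside <script> tags versus other text (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._current = None  # "script" or "style" while inside one of those tags
        self.script_chars = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        size = len(data.strip())
        if self._current == "script":
            self.script_chars += size
        elif self._current is None:
            # Text outside <script>/<style>; CSS is ignored entirely
            self.text_chars += size


def looks_like_js_stub(html, threshold=0.8):
    """Return True when at least `threshold` of the text is script code.

    The threshold is an arbitrary assumption, not a known cutoff.
    """
    parser = ScriptRatioParser()
    parser.feed(html)
    parser.close()
    total = parser.script_chars + parser.text_chars
    if total == 0:
        return True  # nothing usable to parse either way
    return parser.script_chars / total >= threshold
```

You would call it as `looks_like_js_stub(response.text)` and fall back to the Selenium version above when it returns True. It is a heuristic: a normal page with a lot of inline script could also trip it, so checking for the specific element you intend to parse is a more reliable test once you know what that element is.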