python selenium-webdriver web-scraping python-requests python-requests-html

Difficulty scraping an HTML page from a dynamically generated website with Python


I'm trying to retrieve some data from a website with Python. The website seems to generate its content with JavaScript, so I cannot use the standard requests library. I tried requests-html and Selenium, which both handle JavaScript content, but I still cannot get the rendered HTML of this page.

I expect to get exactly the same view as when I inspect the page in my browser. For instance, there I can clearly see all the information about the open positions. But when I fetch the page source with requests-html or Selenium, I get a page without any information about the open positions.

For instance, if I want to retrieve the name of an open position, it is located in the span with the class "ais-Highlight-nonHighlighted". I can see it in my browser, but I am not able to get this data with Python.
[Screenshot: the HTML page as seen in the browser inspector, showing the data to retrieve (the job position name)]

What I want is to get the HTML of the webpage, just as I would with requests, and then process it with BeautifulSoup.

I tried with requests-html:

from requests_html import HTMLSession
url = "https://www.lvmh.com/en/join-us/our-job-offers?PRD-en-us-timestamp-desc%5BrefinementList%5D%5Bmaison%5D%5B0%5D=Kendo"

session = HTMLSession()
r = session.get(url)
r.html.render(wait=5)

print(r.html.html)
print(r.html.text)
print(r.text)
job_name = r.html.find('.ais-Highlight-nonHighlighted')

session.close()

--> The print calls do not display the job position name, and job_name is empty.

I tried with Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
url = "https://www.lvmh.com/en/join-us/our-job-offers?PRD-en-us-timestamp-desc%5BrefinementList%5D%5Bmaison%5D%5B0%5D=Kendo"

driver = webdriver.Safari()
driver.get(url)

data_source = driver.page_source
data_execute = driver.execute_script("return document.body.innerHTML")

driver.quit()

--> Neither data_source nor data_execute includes the job position name.

Neither worked. If anyone can help me with this, I would be grateful.


Solution

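  • Your approach using Selenium is not optimal. The job listings are most likely added to the page by JavaScript after the initial load, so reading the page source right after driver.get() can return the HTML before they exist. Rather than fetching the page source, use Selenium's built-in waits and element locators.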

    For example:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    from selenium.webdriver import ChromeOptions
    
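    def text(e):
        # Prefer the visible text; fall back to textContent for elements
        # that are in the DOM but not currently displayed.
        if r := e.text:
            return r
        return e.get_attribute("textContent")
    
    options = ChromeOptions()
    options.add_argument("--headless=true")  # run Chrome without a window
    
    url = "https://www.lvmh.com/en/join-us/our-job-offers?PRD-en-us-timestamp-desc%5BrefinementList%5D%5Bmaison%5D%5B0%5D=Kendo"
    
    # Pass options by keyword so the call works across Selenium 4 releases.
    with webdriver.Chrome(options=options) as driver:
        driver.get(url)
        # Wait up to 10 seconds for the JavaScript-rendered spans to appear.
        wait = WebDriverWait(driver, 10)
        selector = By.CSS_SELECTOR, "span.ais-Highlight-nonHighlighted"
        for span in wait.until(EC.presence_of_all_elements_located(selector)):
            print(text(span))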
    

    Output (partial):

    ...
    
    COPY OF DIRECTOR, VISUAL DESIGN
    KENDO
    San Francisco
    United States
    Permanent Job
    Minimum 10 years
    Full Time
    PACKAGING CREATIVE DIRECTOR
    KENDO
    San Francisco
    United States
    Permanent Job
    Minimum 10 years
    Full Time
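
The output above is one flat list of span texts. If one record per posting is more convenient, the list can be grouped afterwards. The following is only a sketch under two assumptions taken from the partial output: every posting contributes exactly seven spans, and the list starts at a posting boundary. The field names in FIELDS and the helper names group_postings and scrape_postings are hypothetical and are not something the site or Selenium provides.

```python
# Hypothetical field names, inferred from the order of the spans in the
# partial output above; the page itself does not label them this way.
FIELDS = ("title", "maison", "city", "country",
          "contract", "experience", "schedule")

# The two postings shown in the partial output, as a flat list.
SAMPLE = [
    "COPY OF DIRECTOR, VISUAL DESIGN", "KENDO", "San Francisco",
    "United States", "Permanent Job", "Minimum 10 years", "Full Time",
    "PACKAGING CREATIVE DIRECTOR", "KENDO", "San Francisco",
    "United States", "Permanent Job", "Minimum 10 years", "Full Time",
]


def group_postings(texts, fields=FIELDS):
    """Turn a flat list of span texts into one dict per job posting."""
    size = len(fields)
    # Ignore a trailing partial group rather than emit a half-filled record.
    usable = len(texts) - len(texts) % size
    return [dict(zip(fields, texts[i:i + size]))
            for i in range(0, usable, size)]


def scrape_postings(url, timeout=10):
    """Load the page as in the answer above and return grouped postings."""
    # Imported here so group_postings can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver import ChromeOptions
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = ChromeOptions()
    options.add_argument("--headless=true")
    selector = (By.CSS_SELECTOR, "span.ais-Highlight-nonHighlighted")

    with webdriver.Chrome(options=options) as driver:
        driver.get(url)
        # Block until the JavaScript-rendered spans are present in the DOM.
        spans = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located(selector))
        texts = [s.text or s.get_attribute("textContent") for s in spans]
    return group_postings(texts)


if __name__ == "__main__":
    for posting in group_postings(SAMPLE):
        print(posting["title"], "|", posting["city"])
```

If the goal is still to process the page with BeautifulSoup, reading driver.page_source inside the with block, after the wait has returned, should give HTML that already contains these spans.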