I'm trying to retrieve some data from a website with Python. The website seems to generate its content with JavaScript, so I cannot use the standard requests library. I tried requests-html and Selenium, which both handle JavaScript content, but I still cannot get the rendered HTML of this page.
I expect to get exactly what I see when I inspect the page in my browser, where I can clearly see all the information about the open positions. But when I fetch the page source with requests-html or Selenium, I get a page without any information about the open positions.
For instance, the name of an open position is located in a span with the class "ais-Highlight-nonHighlighted". I can see it in my browser, but I am not able to get this data with Python.
HTML page when inspecting through my browser, showing the data to retrieve (the job position name)
What I want is to get the HTML of the web page, just as I would with requests, and then process it with BeautifulSoup.
I tried with requests-html:
from requests_html import HTMLSession
url = "https://www.lvmh.com/en/join-us/our-job-offers?PRD-en-us-timestamp-desc%5BrefinementList%5D%5Bmaison%5D%5B0%5D=Kendo"
session = HTMLSession()
r = session.get(url)
r.html.render(wait=5)
print(r.html.html)
print(r.html.text)
print(r.text)
job_name = r.html.find('.ais-Highlight-nonHighlighted')
session.close()
--> The print calls do not display the job position name, and job_name is empty.
I tried with Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
url = "https://www.lvmh.com/en/join-us/our-job-offers?PRD-en-us-timestamp-desc%5BrefinementList%5D%5Bmaison%5D%5B0%5D=Kendo"
driver = webdriver.Safari()
driver.get(url)
data_source = driver.page_source
data_execute = driver.execute_script("return document.body.innerHTML")
driver.quit()
--> Neither data_source nor data_execute includes the job position name.
Neither approach worked. If anyone can help me with this, I would be grateful.
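Your approach with Selenium is not optimal. The page source is most likely being read before the JavaScript has finished rendering the job list, so don't grab the source right after loading the page. Instead, use Selenium's built-in waits and element lookups.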
For example:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver import ChromeOptions
def text(e):
    # Prefer the visible text; fall back to textContent for hidden elements
    if r := e.text:
        return r
    return e.get_attribute("textContent")
options = ChromeOptions()
options.add_argument("--headless=true")
url = "https://www.lvmh.com/en/join-us/our-job-offers?PRD-en-us-timestamp-desc%5BrefinementList%5D%5Bmaison%5D%5B0%5D=Kendo"
with webdriver.Chrome(options=options) as driver:
    driver.get(url)
    # Wait up to 10 seconds for the JavaScript-rendered spans to appear
    wait = WebDriverWait(driver, 10)
    selector = By.CSS_SELECTOR, "span.ais-Highlight-nonHighlighted"
    for span in wait.until(EC.presence_of_all_elements_located(selector)):
        print(text(span))
Output (partial):
...
COPY OF DIRECTOR, VISUAL DESIGN
KENDO
San Francisco
United States
Permanent Job
Minimum 10 years
Full Time
PACKAGING CREATIVE DIRECTOR
KENDO
San Francisco
United States
Permanent Job
Minimum 10 years
Full Time
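The output above is a flat list of span texts, and in the part shown each job appears to take seven consecutive entries. If that pattern holds for the whole page (an assumption based only on this partial output), the texts could be grouped into one record per job. This is a minimal sketch: the field names and the `group_jobs` helper are hypothetical and do not come from the site or from Selenium.

```python
from itertools import islice

# Hypothetical field names, inferred from the sample output above;
# the site itself does not label the spans this way.
FIELDS = ("title", "maison", "city", "country",
          "contract", "experience", "schedule")


def group_jobs(texts, fields=FIELDS):
    """Split a flat list of span texts into one dict per job offer."""
    it = iter(texts)
    jobs = []
    while True:
        chunk = list(islice(it, len(fields)))
        if len(chunk) < len(fields):
            break  # drop an incomplete trailing group rather than guess
        jobs.append(dict(zip(fields, chunk)))
    return jobs


# The two complete groups visible in the partial output
sample = [
    "COPY OF DIRECTOR, VISUAL DESIGN", "KENDO", "San Francisco",
    "United States", "Permanent Job", "Minimum 10 years", "Full Time",
    "PACKAGING CREATIVE DIRECTOR", "KENDO", "San Francisco",
    "United States", "Permanent Job", "Minimum 10 years", "Full Time",
]

for job in group_jobs(sample):
    print(job["title"], "|", job["city"])
```

With the Selenium code, the input list could be built as `[text(span) for span in wait.until(...)]` instead of printing each span. If some offers turn out to have fewer or more spans, fixed-size grouping will misalign, and locating each job's container element first would be the safer route.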