Tags: python, angular, selenium-webdriver, web-scraping, dynamic-content

How to programmatically inspect and retrieve dynamic content from an Angular website using Python?


I'm trying to scrape a website built with Angular using Python, but I'm encountering issues with retrieving the dynamically generated content. When I make a direct HTTP request or view the page source, I only get the initial HTML, which contains the

    <app-root>
     <!-- empty app root -->
    </app-root> 

placeholder. However, when I inspect the rendered page in a browser, I can see the full content. Here is what the browser's inspector shows when I select that element on the rendered page:

    <app-root _nghost-ynj-c115 ng-version="14.3.0">
      <!-- Rendered HTML content from browser inspection -->
      ...


    </app-root>

I've tried using Selenium to wait for the content to be rendered, but I'm not sure if I'm using the correct selectors or if there's a better approach. Here's the code I've been using:

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from webdriver_manager.chrome import ChromeDriverManager

    service = Service(ChromeDriverManager().install())
    options = webdriver.ChromeOptions()
    options.headless = True
    driver = webdriver.Chrome(service=service, options=options)

    try:
        driver.get("https://www.fedlex.admin.ch/de/cc/international-law/0.1")
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "app-root ng-version"))
        )
        page_source = driver.page_source
    finally:
        driver.quit()

    print(page_source)

This code doesn't seem to retrieve the dynamic content as expected. How can I programmatically inspect the page and retrieve the full content that's rendered by Angular? Is there a specific way to interact with Angular applications using Selenium, or is there another tool or method I should consider for this task?


Solution

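  • The problem is that "app-root" is already present in the initial HTML, but empty, so its presence says nothing about whether Angular has finished rendering. Note also that the CSS selector "app-root ng-version" looks for an <ng-version> element nested inside app-root. ng-version is an attribute, not an element, so that selector never matches; the attribute form would be "app-root[ng-version]".

    Wait instead for the element in which the data is actually displayed, by changing the wait condition to:

    EC.presence_of_element_located((By.XPATH, "//div[@id='content']"))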