python selenium-webdriver web-scraping shadow-dom

Scrape Shadow root elements using Python Selenium


I am trying to scrape a website for a mini project, and the data I need is hidden under the #shadow-root node of the HTML. I tried accessing it with Selenium using the code below:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

def expand_shadow_element(element):
  shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
  return shadow_root

url = "https://new.abb.com/products/SK615502-D"

#Initializing the webdriver
options = webdriver.ChromeOptions()

driver = webdriver.Chrome(executable_path="/Users/ritchevy/Desktop/scraping-glassdoor/chromedriver", options=options)
timeout = 10
wait = WebDriverWait(driver, timeout)

driver.set_window_size(1120, 1000)
driver.get(url)

root1 = driver.find_element(By.CSS_SELECTOR,"pis-products-details-attribute-groups")
shadow_root1 = expand_shadow_element(root1)
shadow_container_root = shadow_root1.find_element(By.CSS_SELECTOR,"div")

Upon execution, it's giving me this error:

---> 35 shadow_container_root = shadow_root1.find_element(By.CSS_SELECTOR,"div")
     36 

AttributeError: 'dict' object has no attribute 'find_element'

Any idea how to resolve this?


Solution

  • I didn't have any issues running your original code, so I'm not sure why it didn't work for you. Since you are not running headless, did you see the expected page open in the browser? You might have to insert a time.sleep() call after driver.get(url) to ensure that the browser window is visible before you encounter the error.

    I made some minor tweaks and then grabbed the data from the tables in the shadow root node (assuming that this was the data that you are after).

    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    
    url = "https://new.abb.com/products/SK615502-D"
    
    options = webdriver.ChromeOptions()
    # * Use local Chrome.
    # driver = webdriver.Chrome(options=options)
    # * Use remote Chrome in Docker container.
    driver = webdriver.Remote(
      "http://127.0.0.1:4444/wd/hub",
      DesiredCapabilities.CHROME,
      options=options
    )
    
    wait = WebDriverWait(driver, 10)
    
    driver.get(url)
    
    # Find element enclosing the shadow root DOM.
    #
    root = driver.find_element(By.CSS_SELECTOR, "pis-products-details-attribute-groups")
    
    # Extract the shadow root content.
    #
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', root)
    print(shadow_root)
    
    for table in shadow_root.find_elements(By.CSS_SELECTOR, ".ext-attr-group .ext-attr-group-inner"):
        title = table.find_element(By.CSS_SELECTOR, "h4")
        print("====================================================")
        print("🟦 " + title.text)
        for row in table.find_elements(By.CSS_SELECTOR, ".ext-attr-group-content > div"):
            key = row.find_element(By.CSS_SELECTOR, ".col-md-4")
            value = row.find_element(By.CSS_SELECTOR, ".col-md-8")
            print(key.text + " " + value.text)
    

    I generally use a remote Selenium instance, but you can just comment that out and use webdriver.Chrome(options=options) instead.
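    As for the AttributeError: 'dict' object has no attribute 'find_element' itself: older chromedriver builds serialize the shadow root returned by execute_script as a raw dict rather than a usable node, which is exactly what the traceback shows. Newer Selenium (4.1+) exposes an element.shadow_root property that returns a proper ShadowRoot object. Here is a hedged, duck-typed helper sketching a fallback; it is not tied to a specific Selenium version, and the driver/element arguments are whatever your setup provides:

    ```python
    def resolve_shadow_root(driver, element):
        """Best-effort shadow-root lookup (a sketch, not a definitive API).

        Tries the Selenium 4.1+ `element.shadow_root` property first, then
        falls back to executing script. Old chromedriver builds hand the
        shadow root back as a plain dict, which has no find_element() --
        the cause of the AttributeError in the question.
        """
        shadow = getattr(element, "shadow_root", None)  # Selenium 4.1+ only
        if shadow is not None:
            return shadow
        shadow = driver.execute_script("return arguments[0].shadowRoot", element)
        if isinstance(shadow, dict):
            # Old chromedriver: the root came back serialized as a dict,
            # so fail with a clearer message instead of an AttributeError.
            raise RuntimeError(
                "shadowRoot came back as a dict; upgrade Selenium/chromedriver"
            )
        return shadow
    ```

    In practice, upgrading Selenium and chromedriver to matching recent versions is usually the simplest fix; the helper just makes the failure mode explicit if you cannot upgrade.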

    This is what some of the data look like:

    ====================================================
    🟦 Ordering
    Minimum Order Quantity: 1 piece
    Customs Tariff Number: 85389099
    Product Main Type: Accessories
    ====================================================
    🟦 Popular Downloads
    Data Sheet, Technical Information: 1SFC151007C02__
    Instructions and Manuals: 1SFC151011M0201
    CAD Dimensional Drawing: 2CDC001079B0201
    ====================================================
    🟦 Dimensions
    Product Net Width: 0.038 m
    Product Net Depth / Length: 0.038 m
    Product Net Height: 0.038 m
    Product Net Weight: 0.08 kg
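
    If you want the values in a structured form rather than printed to the console, the same group/key-value shape can be collected into a nested dict. A minimal pure-Python sketch, assuming the exact separator and 🟦-prefixed group titles emitted by the loop above (in a real scraper you would build the dict inside the scraping loop instead of parsing printed text):

    ```python
    def parse_groups(text):
        """Fold printed output like the sample above into {group: {key: value}}."""
        groups = {}
        current = None
        for line in text.splitlines():
            line = line.strip()
            if not line or set(line) == {"="}:
                continue  # skip blanks and ==== separator lines
            if line.startswith("🟦"):
                current = line.lstrip("🟦 ").strip()  # group title
                groups[current] = {}
            elif current is not None and ": " in line:
                key, value = line.split(": ", 1)
                groups[current][key] = value
        return groups
    ```

    For example, parse_groups() applied to the sample output yields a dict whose "Ordering" entry maps "Minimum Order Quantity" to "1 piece".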