Tags: javascript, selenium-webdriver, beautifulsoup, screen-scraping

Can't scrape links from a dynamic website using Selenium


I'm trying to collect links to personal profiles and contacts from the following website: https://www.dlapiper.com/en-us/people#t=All&sort=relevancy&numberOfResults=100&f:CountriesID=[United%20Kingdom]

I'm using Selenium with chromedriver for scraping, and it normally works fine. For this particular website, however, I can't get to the source HTML where the links to people's profiles would be visible.

I wrote a standard script which would normally work for any other dynamic website.

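import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

links = []
driver = webdriver.Chrome()
driver.get('https://www.dlapiper.com/en-gb/people#t=All&sort=%40lastname%20ascending&f:CountriesID=[United%20Kingdom]')
time.sleep(5)  # fixed wait for the page and the cookie banner to load
# Reject cookies so the banner does not cover the page
cookies_button = driver.find_element(By.ID, "onetrust-reject-all-handler")
cookies_button.click()
time.sleep(5)  # fixed wait for the search results to render
# Grab the HTML of the top-level document and parse it
html = driver.page_source
time.sleep(5)
soup = BeautifulSoup(html, 'html.parser')
# Collect the href of every <a> element found in that HTML
parse = soup.find_all('a')
for item in parse:
    links.append(item.get('href'))
print(links)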

However, the links from the people search block never show up in driver.page_source, even though I can find all the link elements when I press "Inspect" in Chrome. I have tried increasing the time.sleep() values, but that did not help.

I understand that a lot of JavaScript is executed on this page. Maybe I need to trigger some of it manually? Help would be much appreciated, as I don't know JavaScript.


Solution

  • The lawyers' contact details are in an iframe:

    Frame ID      myIframe
    Frame Name    Unused
    Frame Title   People Index Hosted Search Page
    Frame Source  https://www.dlapiper.com/en-US/coveosearchpages/people%20index%20hosted%20search%20page#t=All&sort=relevancy&f:CountriesID=[United%20States]
    Frame Domain  www.dlapiper.com
    Type          text/html
    Mode          CSS1Compat
    Language      en
    Encoding      UTF-8
    Modified      06/03/2024 19:25:06
    Load Time     2.52 seconds
    Source Size   361 bytes
    Position      0 - 291 pixels
    Viewport      1903 x 1500 pixels

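    Is your script scanning inside child frames? driver.page_source only returns the document Selenium is currently focused on, which is the top-level page by default, so elements that live inside the iframe are not included until the script switches into it.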
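
One way to scan inside the child frame is to switch Selenium's context into the iframe before collecting links. The following is a minimal sketch, not a tested solution for this site: it assumes the iframe ID `myIframe` from the frame details above and the cookie-button ID from the question. The function names `collect_hrefs` and `scrape_people_links`, the timeout, and the generic `a[href]` selector are illustrative assumptions; a selector specific to the search results would likely be more reliable.

```python
def collect_hrefs(driver, css="a[href]"):
    """Return the href of every matching element in the driver's current context."""
    # "css selector" is the string value behind selenium's By.CSS_SELECTOR.
    elements = driver.find_elements("css selector", css)
    return [element.get_attribute("href") for element in elements]


def scrape_people_links(url, frame_id="myIframe", timeout=20):
    """Open the page, switch into the search iframe, and return the links found there."""
    # Imported here so collect_hrefs can be used without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout)

        # Dismiss the cookie banner (button ID taken from the question).
        wait.until(
            EC.element_to_be_clickable((By.ID, "onetrust-reject-all-handler"))
        ).click()

        # Wait for the iframe and move the driver's focus into it.
        wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, frame_id)))

        # Wait until at least one link exists inside the frame.
        # Assumption: a generic selector; it may match before all results load.
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[href]")))

        return collect_hrefs(driver)
    finally:
        driver.quit()
```

Calling `scrape_people_links(...)` with the URL from the question should return the links found inside the frame rather than those of the outer page. After the switch, `driver.page_source` should also reflect the frame's document, so the existing BeautifulSoup parsing could be kept if preferred; `driver.switch_to.default_content()` moves the focus back to the outer page.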