python selenium-webdriver web-scraping

Selenium: Loop through links on a webpage and switch to the next page after collecting the data


from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait

elemental_list = []

driver = webdriver.Chrome()

for page in range(1, 21):
    page_url = "https://www.fastexpert.com/top-real-estate-agents/florida/?page=" + str(page)
    driver.get(page_url)
    WebDriverWait(driver, 60).until(expected_conditions.presence_of_all_elements_located((By.CSS_SELECTOR, "section.TOP25AGENTSE10.RTAGENTCOLUMN div.TOPREPT_RRT")))
    for agent in range(1,25):
        driver.find_element(By.TAG_NAME, "h3").click() #Finding the link on the page
        driver.find_element(By.TAG_NAME, "h1") #finding the name on the personal page
        driver.find_element(By.ID, "my_map_adress") #finding the location on the personal page

    agents = BeautifulSoup(driver.page_source, 'lxml').find('section', {'class': 'TOP25AGENTSE10 RTAGENTCOLUMN'}).find_all('div', {'class': 'TOPREPT_RRT'})

    for agent in agents:
        elemental_list.append((agent.find('h1').text.strip(), agent.find({'id': 'my_map_adress'}).text.strip())) if agent.find({'class': 'my_map_adress'}) else elemental_list.append((agent.find('h1').text, ''))

for element in elemental_list:
    print(element)

driver.quit()

I am trying to scrape data from that website. The goal is to click each agent link and scrape the name and the address from the agent's page. After visiting all 25 links on page 1, the script should move on and do the same for all 20 pages. I think my logic is right, but I'm stuck. What am I missing that breaks my code?

selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="my_map_adress"]"}
  (Session info: chrome=114.0.5735.199); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
Backtrace:
    GetHandleVerifier [0x00AAA813+48355]
    (No symbol) [0x00A3C4B1]
    (No symbol) [0x00945358]
    (No symbol) [0x009709A5]
    (No symbol) [0x00970B3B]
    (No symbol) [0x0099E232]
    (No symbol) [0x0098A784]
    (No symbol) [0x0099C922]
    (No symbol) [0x0098A536]
    (No symbol) [0x009682DC]
    (No symbol) [0x009693DD]
    GetHandleVerifier [0x00D0AABD+2539405]
    GetHandleVerifier [0x00D4A78F+2800735]
    GetHandleVerifier [0x00D4456C+2775612]
    GetHandleVerifier [0x00B351E0+616112]
    (No symbol) [0x00A45F8C]
    (No symbol) [0x00A42328]
    (No symbol) [0x00A4240B]
    (No symbol) [0x00A34FF7]
    BaseThreadInitThunk [0x770000C9+25]
    RtlGetAppContainerNamedObjectPath [0x777D7B4E+286]
    RtlGetAppContainerNamedObjectPath [0x777D7B1E+238]

Solution

  • Your code has semantic issues in multiple places.

    for agent in range(1,25):
        driver.find_element(By.TAG_NAME, "h3").click()
    

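    These lines try to click the first agent's profile link over and over, which cannot work for several reasons: find_element returns only the first matching h3, so the loop never reaches the other agents; if the click does open the profile, the later iterations run against that page instead of the listing; and range(1, 25) yields 24 iterations, not 25.

    The code that extracts the details is also outside this loop, so it only ever sees whichever page the browser had open last.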
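
    If Selenium is required, one way to avoid both problems is to read every profile URL from the listing page first and only then visit the profiles one by one. This is a minimal sketch, not tested against the live site: the helper names are made up for illustration, the selector targets the same profileLink anchors that the requests code in this answer matches, and the explicit wait from the question is left out for brevity.

```python
# Selenium's By.CSS_SELECTOR, By.TAG_NAME and By.ID are the plain strings below,
# so this sketch works with any driver-like object and needs no selenium import.
CSS_SELECTOR = "css selector"
TAG_NAME = "tag name"
ID = "id"


def collect_profile_urls(driver, listing_url):
    """Open one listing page and return the profile URLs found on it."""
    driver.get(listing_url)
    links = driver.find_elements(CSS_SELECTOR, "a.profileLink[href*='/agents/']")
    # Read the hrefs immediately: the elements go stale once the browser navigates.
    return [link.get_attribute("href") for link in links if link.get_attribute("href")]


def read_profile(driver, profile_url):
    """Open one profile page and return (name, address); either may be ''."""
    driver.get(profile_url)
    # find_elements returns an empty list instead of raising NoSuchElementException.
    names = driver.find_elements(TAG_NAME, "h1")
    addresses = driver.find_elements(ID, "my_map_adress")
    name = names[0].text.strip() if names else ""
    address = addresses[0].text.strip() if addresses else ""
    return name, address


def scrape_listing(driver, listing_url):
    """Return one (name, address) tuple per profile linked from the listing page."""
    return [read_profile(driver, url) for url in collect_profile_urls(driver, listing_url)]
```

    With a real browser this would be called as scrape_listing(webdriver.Chrome(), page_url) once per listing page, extending elemental_list with the result.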

    Issues aside, I noticed that the website is static, so you don't need Selenium unless there is some other constraint. I rewrote the code using requests instead.

    from bs4 import BeautifulSoup
    import re
    import requests

    elemental_list = []
    headers = {
        'Host': 'www.fastexpert.com'
    }

    for page_number in range(1, 21):
        page_url = f"https://www.fastexpert.com/top-real-estate-agents/florida/?page={page_number}"
        print(page_url)

        # Parse the listing page.
        page = BeautifulSoup(requests.get(page_url, headers=headers).text, 'lxml')

        # Anchors that link to the individual agent profiles.
        agents = page.find_all('a', {'class': 'profileLink', 'href': re.compile(r'/agents/')})

        for agent in agents:
            agent_url = agent.get('href')
            print(agent_url)
            agent_page = BeautifulSoup(requests.get(agent_url, headers=headers).text, 'lxml')

            # Skip profiles that have no heading or that resolve to a "Not Found" page.
            name = agent_page.find('h1')
            if name and name.text.strip() != 'Not Found':
                address = agent_page.find('address', {'id': 'my_map_adress'})
                # Store an empty address when the profile has none.
                elemental_list.append((name.text.strip(), address.text.strip() if address else ''))
                print(elemental_list[-1])
    

    The page.find_all('a', {...}) call fetches every anchor that has the class profileLink and an href matching the pattern /agents/, that is, the links to the individual agent profiles.
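
    Beautiful Soup's documentation describes a compiled pattern as being applied with its search() method, so /agents/ may match anywhere inside the href rather than only at the start. The same filter can be illustrated with the standard library alone; the href values here are made up for the example.

```python
import re

# Same pattern as in the find_all call, written as a raw string.
AGENT_HREF = re.compile(r"/agents/")

# Hypothetical href values, for illustration only.
hrefs = [
    "https://www.example.com/agents/jane-doe/",
    "https://www.example.com/top-real-estate-agents/florida/",
    "/agents/john-roe/",
    "https://www.example.com/about/",
]

# search() matches anywhere in the string, so absolute and relative
# profile links both pass, while the listing URL ("-agents/") does not.
matching = [href for href in hrefs if AGENT_HREF.search(href)]
print(matching)
```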

    This code outputs:

    ('DeWayne Carpenter', '605 Lincoln Rd #7th Floor, Miami, FL, 33139')
    ('Kirk Kessel', '719 Pine Tree, Satellite Beach, FL, 32937')
    ('Chris Creegan', '439 Lake Howell Rd, Maitland, FL, 32751')
    ('Jeff Tricoli Team', '2005 Vista Parkway Suite 100, West Palm Beach, FL, 33411')
    ('Jeffrey Borham', '2945 Alt 19 N #Ste A, Palm Harbor, FL, 34683')
    ('Brenda Wade', '1709 Bloomingdale Ave, Valrico, FL, 33596')
    ('Sandy Blanton', '1225 W. Gregory St., Pensacola, FL, 32502')
    ('Welch Team', '301 Kingsley Lake Drive, Saint Augustine, FL, 32092')
    ('Liz Piedra', '11534 Spring Hill Dr., Spring Hill, FL, 34609')
    ('Mike Gagliardi', '5889 S. Williamson Blvd, Suite 1401, Port Orange, FL, 32128')
    ('David & Toni Zarghami', '3355 Clark Rd, Suite 103 #Suite 308, Sarasota, FL, 34232')
    ('Jeffrey G. Funk', '422 Main Street, Windermere, FL, 34786')
    ('Sarah and Tim Caudill', '223 N Causeway, New Smyrna Beach, FL, 32169')
    ('Kevin Bartlett', '21301 S Tamiami trail suite 340, Estero, FL,')
    ('Christian Bennett', '11541 Trinity Blvd., Trinity, FL, 34655')
    ('Ben Laube', '422 Main StreetÂ\xa0|Â\xa0Suite 1, Windermere, FL, 34786')
    ('Nichole Roberts', '1705 E Fort King St , Ocala, FL, 34471')
    ('Sandra Rathe', '1625 N Commerce Pkwy Unit 100, Weston, FL, 33326')
    ('Cynthia Fazzini', '4175 Woodlands Pkwy, Palm Harbor, FL, 34684')
    ('Jayson Burtch', '3251 Tamiami trail, Port Charlotte, FL, 33952')
    ('Lisa Carroll', '2440 Land O lakes Blvd, Land O Lakes, FL, 34639')
    ('Jennifer Fieo', '2020 W Brandon Blvd #Ste 145, Brandon, FL, 33510')
    ('Barrett Spray', '59 Alafaya Woods Blvd, Oviedo, FL, 32765')
    ('JoAnn and Tom Jacobs', '3303 Thomasville Road #Suite 201, Tallahassee, FL, 32308')
    ('Jennifer Wemert', '650 N. Alafaya Trail, Suite 105, Orlando, FL, 32828')
    

    To scrape a different number of pages, change the range in the outer loop.
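
    One possible way to make the page range adjustable is to build the listing URLs in a small generator. The name listing_urls and its parameters are hypothetical; only the URL format comes from the code in this answer.

```python
BASE_URL = "https://www.fastexpert.com/top-real-estate-agents/florida/"


def listing_urls(first_page=1, last_page=20):
    """Yield the listing URL for every page from first_page to last_page, inclusive."""
    for page_number in range(first_page, last_page + 1):
        yield f"{BASE_URL}?page={page_number}"
```

    The outer loop then becomes for page_url in listing_urls(1, 5): to scrape only the first five pages.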