python · selenium-webdriver · web-scraping · dopostback

Using Selenium to display 'next' search results via JavaScript __doPostBack links


In the search results of the JobQuest site (http://jobquest.detma.org/JobQuest/Training.aspx), I would like to use Selenium to click the "Next" link so that the next paginated table of 20 results loads. I can only scrape as far as the first 20 results. Here are the steps that got me that far:

Step 1: I load the opening page.

import requests, re
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome('../chromedriver')  # path to the chromedriver executable
url = 'http://jobquest.detma.org/JobQuest/Training.aspx'
browser.get(url)

Step 2: I find the search button and click it to submit a search with no search criteria. After this code runs, the search results page loads with the first 20 records in a table:

submit_button = browser.find_element_by_id('ctl00_ctl00_bodyMainBase_bodyMain_btnSubmit')
submit_button.click()

Step 3: Now, on the search results page, I make some soup and use find_all to get the correct rows:

html = browser.page_source
soup = BeautifulSoup(html, "html.parser")

rows = soup.find_all("tr",{"class":"gvRow"})

At this point, I can fetch my data (job IDs) from the first page of results using the rows object like this:

id_list = []

for row in rows:
    temp = str(row.find("a"))[33:40]  # slice the job ID out of the <a> tag's HTML string
    id_list.append(temp)
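
The fixed [33:40] slice is brittle; a less fragile alternative (a sketch, assuming each job ID is the first run of 6+ digits inside the row's <a> tag, which is what the slice appears to capture) would be:

id_list = []

for row in rows:
    match = re.search(r"\d{6,}", str(row.find("a")))  # first run of 6+ digits in the <a> tag's HTML
    if match:
        id_list.append(match.group())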

QUESTION (Step 4): To reload the table with the next 20 results, I have to click the "Next" link on the results page. I inspected it with Chrome and got these details:

<a href="javascript:__doPostBack('ctl00$ctl00$bodyMainBase$bodyMain$egvResults$ctl01$ctl08','')">Next</a>

I need code to programmatically click Next and remake the soup with the next 20 records. I expect that if I can figure this out, I can also figure out how to loop the code to get all ~1515 IDs in the database.

UPDATE: The line that worked for me, as suggested in the answer, is:

WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[href*=ctl08]'))).click()

Thank you, this was very useful.


Solution

  • You can use an attribute = value CSS selector to target the href. In this case I match on the substring at the end via the contains (*) operator.

    WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[href*=ctl08]'))).click()
    

    I add a wait-for-clickable condition as a precaution; you could probably remove it.
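
    For instance, a minimal sketch of driving the whole result set with that line (this assumes the Next link keeps the ctl08 substring on every page and is no longer clickable on the last one):

    from selenium.common.exceptions import TimeoutException

    all_ids = []
    while True:
        # scrape the 20 rows on the current page
        soup = BeautifulSoup(browser.page_source, "html.parser")
        for row in soup.find_all("tr", {"class": "gvRow"}):
            all_ids.append(str(row.find("a"))[33:40])
        try:
            next_link = WebDriverWait(browser, 10).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, '[href*=ctl08]')))
        except TimeoutException:
            break  # no clickable Next link, so assume this was the last page
        next_link.click()
        # wait for the postback to replace the old page before re-scraping
        WebDriverWait(browser, 10).until(EC.staleness_of(next_link))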

    Additional imports

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    

    Without the wait condition:

    browser.find_element_by_css_selector('[href*=ctl08]').click()
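
    Since the href is just a JavaScript __doPostBack call, another option (a sketch that bypasses the click entirely) is to fire the postback yourself:

    browser.execute_script("__doPostBack('ctl00$ctl00$bodyMainBase$bodyMain$egvResults$ctl01$ctl08','')")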
    

    Another way:

    Instead, you could initially set the per-page results count to 100 (the max) and then loop through the pages dropdown to load each new page (then you don't need to worry about how many pages there are):

    import requests, re
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    
    browser = webdriver.Chrome()
    url ='http://jobquest.detma.org/JobQuest/Training.aspx'
    browser.get(url)
    submit_button = browser.find_element_by_id('ctl00_ctl00_bodyMainBase_bodyMain_btnSubmit')
    submit_button.click()
    WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[value="100"]'))).click()  # set the per-page results count to 100 (the max)
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("tr",{"class":"gvRow"})
    id_list=[]
    
    for row in rows:
        temp = str(row.find("a"))[33:40]
        id_list.append(temp)
    
    elems = browser.find_elements_by_css_selector('#ctl00_ctl00_bodyMainBase_bodyMain_egvResults select option')
    i = 1
    while i < len(elems) / 2:  # the options presumably appear twice (pager at top and bottom of the grid), hence the halving
        browser.find_element_by_css_selector('#ctl00_ctl00_bodyMainBase_bodyMain_egvResults select option[value="' + str(i) + '"]').click()
        # do stuff with new page
        i += 1
    

    You decide what to do with extracting the row info from each page; a sketch follows below. This was to give you an easy framework for looping over all the pages.
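
    As an illustration, the "# do stuff with new page" step could re-make the soup and extend id_list (a sketch; the fixed sleep is a crude stand-in for a proper wait on the refreshed table):

    import time

    i = 1
    while i < len(elems) / 2:
        browser.find_element_by_css_selector('#ctl00_ctl00_bodyMainBase_bodyMain_egvResults select option[value="' + str(i) + '"]').click()
        time.sleep(2)  # crude wait for the postback to refresh the results table
        soup = BeautifulSoup(browser.page_source, "html.parser")
        for row in soup.find_all("tr", {"class": "gvRow"}):
            id_list.append(str(row.find("a"))[33:40])
        i += 1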