python selenium-webdriver web-scraping webdriverwait window-handles

Web Scraping JavaScript-Rendered Content Using Selenium in Python


I am very new to web scraping and have been trying to use Selenium's functions to simulate a browser accessing the Texas public contracting webpage and then download embedded PDFs. The website is this: http://www.txsmartbuy.com/sp.

So far, I've successfully used Selenium to select an option ("Agency Name") from one of the dropdown menus and to click the search button. My Python code is below.

import os
os.chdir("/Users/fsouza/Desktop") #Setting up directory

from bs4 import BeautifulSoup #Importing pertinent Python packages
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

chromedriver = "/Users/fsouza/Desktop/chromedriver" #Setting up Chrome driver
driver = webdriver.Chrome(executable_path=chromedriver)
driver.get("http://www.txsmartbuy.com/sp")
delay = 3 #Seconds

WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.XPATH, "//select[@id='agency-name-filter']/option[68]"))) #Wait for the same option that is clicked below
health = driver.find_element_by_xpath("//select[@id='agency-name-filter']/option[68]")
health.click()
search = driver.find_element_by_id("spBtnSearch")
search.click()

Once I get to the results page, I get stuck.

First, I can't access any of the resulting links in the HTML page source. But if I manually inspect individual links in Chrome, I do find the pertinent tags (<a href...) for individual results. I'm guessing this is because the content is rendered by JavaScript.

Second, even if Selenium could see these individual tags, they have no class or id. The best way to select them, I think, would be to call the <a> tags in the order they appear on the page (see code below), but this didn't work either. Instead, the call clicks some other 'visible' tag (something in the footer, which I don't need).

Third, assuming these things did work, how can I figure out the number of <a> tags showing on the page (in order to loop this code over and over for every single result)?

driver.execute_script("document.getElementsByTagName('a')[27].click()")

I would appreciate your attention to this, and please excuse any stupidity on my part, considering that I'm just starting out.


Solution

  • To scrape the JavaScript-rendered content using Selenium you need to induce WebDriverWait for the visibility of the result links before trying to locate them. The wait returns the matching elements as a list, so the length of that list tells you how many result links are on the page, and collecting their href attributes avoids indexing anonymous <a> tags by position.
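    A minimal sketch of that step. The XPath for the result links (and the container id sp-search-results) is an assumption; inspect the rendered results page and adjust it to the actual DOM:

    ```python
    def unique_hrefs(hrefs):
        """Drop empty, javascript: and duplicate links, preserving order."""
        seen = set()
        out = []
        for href in hrefs:
            if href and not href.startswith("javascript:") and href not in seen:
                seen.add(href)
                out.append(href)
        return out

    def scrape_result_links(driver, timeout=10):
        from selenium.webdriver.common.by import By
        from selenium.webdriver.support.ui import WebDriverWait
        from selenium.webdriver.support import expected_conditions as EC

        # Block until the JavaScript-rendered anchors are actually visible.
        links = WebDriverWait(driver, timeout).until(
            EC.visibility_of_all_elements_located(
                (By.XPATH, "//div[@id='sp-search-results']//a")  # assumed locator
            )
        )
        print(len(links), "result links found")  # answers "how many <a> tags?"
        # Read the hrefs before navigating: clicking a link leaves the
        # results page and makes the remaining elements stale.
        return unique_hrefs(link.get_attribute("href") for link in links)
    ```

    Collecting the hrefs up front and calling driver.get() on each one is usually more robust than clicking the n-th anchor by index, which is what made the footer link fire in your attempt.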

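    If you prefer to keep the results page open, each collected href can instead be opened in a new tab and handled through window handles, switching back after each one. This is only a hedged sketch; the PDF-downloading step itself is left as a placeholder:

    ```python
    def new_handle(before, after):
        """Return the single window handle that appeared between two snapshots."""
        opened = set(after) - set(before)
        if len(opened) != 1:
            raise RuntimeError("expected exactly one new window, got %d" % len(opened))
        return opened.pop()

    def visit_each(driver, hrefs, timeout=10):
        from selenium.webdriver.support.ui import WebDriverWait

        original = driver.current_window_handle
        for href in hrefs:
            before = driver.window_handles
            driver.execute_script("window.open(arguments[0]);", href)  # open in new tab
            WebDriverWait(driver, timeout).until(
                lambda d: len(d.window_handles) > len(before)
            )
            driver.switch_to.window(new_handle(before, driver.window_handles))
            # ... locate and download the embedded PDF here ...
            driver.close()
            driver.switch_to.window(original)
    ```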
