python selenium-webdriver

Python Selenium extracting href links for elements satisfying conditions


I'm learning how to use Selenium in Python with Chrome, and I'm trying to build a tool that downloads PDF files from a website. My code goes over a table, and I have been able to access each row individually. Each row is structured in the HTML source as follows:

<TR>
    <TD ALIGN="center" nowrap="nowrap">
        <span class=''>
        Main ID
        </span>
    </TD>
    <TD ALIGN="center">
        <span class=''>
        Date of upload
        </span>
    </TD>
    <td align="center">
        <span class=''>
        Time of upload
        </span>
    </td>
    <TD>
        <span class=''>
        Description 
        </span>
    </TD>
    <td align="center" id='detailElement348' class="hidden" >
        <span class=''>
        Secondary ID
        </span>
    </td>        
    <TD>
        <CENTER>
                <a title="Open main document as PDF" class='' href="url/pdf1" target="_blank" >
                </a>
        </CENTER>
    </TD>
    <TD>
        <CENTER>
                <a title="Open accessory document as PDF" class='' href="url/pdf2" target="_blank" >
                </a>
        </CENTER>
    </TD>
    <TD>
        <CENTER>
        <span class=''>
            Date of effect
            &nbsp;
        </span>
        </CENTER>
    </TD>

    <TD>
        <span class='' >Company name</span>
        &nbsp;
    </TD>

    <TD>
        <CENTER>
        <span class=''>
            Signature
            &nbsp;
        </span>
        </CENTER>
    </TD>
</TR>

The goal is to get the href links of the main-document PDFs, but only for rows whose upload date is 1.5.2025.

This is my current code:

from selenium import webdriver
from selenium.webdriver.common.by import By

PATH = "C:\\Program Files (x86)\\chromedriver.exe"
cService = webdriver.ChromeService(executable_path= PATH)
driver = webdriver.Chrome(service = cService)
href_links = []
date = "1.5.2025"
driver.get("https://mydata.com")
tableID = driver.find_element(By.CLASS_NAME,"DetailTable")
tbodies = tableID.find_elements(By.TAG_NAME,"tbody")
tr_elements = [tbody.find_elements(By.TAG_NAME,"tr") for tbody in tbodies]

for tr_element in tr_elements:
    for td_element in tr_element:
        pass

I tried doing this by looping through each tr_element as above, but that doesn't really work and seems impractical.


Solution

  • For anyone interested, I managed to make it work by looping over the tr elements of the table and inspecting the td cells of each row:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    
    PATH = "C:\\Program Files (x86)\\chromedriver.exe"
    cService = webdriver.ChromeService(executable_path= PATH)
    driver = webdriver.Chrome(service = cService)
    href_links = []
    date = "1.5.2025"
    driver.get("https://mydata.com")
    tableID = driver.find_element(By.CLASS_NAME,"DetailTable")
    # Collect every tbody of the table (find_elements returns a list)
    tbodies = tableID.find_elements(By.TAG_NAME,"tbody")
    tr_elements = [tbody.find_elements(By.TAG_NAME,"tr") for tbody in tbodies]
    
    # One list of links per tbody
    links = [[] for i in range(len(tr_elements))]
    
    for i in range(len(links)):
        for tr_element in tr_elements[i]:
            td_elements = tr_element.find_elements(By.TAG_NAME,'td')
            # The second cell holds the upload date
            temp = td_elements[1].find_element(By.TAG_NAME,"span").get_attribute("innerHTML")
            temp = temp.strip()
            if temp==date:
                # The sixth cell holds the link to the main document
                temp_link = td_elements[5].find_element(By.TAG_NAME,'center')
                temp_link = temp_link.find_element(By.TAG_NAME,"a")
                # Store the URL itself rather than the WebElement
                links[i].append(temp_link.get_attribute("href"))
    
    

    This splits each tr element into its td elements, compares the text of the td that holds the upload date against the target date, and, on a match, takes the href value from the td that holds the main-document link.
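
The same filter can probably be expressed as a single XPath query instead of nested loops. The sketch below is untested against the real site and assumes the row layout shown in the question: the upload date sits in a span in the second td, and the main-document link carries the title "Open main document as PDF". The helper names build_main_pdf_xpath and collect_main_pdf_links are made up for illustration. The locator strategy is passed as the plain string "xpath", which is the value of Selenium's By.XPATH.

```python
def build_main_pdf_xpath(upload_date: str) -> str:
    """Build an XPath, relative to the table element, that selects the
    main-document links in rows whose second cell contains upload_date."""
    # The date is embedded in a single-quoted XPath literal, so a quote
    # inside it would break the expression.
    if "'" in upload_date:
        raise ValueError("upload_date must not contain a single quote")
    return (
        # normalize-space() trims the whitespace around the span text
        f".//tr[td[2]/span[normalize-space()='{upload_date}']]"
        # assumption: the title attribute is the same on every row
        "//a[@title='Open main document as PDF']"
    )


def collect_main_pdf_links(table, upload_date: str) -> list:
    """Return the href of every matching link.

    table is assumed to be the WebElement returned by
    driver.find_element(By.CLASS_NAME, "DetailTable").
    """
    # "xpath" is the string value of selenium's By.XPATH
    anchors = table.find_elements("xpath", build_main_pdf_xpath(upload_date))
    return [anchor.get_attribute("href") for anchor in anchors]
```

With the driver from the code above, this would be called as collect_main_pdf_links(tableID, date). Matching on the title attribute instead of the cell index avoids depending on the position of the link column, but it does depend on the title text staying the same.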