I want to extract a link that is nested at the XPath /html/body/div[1]/div[2]/div[1]/div/div/div/div/div/a (see also the image showing the detailed nesting).
In case it helps, these div elements also have classes.
I tried:
from selenium import webdriver
from bs4 import BeautifulSoup
browser=webdriver.Chrome()
browser.get('https://www.visionias.in/resources/daily_current_affairs_programs.php?type=1&m=05&y=2024')
soup=BeautifulSoup(browser.page_source)
element = soup.find_element_by_xpath("./html/body/div[1]/div[2]/div[1]/div/div/div/div/div/a")
href = element.get_attribute('href')
print(href)
This code gave the error:
line 9, in <module>
element = soup.find_element_by_xpath("./html/body/div[1]/div[2]/div[1]/div/div/div/div/div/a")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable
I also tried another approach:
from selenium import webdriver
from bs4 import BeautifulSoup
browser=webdriver.Chrome()
browser.get('https://www.visionias.in/resources/daily_current_affairs_programs.php?type=1&m=05&y=2024')
soup=BeautifulSoup(browser.page_source)
href = soup('a')('div')[1]('div')[2]('div')[1]('div')[0]('div')[0]('div')[0]('div')[0]('div')[0][href]
#href = element.get_attribute('href')
print(href)
This gave the error:
href = soup('a')('div')[1]('div')[2]('div')[1]('div')[0]('div')[0]('div')[0]('div')[0]('div')[0][href]
^^^^^^^^^^^^^^^^
TypeError: 'ResultSet' object is not callable
The expected output is https://www.visionias.in/resources/material/?id=3731&type=daily_current_affairs, or the relative form material/?id=3731&type=daily_current_affairs.
Some other links have the same kind of nesting as above. Is there a way to filter the links using the text inside /html/body/div[1]/div[2]/div[1]/div/div/p? For example, the text here is 18 May 2024. This p tag also has an id, but it is not consistent and follows no pattern, so it is not usable for me.
I have seen other answers on Stack Overflow, but they are not working for me.
If possible, please elaborate in the answer, as I have to apply the same code to some other sites as well.
Use the Selenium code below to extract all the links and print them to the console:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.maximize_window()
driver.get("https://www.visionias.in/resources/daily_current_affairs_programs.php?type=1&m=05&y=2024")
wait = WebDriverWait(driver, 10)
links = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='center']//a")))
for link in links:
    print(link.get_attribute("href"))
Console output:
https://www.visionias.in/resources/material?id=3731&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3729&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3727&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3723&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3717&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3715&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3705&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3703&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3701&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3699&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3690&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3688&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3687&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3684&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3682&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3676&type=daily_current_affairs
Process finished with exit code 0
SUGGESTION: I highly recommend reading up on absolute and relative XPaths, and on the advantages of relative XPaths over absolute ones.
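To make the absolute-versus-relative point concrete, here is a minimal sketch using only Python's standard library. The markup, the `/doc-1` style hrefs, and the helper name `hrefs` are made up for illustration and are not taken from the real page. `xml.etree.ElementTree` only supports a subset of XPath and needs well-formed markup, so this is a toy model of the idea, not a replacement for the Selenium code above.

```python
import xml.etree.ElementTree as ET

# Toy, well-formed markup (hypothetical; not the real page).
PAGE_V1 = """
<html><body>
  <div class="header"><a href="/home">Home</a></div>
  <div class="center">
    <div><div><a href="/doc-1">Doc 1</a></div></div>
    <div><div><a href="/doc-2">Doc 2</a></div></div>
  </div>
</body></html>
"""

# The same page after the site adds a banner div at the top of <body>.
PAGE_V2 = PAGE_V1.replace("<body>", '<body><div class="banner">Notice</div>')

# Position-based path, in the spirit of an absolute XPath.
POSITIONAL = "./body/div[2]/div/div/a"
# Attribute-anchored path, in the spirit of a relative XPath.
ANCHORED = ".//div[@class='center']//a"


def hrefs(markup, path):
    """Return the href of every <a> element matched by path."""
    root = ET.fromstring(markup)  # root is the <html> element
    return [a.get("href") for a in root.findall(path)]


if __name__ == "__main__":
    for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
        print(name, "positional:", hrefs(page, POSITIONAL))
        print(name, "anchored:  ", hrefs(page, ANCHORED))
```

On the first version both paths find the two document links. Once the banner div shifts the positions, the positional path matches nothing, while the path anchored on the class still finds both links. That is the main reason to prefer relative XPaths when the same code has to survive layout changes or be reused on other sites.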
UPDATE: Use the code below if you want to extract the link for a specific date.
link = wait.until(EC.visibility_of_element_located((By.XPATH, "//p[contains(text(),'18 May 2024')]//following::a[1]")))
print(link.get_attribute("href"))
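If it helps to see what `following::a[1]` is doing, the sketch below mimics it with the standard library: walk the elements in document order, wait for a p whose text contains the date, then return the first a that comes after it. The markup, ids, and hrefs are invented for the example, and `first_link_after` is a hypothetical helper name. It is only a rough equivalent, because the real `following` axis also skips descendants of the p itself.

```python
import xml.etree.ElementTree as ET

# Toy, well-formed markup (hypothetical). The ids are deliberately
# patternless, as described in the question.
PAGE = """
<html><body>
  <div class="center">
    <div><p id="a91">17 May 2024</p><div><a href="/doc-17">Open</a></div></div>
    <div><p id="zz3">18 May 2024</p><div><a href="/doc-18">Open</a></div></div>
  </div>
</body></html>
"""


def first_link_after(markup, date_text):
    """Return the href of the first <a> after a <p> containing date_text."""
    root = ET.fromstring(markup)
    seen_date = False
    for el in root.iter():  # iter() yields elements in document order
        if el.tag == "p" and date_text in (el.text or ""):
            seen_date = True
        elif seen_date and el.tag == "a":
            return el.get("href")
    return None  # no matching date, or no link after it


if __name__ == "__main__":
    print(first_link_after(PAGE, "18 May 2024"))
```

Because the lookup keys on the visible date text and not on the id of the p tag, it does not matter that the ids are inconsistent. The same idea carries over to other sites as long as each link follows its label in the page source.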