I'm encountering a challenge while scraping news headlines, subtitles and links from The Sun Football website using Selenium. Despite implementing seemingly correct XPaths to target the desired elements (div[@class="teaser__copy-container"] for containers, span[@class="teaser__headline teaser__kicker t-p-color"] for titles, and h3[@class="teaser__subdeck"] for subtitles), I'm consistently extracting the same data for all news items.
Code Snippet
from selenium import webdriver
from selenium.webdriver.firefox.service import Service # Using Firefox service
import pandas as pd
# Website URL for news scraping
website = "https://www.thesun.co.uk/sport/football/"
# Path to the GeckoDriver executable
path = "/Users/dada/AutomationProjects/drivers/geckodriver.exe"
# Configure Firefox service with GeckoDriver path
service = Service(executable_path=path)
# Initialise Firefox WebDriver using the service
driver = webdriver.Firefox(service=service)
# Open the desired website
driver.get(website)
containers = driver.find_elements(by="xpath", value='//div[@class="teaser__copy-container"]')
titles = []
subtitles = []
links = []
for container in containers:
    title = container.find_element(by="xpath", value='//div[@class="teaser__copy-container"]/a/span[@class="teaser__headline teaser__kicker t-p-color"]').get_attribute("data-original-text")
    subtitle = container.find_element(by="xpath", value='//div[@class="teaser__copy-container"]/a/h3[@class="teaser__subdeck"]').get_attribute("data-original-text")
    link = container.find_element(by="xpath", value='//div[@class="teaser__copy-container"]/a').get_attribute("href")
    titles.append(title)
    subtitles.append(subtitle)
    links.append(link)
dict = {'Titles' : titles, 'Subtitles' : subtitles, 'Links' : links}
headlines_df = pd.DataFrame(dict)
print(headlines_df)
Verified XPaths: I double-checked the XPaths using the browser developer tools to confirm they target the intended elements, but the problem persists: I'm still extracting the same titles, subtitles and links.
Selenium version: 4.19.0 | Python version: 3.9.19 | Environment: Jupyter notebook
I'd appreciate any insights or suggestions to help me identify the root cause of this issue and successfully scrape distinct headlines, subtitles and links from The Sun Football website.
Yeah... so there's this "odd" thing you have to do with XPaths when you search from an existing element. Instead of
link = container.find_element(By.XPATH, '//div[@class="teaser__copy-container"]/a')
you need to add a '.' to the start of the XPath, e.g.
link = container.find_element(By.XPATH, './/div[@class="teaser__copy-container"]/a')
^ period added here
This only applies to XPaths and only when you use .find_element() from an element. For example, driver.find_element() works fine, but element.find_element() requires the '.'. That should fix your issues.
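To see the difference concretely, here's a minimal sketch (assuming the teaser containers are present on the page): an XPath that starts with // always searches the whole document, even when called on an element, while ./ or .// restricts the search to that element's subtree.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://www.thesun.co.uk/sport/football/")

# Grab one teaser container to demonstrate the difference.
container = driver.find_elements(By.XPATH, '//div[@class="teaser__copy-container"]')[0]

# '//a' ignores the container and matches the first <a> anywhere on the page...
page_link = container.find_element(By.XPATH, '//a').get_attribute("href")

# ...while './a' only looks at this container's own <a> child.
container_link = container.find_element(By.XPATH, './a').get_attribute("href")

print(page_link)
print(container_link)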
Turns out your locators for the title, etc. were not correct. I updated and simplified them. Full working code is below.
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
# Website URL for news scraping
website = "https://www.thesun.co.uk/sport/football/"
# Initialise Firefox WebDriver using the service
driver = webdriver.Firefox()
driver.maximize_window()
# Open the desired website
driver.get(website)
containers = driver.find_elements(By.XPATH, '//div[@class="teaser__copy-container"][./a]')
titles = []
subtitles = []
links = []
for container in containers:
    title = container.find_element(By.CSS_SELECTOR, 'a > span').get_attribute("data-original-text")
    subtitle = container.find_element(By.CSS_SELECTOR, 'a > h3').get_attribute("data-original-text")
    link = container.find_element(By.CSS_SELECTOR, 'a').get_attribute("href")
    titles.append(title)
    subtitles.append(subtitle)
    links.append(link)
dict = {'Titles' : titles, 'Subtitles' : subtitles, 'Links' : links}
headlines_df = pd.DataFrame(dict)
print(headlines_df)
This outputs
Titles Subtitles Links
0 OF HIS ILK Gundogan's glam wife enters row with Barca st... https://www.thesun.co.uk/sport/27404398/ilkay-...
1 NICE TUCH Bayern president hails Tuchel for 'tactical m... https://www.thesun.co.uk/sport/27402309/bayern...
...
Additional feedback
As of Selenium 4.6, you no longer have to download, configure, and maintain your own drivers. Selenium Manager was added, and it will download and set up a driver that matches your installed browser automatically. So your initial code can be simplified to
from selenium import webdriver
website = "https://www.thesun.co.uk/sport/football/"
driver = webdriver.Firefox()
driver.get(website)
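If you do still need to point Selenium at a specific geckodriver binary (for example on a machine where Selenium Manager can't download one), the explicit Service from your original code remains supported; this is just your own snippet kept for reference:
from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# Only needed when you want to pin a specific geckodriver binary yourself.
service = Service(executable_path="/Users/dada/AutomationProjects/drivers/geckodriver.exe")
driver = webdriver.Firefox(service=service)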
The preferred way to write a .find_element() call is
from selenium.webdriver.common.by import By
driver.find_element(By.XPATH, '//div[@class="teaser__copy-container"]/a')
Your way will work, but it's prone to typos, and your IDE won't know about them until you run the script and it fails. Using By.XPATH, etc. avoids typos in the locator type, and your IDE will autocomplete it; if there is a typo, the IDE flags it as an error before you run anything, saving you time.
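To make the comparison concrete, here are the two equivalent forms side by side (the locator is taken from your code; By.XPATH is literally the string "xpath" under the hood):
from selenium.webdriver.common.by import By

# String-based locator type: works, but a typo like "xpah" only fails at runtime.
link = driver.find_element(by="xpath", value='//div[@class="teaser__copy-container"]/a')

# By constant: the IDE autocompletes By.XPATH and flags a misspelled constant before the script runs.
link = driver.find_element(By.XPATH, '//div[@class="teaser__copy-container"]/a')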
Since you asked for XPaths: the containers elements are already found with an XPath, and the code below takes care of the rest.
title = container.find_element(By.XPATH, './a/span').get_attribute("data-original-text")
subtitle = container.find_element(By.XPATH, './a/h3').get_attribute("data-original-text")
link = container.find_element(By.XPATH, './a').get_attribute("href")
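For completeness, this is the loop from the working code above with the CSS selectors swapped for these relative XPaths; everything else stays the same:
for container in containers:
    title = container.find_element(By.XPATH, './a/span').get_attribute("data-original-text")
    subtitle = container.find_element(By.XPATH, './a/h3').get_attribute("data-original-text")
    link = container.find_element(By.XPATH, './a').get_attribute("href")
    titles.append(title)
    subtitles.append(subtitle)
    links.append(link)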