python selenium-webdriver web-scraping xpath geckodriver

Selenium scraping same titles, subtitles and links from The Sun Football webpage


I'm encountering a challenge while scraping news headlines, subtitles and links from The Sun Football website using Selenium. Despite implementing seemingly correct XPaths to target the desired elements (div[@class="teaser__copy-container"] for containers, span[@class="teaser__headline teaser__kicker t-p-color"] for titles, and h3[@class="teaser__subdeck"] for subtitles), I'm consistently extracting the same data for all news items.

Code Snippet

from selenium import webdriver
from selenium.webdriver.firefox.service import Service # Using Firefox service

import pandas as pd

# Website URL for news scraping
website = "https://www.thesun.co.uk/sport/football/"

# Path to the GeckoDriver executable
path = "/Users/dada/AutomationProjects/drivers/geckodriver.exe"

# Configure Firefox service with GeckoDriver path
service = Service(executable_path=path)

# Initialise Firefox WebDriver using the service
driver = webdriver.Firefox(service=service)

# Open the desired website
driver.get(website)

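# Find all teaser containers on the page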
containers = driver.find_elements(by="xpath", value='//div[@class="teaser__copy-container"]')

titles = []
subtitles = []
links = []

for container in containers:
    title = container.find_element(by="xpath", value='//div[@class="teaser__copy-container"]/a/span[@class="teaser__headline teaser__kicker t-p-color"]').get_attribute("data-original-text")
    subtitle = container.find_element(by="xpath", value='//div[@class="teaser__copy-container"]/a/h3[@class="teaser__subdeck"]').get_attribute("data-original-text")
    link = container.find_element(by="xpath", value='//div[@class="teaser__copy-container"]/a').get_attribute("href")
    titles.append(title)
    subtitles.append(subtitle)
    links.append(link)

data = {'Titles': titles, 'Subtitles': subtitles, 'Links': links}

headlines_df = pd.DataFrame(data)
print(headlines_df)

Verified XPaths: I double-checked the XPaths using the browser developer tools to confirm they target the intended elements, but the problem persists: I'm still extracting the same titles, subtitles and links despite this troubleshooting step.

Selenium version: 4.19.0 | Python version: 3.9.19 | Environment: Jupyter notebook

I'd appreciate any insights or suggestions to help me identify the root cause of this issue and successfully scrape distinct headlines, subtitles and links from The Sun Football website.


Solution

  • Yeah... so there's this "odd" thing you have to do with XPaths when you search from an existing element. Instead of

    link = container.find_element(By.XPATH, '//div[@class="teaser__copy-container"]/a')
    

    you need to add a '.' to the start of the XPath, e.g.

    link = container.find_element(By.XPATH, './/div[@class="teaser__copy-container"]/a')
                                             ^ period added here
    

    This only applies to XPaths, and only when you call .find_element() on an element rather than the driver: an XPath that starts with // always searches the whole document from the root, so every container was returning the first match on the page. The leading '.' makes the search relative to the current element. For example, driver.find_element() works fine as-is, but element.find_element() requires the '.'. That should fix your issue.
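
    To see the difference in action, here's a minimal sketch that reuses the containers list from your code (sliced to the first three items to keep the output short; it assumes each of those containers has a direct <a> child, which the filtered locator in the full code below guarantees):

    from selenium.webdriver.common.by import By

    for container in containers[:3]:
        # '//div[...]/a' ignores `container` and searches the whole document,
        # so every iteration returns the first matching <a> on the page
        absolute = container.find_element(By.XPATH, '//div[@class="teaser__copy-container"]/a').get_attribute("href")
        # the leading '.' scopes the search to this container, so each
        # iteration returns that container's own link
        relative = container.find_element(By.XPATH, './a').get_attribute("href")
        print(absolute, relative)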


    Turns out your locators for the title, etc. were not correct. I updated and simplified them. Full working code is below.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    
    import pandas as pd
    
    # Website URL for news scraping
    website = "https://www.thesun.co.uk/sport/football/"
    
    # Initialise Firefox WebDriver using the service
    driver = webdriver.Firefox()
    driver.maximize_window()
    
    # Open the desired website
    driver.get(website)
    
    containers = driver.find_elements(By.XPATH, '//div[@class="teaser__copy-container"][./a]')
    
    titles = []
    subtitles = []
    links = []
    for container in containers:
        title = container.find_element(By.CSS_SELECTOR, 'a > span').get_attribute("data-original-text")
        subtitle = container.find_element(By.CSS_SELECTOR, 'a > h3').get_attribute("data-original-text")
        link = container.find_element(By.CSS_SELECTOR, 'a').get_attribute("href")
        titles.append(title)
        subtitles.append(subtitle)
        links.append(link)
    
    data = {'Titles': titles, 'Subtitles': subtitles, 'Links': links}
    
    headlines_df = pd.DataFrame(data)
    print(headlines_df)
    

    This outputs

               Titles                                          Subtitles                                              Links
    0      OF HIS ILK   Gundogan's glam wife enters row with Barca st...  https://www.thesun.co.uk/sport/27404398/ilkay-...
    1       NICE TUCH   Bayern president hails Tuchel for 'tactical m...  https://www.thesun.co.uk/sport/27402309/bayern...
    ...
    

    Additional feedback

    1. As of Selenium 4.6, you no longer have to download, configure, and maintain your own drivers. Selenium Manager was added in that release and will automatically download and set up a driver that matches your installed browser. So your initial code can be simplified to

      from selenium import webdriver
      
      website = "https://www.thesun.co.uk/sport/football/"
      driver = webdriver.Firefox()
      driver.get(website)
      
    2. The preferred way to write a .find_element() call is

      from selenium.webdriver.common.by import By
      
      driver.find_element(By.XPATH, '//div[@class="teaser__copy-container"]/a')
      

      Your way will work, but string locator types are prone to typos, and your IDE won't catch them until the script fails at runtime. Using By.XPATH, etc. gives you autocompletion for the locator type, and any typo is flagged as an error before you run the script, saving you time.
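
      For comparison, both of the following locate the same container element (using the container XPath from above); only the second form gives your IDE something to autocomplete and check:

      # String-based locator type: a typo like "xpat" only fails at runtime
      driver.find_element(by="xpath", value='//div[@class="teaser__copy-container"]')

      # By constant: the IDE autocompletes it and flags typos before you run
      driver.find_element(By.XPATH, '//div[@class="teaser__copy-container"]')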


    Since you asked for the XPath versions: the containers are already located with an XPath, and the code below takes care of the rest.

    title = container.find_element(By.XPATH, './a/span').get_attribute("data-original-text")
    subtitle = container.find_element(By.XPATH, './a/h3').get_attribute("data-original-text")
    link = container.find_element(By.XPATH, './a').get_attribute("href")
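
    Dropped into the loop, the XPath-only version looks like this (same logic as the CSS version above, only the locators change):

    for container in containers:
        titles.append(container.find_element(By.XPATH, './a/span').get_attribute("data-original-text"))
        subtitles.append(container.find_element(By.XPATH, './a/h3').get_attribute("data-original-text"))
        links.append(container.find_element(By.XPATH, './a').get_attribute("href"))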