python, selenium-webdriver, web-scraping, xpath

XPath - select all columns but the last


I am building a web scraper with Python and Selenium that scrapes the Basketball Reference website, and I need help fine-tuning the XPath expressions that return the data I'm looking for. Specifically, I need an XPath expression that returns every column except the final one, the "awards" column, which contains text if a player won an award that year and is blank otherwise. My code runs and mostly selects what I need, but every variation of the XPath expression I've tried is either invalid or returns all the data, including the last column, which I do not need. Here is a snippet of my working code, including the Selenium driver code that retrieves every element of the table and returns it.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import pandas as pd

class PlayerPerGameStats():
    def __init__(self, player_name):
        self.player_name = player_name.lower()
        self.options = Options()

        #No popup window when called
        self.options.add_argument("--headless=new")

        #No image loading for performance
        self.options.add_experimental_option(
            "prefs", {
                "profile.managed_default_content_settings.images" : 2,
            }
        )
        self.browser = webdriver.Chrome(options=self.options)
        self.url = f"https://www.basketball-reference.com/players/{self.player_name[0]}/{self.player_name}01.html"
        self.browser.get(self.url)

        #Add wait for page load
        WebDriverWait(self.browser, 10).until(
            EC.presence_of_element_located((By.ID, 'per_game_stats'))
        )

    def get_player_row_stats(self) -> list:
        try:
            table = self.browser.find_element(By.ID, 'per_game_stats')
            rows = table.find_elements(By.XPATH, './tbody')
            stat_rows = [row.text for row in rows[0].find_elements(By.XPATH, './tr')]

            #List split to get each stat as its own index
            player_data = [y for x in stat_rows for y in x.split(' ')]

            print(player_data)

            return player_data

        except Exception as e:
            print(f"Error extracting row stats: {e}")
            return None


#To run it
stats = PlayerPerGameStats("lillada")
player_stats = stats.get_player_row_stats()

And here is a snippet of the DOM I am working with.

[Image: snippet of the Basketball Reference DOM]

Some XPath variations I have tried include:

stat_rows = [row.text for row in rows[0].find_elements(By.XPATH, './tr[position() < last()]')]
stat_rows = [row.text for row in rows[0].find_elements(By.XPATH, './tr[not(contains(@data-stat, 'awards'))]')]

However, these are not sufficient: they either return every column, including the last one, or nothing at all.
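
A note on why both attempts miss: each one puts its predicate on `tr`, so it filters rows, not cells. `./tr[position() < last()]` drops the last row of the table, and `@data-stat` is presumably an attribute of the individual cells rather than of `tr`. As written, the second expression also nests single quotes inside a single-quoted Python string, which is a syntax error. Below is a minimal sketch that moves the predicate onto the cells of one row. The constant names and the `row_values` helper are hypothetical, and the `data-stat="awards"` attribute is assumed from the attempt above rather than confirmed against the page.

```python
# XPath expressions evaluated relative to a single <tr> element.
# Every child cell (th or td) except the last one:
ALL_BUT_LAST_CELL = "./*[position() < last()]"
# Every child cell whose data-stat is not "awards" (attribute name and value
# are assumed from the question's own attempt):
ALL_BUT_AWARDS_CELL = "./*[not(@data-stat='awards')]"


def row_values(row, cell_xpath=ALL_BUT_LAST_CELL):
    """Return the text of each cell that cell_xpath selects inside one row.

    `row` is expected to be a Selenium WebElement for a <tr>. The locator
    strategy is passed as the string "xpath", which is the value of
    selenium's By.XPATH, so this helper needs no imports of its own.
    """
    return [cell.text for cell in row.find_elements("xpath", cell_xpath)]
```

In the method above this could be called as `[row_values(row) for row in rows[0].find_elements(By.XPATH, './tr')]`, which yields one list per row, so the later `split(' ')` step would no longer be needed.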

Thank you for taking the time to read this. If any additional information or code is required, I am more than happy to provide it. This problem has been on my back for weeks now, and I just want to figure out how to solve it.


Solution

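  • I changed your locators and approach. Instead of filtering with XPath, the code below collects every th and td cell in each row and slices off the last one with [:-1], which excludes the last column in each row.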

    Working code:

    from selenium import webdriver
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
    player_name = "lillada"
    url = f"https://www.basketball-reference.com/players/{player_name[0]}/{player_name}01.html"
    driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get(url)
    
    wait = WebDriverWait(driver, 10)
    for row in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "#per_game_stats tbody tr"))):
        player_data = [cell.text for cell in row.find_elements(By.CSS_SELECTOR, "th,td")[:-1]]
        print(player_data)
    

    Output

    ['2012-13', '22', 'POR', 'NBA', 'PG', '82', '82', '38.6', '6.7', '15.7', '.429', '2.3', '6.1', '.368', '4.5', '9.6', '.469', '.501', '3.3', '3.9', '.844', '0.5', '2.6', '3.1', '6.5', '0.9', '0.2', '3.0', '2.1', '19.0']
    ['2013-14', '23', 'POR', 'NBA', 'PG', '82', '82', '35.8', '6.7', '15.9', '.424', '2.7', '6.8', '.394', '4.1', '9.1', '.447', '.508', '4.5', '5.2', '.871', '0.4', '3.1', '3.5', '5.6', '0.8', '0.3', '2.4', '2.4', '20.7']
    ['2014-15', ...
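
Slicing with `[:-1]` relies on the awards column always being last. If that ever changes, a name-based filter may be more robust. The sketch below is one possible variant, not part of the original answer: `row_to_record` is a hypothetical helper, and it assumes each cell carries a `data-stat` attribute naming its column, as the question's XPath attempt suggests.

```python
def row_to_record(cells, skip=("awards",)):
    """Map each cell's data-stat name to its text, omitting skipped columns.

    `cells` is any iterable of objects that offer get_attribute() and .text,
    such as the WebElements returned by row.find_elements(...).
    """
    record = {}
    for cell in cells:
        # data-stat is assumed to identify the column of this cell
        name = cell.get_attribute("data-stat")
        if name in skip:
            continue
        record[name] = cell.text
    return record
```

Inside the loop above, this would be called as `row_to_record(row.find_elements(By.CSS_SELECTOR, "th,td"))`. Since the question already imports pandas, the resulting list of dictionaries could then be passed to `pd.DataFrame(...)` to get one labeled column per stat.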