python, web-scraping

Problem scraping table row data into an array


Background

I am looking to collect statistics from the website wotstars for an Xbox and PlayStation game, World of Tanks Console.

Initially I tried just using Excel to scrape the site for me via Power Query. The immediate issue was that only 5 rows of recent match data (5 games) are available as the page loads; a button needs to be clicked to view the last 100 matches.

(screenshot: example of the "View more" button)

I thought this would be a good opportunity to learn more about web scraping with Python.

I have successfully managed to use Selenium to activate the "View more" button so that all 100 rows are shown, and I can return the table of interest.

Problem

I am currently stuck on writing the td data to a 2D array that I can then dump to a CSV file; I seem to be writing 100 columns rather than 100 rows.
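
For illustration, the structure I am after is one inner list per match, which I could then write out with the csv module, something like this (placeholder values taken from the table):

import csv

# target shape: one inner list per table row (placeholder values)
rows = [
    ['Tank', 'Result', 'Kills'],        # header row
    ['Char Mle. 75', 'Victory', '2'],   # one match per row
]

with open('matches.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)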

I have looked at articles such as A Guide to Scraping HTML Tables with Pandas and BeautifulSoup for approaches (though that particular example doesn't show the source table).

Other

For comparison, the Excel query connection that attaches to this table is shown below:

(screenshot: Excel data)

Code

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from bs4 import BeautifulSoup
import pandas as pd

# Set up the WebDriver
driver = webdriver.Chrome()
# Famous player
url = 'https://www.wotstars.com/xbox/6757320'

# Open the player page
driver.get(url)
time.sleep(5)

login_button = driver.find_elements(By.CLASS_NAME, "_button_1gcqp_2")
for login in login_button:
    print (login.text)

if login_button[3].text == 'Start tracking':
    login_button[4].click()

print("button activated")

time.sleep(2)
#login_button[3].click()

soup = BeautifulSoup(driver.page_source, 'html.parser')

rows = []
for child in soup.find_all('table')[3].children:
    row = []
    for td in child:
        #print (td.text)
        try:
            row.append(td.text.replace('\n\r', ''))
        except:
            continue
        if len(row) > 0:
            #print (row)
            rows.append(row)
print (len(rows[0]))

#not working
#df = pd.DataFrame(rows[1:], columns=rows[0])
#display(HTML(df.to_html()))
        
print("done")

# Close the browser
driver.quit()

(screenshot: Object Inspector)


Solution
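
  • First, the likely reason the original loop writes "100 columns rather than 100 rows": soup.find_all('table')[3].children yields the table's direct children (probably just a single <tbody>), not its rows, so the inner for td in child loop actually iterates over the 100 <tr> elements of that <tbody> and appends each row's flattened text to one long row list. On top of that, rows.append(row) sits inside the inner loop, so rows ends up holding repeated references to that same 100-element list. A minimal sketch of a corrected BeautifulSoup loop, iterating <tr> elements and appending once per row:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    table = soup.find_all('table')[3]  # same table index as the original code

    rows = []
    for tr in table.find_all('tr'):  # iterate the actual <tr> rows
        # one entry per cell, with surrounding whitespace stripped
        row = [cell.get_text(strip=True) for cell in tr.find_all(['td', 'th'])]
        if row:
            rows.append(row)  # append once per row, outside the cell loop

    With that fix, the commented-out pd.DataFrame(rows[1:], columns=rows[0]) should behave as intended.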

  • Since you have to use Selenium to click the "View more" button anyway, you don't need BeautifulSoup.

    Selecting elements by class name alone is tricky on this page, as it has multiple tables and buttons sharing the same classes.

    Instead, use XPath to find the BattleTracker section by its text. That way you can access the right table every time.

    To scrape the table using pandas, you have to provide the HTML. You can get that by selecting the table you want and passing its outerHTML to pandas.

    Check the following code:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    import pandas as pd
    
    # Set up the WebDriver
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # headless mode
    driver = webdriver.Chrome(options=chrome_options)
    wait = WebDriverWait(driver, 30)
    
    # Famous player
    url = 'https://www.wotstars.com/xbox/6757320'
    
    # Open the player page
    driver.get(url)
    
    # wait for the view more button to be available and click
    wait.until(EC.element_to_be_clickable((By.XPATH, "//div[contains(@class,'sessionWrap') and contains(.,'BattleTracker')]//button[text()='View more']"))).click()
    
    # wait for the table to be expanded
    wait.until(EC.visibility_of_element_located((By.XPATH,"//div[contains(@class,'sessionWrap') and contains(.,'BattleTracker')]//table/tbody/tr[6]")))
    
    # getting the table which will be passed to pandas
    table = driver.find_element(By.XPATH,"//div[contains(@class,'sessionWrap') and contains(.,'BattleTracker')]//table")
    
    # parse the table using pandas
    df = pd.read_html(table.get_attribute('outerHTML'))
    
    print(df)
    
    # Close the browser
    driver.quit()
    

    OUTPUT:

    [                         Tank               Result  Kills  Dmg+A   Dmg  Received  ...  Hits     Pens  Spots    WN8  Duration                  When  
    0   IXFraternité Char Mle. 75   Victory (survived)      2   3089  2469         0  ...    15   86.67%      2   3768     5m 1s             2h 4m ago   
    1   IXFraternité Char Mle. 75  Victory (destroyed)      2   3217  2649      1200  ...    16   87.50%      2   4144     5m 5s            2h 12m ago   
    2   IXFraternité Char Mle. 75   Victory (survived)      1   3556  3123       176  ...    17   94.12%      1   3921    3m 58s            2h 19m ago   
    3   IXFraternité Char Mle. 75   Victory (survived)      0   3055  2803       691  ...    16   87.50%      2   2731    4m 31s            2h 26m ago   
    4   IXFraternité Char Mle. 75   Victory (survived)      2   4434  4157         0  ...    22  100.00%      0   6526    5m 41s            2h 36m ago   
    ..                        ...                  ...    ...    ...   ...       ...  ...   ...      ...    ...    ...       ...                   ...   
    95    IXObject 777 Version II   Defeat (destroyed)      0   2948  1357      1850  ...     5   60.00%      3    952    4m 13s  2/3/2025, 2:30:38 AM   
    96    IXVK 45.02 (P) Ausf. B7   Defeat (destroyed)      1   1417  1417      1950  ...     3  100.00%      0   1344        3m  2/3/2025, 2:28:04 AM   
    97                 IXMäuschen   Victory (survived)      3   8778  6818       453  ...    31   63.33%      1  14097    5m 52s  2/3/2025, 2:17:47 AM   
    98                 IXMäuschen   Defeat (destroyed)      0   5529  3846      2150  ...     8   87.50%      1   3256     6m 5s  2/3/2025, 2:09:19 AM   
    99                 VIIType 62   Defeat (destroyed)      0    542   542       880  ...     3  100.00%      3    691    1m 33s  2/3/2025, 2:04:01 AM   
    
    [100 rows x 16 columns]]
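
    Note that pd.read_html returns a list of DataFrames (hence the enclosing brackets in the output above), so to get the single table and dump it to CSV, take the first element. A minimal sketch, reusing the table element from the code above (the filename is just an example); newer pandas versions also deprecate passing a raw HTML string, so wrapping it in StringIO is the safer call:

    from io import StringIO

    # read_html returns a list of DataFrames; take the first (and only) one
    df = pd.read_html(StringIO(table.get_attribute('outerHTML')))[0]
    df.to_csv('matches.csv', index=False)  # example filename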