Background
I am looking to collect statistics from the website wotstars for World of Tanks Console, an Xbox and PlayStation game.
Initially I tried using Excel's Power Query to scrape the site directly. The immediate issue was that only 5 rows of recent match data (5 games) are available on the page as it loads; a button needs to be clicked to view the last 100 matches.
This seemed like a good opportunity to learn more about web scraping with Python.
I have successfully managed to use Selenium to activate the "View More" button so that all 100 rows are shown, and I can return the table of interest.
Problem
I am currently stuck on writing the td data to a 2D array that I can then dump to a CSV file; I seem to be getting 100 columns rather than 100 rows.
I have looked at sites such as A Guide to Scraping HTML Tables with Pandas and BeautifulSoup for approaches (but this particular example doesn't show the source table).
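Writing the array out is not the issue in itself; once rows is a list of 100 row-lists, dumping it with the standard csv module should be straightforward, as in the sketch below (the file name is just a placeholder). The problem is the shape of rows.

import csv

# rows is expected to be a list of 100 lists, one per match
with open('last_100_matches.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerows(rows)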
Other
For reference, if I look in Excel, the query connection that attaches to this table is shown below.
Code
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from bs4 import BeautifulSoup
import pandas as pd

# Set up the WebDriver
driver = webdriver.Chrome()

# Famous player
url = 'https://www.wotstars.com/xbox/6757320'

# Open the player page
driver.get(url)
time.sleep(5)

# Buttons on the page share this class; print them and click by index
login_button = driver.find_elements(By.CLASS_NAME, "_button_1gcqp_2")
for login in login_button:
    print(login.text)

if login_button[3].text == 'Start tracking':
    login_button[4].click()
    print("button activated")
    time.sleep(2)

#login_button[3].click()

# Parse the expanded table into a 2D list
soup = BeautifulSoup(driver.page_source, 'html.parser')
rows = []
for child in soup.find_all('table')[3].children:
    row = []
    for td in child:
        #print(td.text)
        try:
            row.append(td.text.replace('\n\r', ''))
        except:
            continue
    if len(row) > 0:
        #print(row)
        rows.append(row)

print(len(rows[0]))

# not working
#df = pd.DataFrame(rows[1:], columns=rows[0])
#display(HTML(df.to_html()))

print("done")

# Close the browser
driver.quit()
Since you have to use Selenium for the "View more" button click anyway, you don't need BeautifulSoup.
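As an aside, the 100-column symptom in your snippet most likely comes from iterating the table's children, which yields the single tbody element; iterating over that then gives the tr rows, so each td.text you append is actually a whole row's concatenated text. If you did want to stay with BeautifulSoup, a minimal row-by-row sketch would look like this (the table index 3 is taken from your code and is not guaranteed to be stable):

table = soup.find_all('table')[3]  # same index as in your code
rows = []
for tr in table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all(['td', 'th'])]
    if cells:
        rows.append(cells)  # one list per match row, 100 in total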
Using only class names to select elements is tricky on this page, as it has multiple tables and buttons with the same class.
Instead, use XPath to find the BattleTracker section by its text; that way you can access the right table every time.
To scrape the table using pandas, you have to provide the HTML. You can get that by selecting the table you want and then passing its outerHTML to pandas.
Check the following code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import pandas as pd
# Set up the WebDriver
chrome_options = Options()
chrome_options.add_argument("--headless") # headless mode
driver = webdriver.Chrome(options=chrome_options)
wait = WebDriverWait(driver, 30)
# Famous player
url = 'https://www.wotstars.com/xbox/6757320'
# Open the player page
driver.get(url)
# wait for the view more button to be available and click
wait.until(EC.element_to_be_clickable((By.XPATH, "//div[contains(@class,'sessionWrap') and contains(.,'BattleTracker')]//button[text()='View more']"))).click()
# wait for the table to be expanded
wait.until(EC.visibility_of_element_located((By.XPATH,"//div[contains(@class,'sessionWrap') and contains(.,'BattleTracker')]//table/tbody/tr[6]")))
# getting the table which will be passed to pandas
table = driver.find_element(By.XPATH,"//div[contains(@class,'sessionWrap') and contains(.,'BattleTracker')]//table")
# parse the table using pandas
df = pd.read_html(table.get_attribute('outerHTML'))
print(df)
# Close the browser
driver.quit()
OUTPUT:
[ Tank Result Kills Dmg+A Dmg Received ... Hits Pens Spots WN8 Duration When
0 IXFraternité Char Mle. 75 Victory (survived) 2 3089 2469 0 ... 15 86.67% 2 3768 5m 1s 2h 4m ago
1 IXFraternité Char Mle. 75 Victory (destroyed) 2 3217 2649 1200 ... 16 87.50% 2 4144 5m 5s 2h 12m ago
2 IXFraternité Char Mle. 75 Victory (survived) 1 3556 3123 176 ... 17 94.12% 1 3921 3m 58s 2h 19m ago
3 IXFraternité Char Mle. 75 Victory (survived) 0 3055 2803 691 ... 16 87.50% 2 2731 4m 31s 2h 26m ago
4 IXFraternité Char Mle. 75 Victory (survived) 2 4434 4157 0 ... 22 100.00% 0 6526 5m 41s 2h 36m ago
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
95 IXObject 777 Version II Defeat (destroyed) 0 2948 1357 1850 ... 5 60.00% 3 952 4m 13s 2/3/2025, 2:30:38 AM
96 IXVK 45.02 (P) Ausf. B7 Defeat (destroyed) 1 1417 1417 1950 ... 3 100.00% 0 1344 3m 2/3/2025, 2:28:04 AM
97 IXMäuschen Victory (survived) 3 8778 6818 453 ... 31 63.33% 1 14097 5m 52s 2/3/2025, 2:17:47 AM
98 IXMäuschen Defeat (destroyed) 0 5529 3846 2150 ... 8 87.50% 1 3256 6m 5s 2/3/2025, 2:09:19 AM
99 VIIType 62 Defeat (destroyed) 0 542 542 880 ... 3 100.00% 3 691 1m 33s 2/3/2025, 2:04:01 AM
[100 rows x 16 columns]]
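Note that pd.read_html returns a list of DataFrames, so take the first element before working with it; since your end goal is a CSV file, you can write it straight out (the file name below is just a placeholder). On recent pandas versions you may also see a FutureWarning about passing literal HTML to read_html, which goes away if you wrap the string in io.StringIO first, e.g.:

from io import StringIO

# read_html returns a list; the table of interest is the first (and only) element
df = pd.read_html(StringIO(table.get_attribute('outerHTML')))[0]
df.to_csv('battletracker_last_100.csv', index=False)  # placeholder file name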