I am trying to scrape the pitching stats from the URL below and then save the resulting DataFrame to a CSV file.
https://www.baseball-reference.com/boxes/ARI/ARI202204070.shtml
My current code is below (Python 3.9.7):
_URL = "https://www.baseball-reference.com/boxes/ARI/ARI202204070.shtml"
data = pd.read_html(_URL,attrs={'id': 'ArizonaDiamondbackspitching'},header=1)[0]
data.to_csv('boxscore.csv', index=False)
return data
When I run this code I get the following error:
Traceback (most recent call last):
File "d:\BaseballAlgo\Baseball_WhoWins.py", line 205, in <module>
getBoxScore('ARI','2022-04-07')
File "d:\BaseballAlgo\Baseball_WhoWins.py", line 99, in getBoxScore
data = pd.read_html(_URL,attrs={'id': 'ArizonaDiamondbackspitching'},header=1)[0]
File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 1240, in read_html
return _parse(
File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 1003, in _parse
raise retained
File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 983, in _parse
tables = p.parse_tables()
File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 249, in parse_tables
tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 598, in _parse_tables
raise ValueError("No tables found")
ValueError: No tables found
Past iterations of code:
session = BRefSession()
_URL = "https://www.baseball-reference.com/boxes/ARI/ARI202204070.shtml"
content = session.get(_URL).content
soup = BeautifulSoup(content, "html.parser")
table = soup.find_all('table', id="ArizonaDiamondbackspitching")
print(table)
data = pd.read_html(StringIO(str(table)))[0]
This code runs, and when it prints the table the output is "[]". The last line then raises the same traceback shown above.
I understand what the error is saying, but I do not understand how that is possible. It seems that soup.find_all is unable to find the specific table I need, and I am not sure why. How can I fix this issue?
The main issue is that the table is hidden inside an HTML comment, so you have to uncomment it before BeautifulSoup, or pandas (which relies on an HTML parser such as lxml or BeautifulSoup under the hood), can find it. The simplest solution, in my opinion, is to strip the comment markers from the response text:
.replace('<!--','').replace('-->','')
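To see why this replacement makes the table visible, here is a minimal sketch that uses only the standard-library html.parser. The SAMPLE_HTML snippet, the TableCollector class, and the table_ids helper are made up for illustration; the real page is much larger, but the idea should be the same: markup inside a comment is never reported as a tag.

```python
from html.parser import HTMLParser

# Made-up miniature of the page: the table sits inside an HTML comment.
SAMPLE_HTML = (
    '<div id="wrapper"><!--\n'
    '<table id="ArizonaDiamondbackspitching">'
    '<tr><td>5.0</td></tr></table>\n'
    '--></div>'
)


class TableCollector(HTMLParser):
    """Collects the id attribute of every <table> start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.ids.append(dict(attrs).get("id"))


def table_ids(html):
    """Return the ids of all tables that are real markup, not comment text."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    return parser.ids


# Parsed as-is, the commented-out table is just comment text.
raw_ids = table_ids(SAMPLE_HTML)
# After stripping the comment markers, the table becomes real markup.
uncommented_ids = table_ids(SAMPLE_HTML.replace('<!--', '').replace('-->', ''))

print(raw_ids)
print(uncommented_ids)
```

The first list is empty, which mirrors the "[]" and "No tables found" results in the question; the second contains the table id.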
Example with requests and pandas:
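from io import StringIO

import pandas as pd
import requests

url = 'https://www.baseball-reference.com/boxes/ARI/ARI202204070.shtml'
# Strip the comment markers so the commented-out tables become real markup
html = requests.get(url).text.replace('<!--', '').replace('-->', '')
# Wrap the text in StringIO: newer pandas versions deprecate passing literal HTML strings
df = pd.read_html(StringIO(html), attrs={'id': 'ArizonaDiamondbackspitching'})[0]
df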
See also Special Strings in the BeautifulSoup docs:
Tag, NavigableString, and BeautifulSoup cover almost everything you’ll see in an HTML or XML file, but there are a few leftover bits. The main one you’ll probably encounter is the Comment.
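As an alternative to the string replacement, the Comment objects mentioned in that passage can be searched directly. The following is only a sketch under assumptions: SAMPLE_HTML is a made-up stand-in for the real page, and find_commented_table is a hypothetical helper name, not something provided by BeautifulSoup.

```python
from bs4 import BeautifulSoup, Comment

# Made-up miniature of the page: the table sits inside an HTML comment.
SAMPLE_HTML = """
<div id="all_pitching">
<!--
<table id="ArizonaDiamondbackspitching">
<tr><th>Pitching</th><th>IP</th></tr>
<tr><td>Pitcher A</td><td>5.0</td></tr>
</table>
-->
</div>
"""


def find_commented_table(html, table_id):
    """Return the <table> with the given id if it is hidden in a comment, else None."""
    soup = BeautifulSoup(html, "html.parser")
    # Comment is a NavigableString subclass, so it can be matched via string=
    comments = soup.find_all(string=lambda s: isinstance(s, Comment))
    for comment in comments:
        if table_id not in comment:
            continue
        # Parse the comment text as its own small document
        inner = BeautifulSoup(str(comment), "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


table = find_commented_table(SAMPLE_HTML, "ArizonaDiamondbackspitching")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(rows)
```

This touches only the comment that holds the wanted table, which may be preferable when other comments on the page should stay untouched. The returned tag can be passed on to pandas as str(table) wrapped in StringIO.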