Tags: python, pandas, web-scraping, beautifulsoup

How can I scrape a table from Baseball Reference using pandas and Beautiful Soup?


I am trying to scrape the pitching stats at this URL and then save the DataFrame to a CSV file.

https://www.baseball-reference.com/boxes/ARI/ARI202204070.shtml

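My current code is below (Python 3.9.7). It is an excerpt from the body of my getBoxScore function:

import pandas as pd

_URL = "https://www.baseball-reference.com/boxes/ARI/ARI202204070.shtml"
data = pd.read_html(_URL,attrs={'id': 'ArizonaDiamondbackspitching'},header=1)[0]
# index takes the boolean False; the string 'False' is truthy and would still write the index
data.to_csv('boxscore.csv', index=False)
return data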

When I run this code I get the following error:

Traceback (most recent call last):
  File "d:\BaseballAlgo\Baseball_WhoWins.py", line 205, in <module>
    getBoxScore('ARI','2022-04-07')
  File "d:\BaseballAlgo\Baseball_WhoWins.py", line 99, in getBoxScore
    data = pd.read_html(_URL,attrs={'id': 'ArizonaDiamondbackspitching'},header=1)[0]
  File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 1240, in read_html
    return _parse(
  File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 1003, in _parse
    raise retained
  File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 983, in _parse
    tables = p.parse_tables()
  File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 249, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
  File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 598, in _parse_tables
    raise ValueError("No tables found")
ValueError: No tables found

An earlier iteration of the code:

from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

session = BRefSession()  # imported elsewhere in the script (import not shown)
_URL = "https://www.baseball-reference.com/boxes/ARI/ARI202204070.shtml"
content = session.get(_URL).content
soup = BeautifulSoup(content, "html.parser")
table = soup.find_all('table', id="ArizonaDiamondbackspitching")
print(table)
data = pd.read_html(StringIO(str(table)))[0]

This code runs, but when it prints the table the output is "[]". The last line then produces the same traceback as above.

I understand what the error is saying, but I do not understand how that is possible. It seems that soup.find_all is not able to find the specific table I need, but I am not sure why. How can I fix this issue?


Solution

  • The main issue is that the table is hidden inside an HTML comment, so it has to be uncommented before BeautifulSoup, or pandas (which can use BeautifulSoup under the hood), can find it. The simplest solution, in my opinion, is to strip the comment delimiters from the response text:

    .replace('<!--','').replace('-->','')
    

    Example with requests and pandas:

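    from io import StringIO
    
    import pandas as pd
    import requests
    
    url = 'https://www.baseball-reference.com/boxes/ARI/ARI202204070.shtml'
    
    # Strip the comment delimiters so the commented-out tables become regular HTML
    html = requests.get(url).text.replace('<!--', '').replace('-->', '')
    
    # Wrap the markup in StringIO: recent pandas versions deprecate passing a literal HTML string
    df = pd.read_html(StringIO(html), attrs={'id': 'ArizonaDiamondbackspitching'})[0]
    df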
    

    See also Special Strings in the BeautifulSoup docs:

    Tag, NavigableString, and BeautifulSoup cover almost everything you’ll see in an HTML or XML file, but there are a few leftover bits. The main one you’ll probably encounter is the Comment.
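
As an alternative to replacing the delimiters in the whole page, the Comment objects mentioned in that quote can be searched directly. The following is only a minimal sketch, not a tested scraper for the real page: SAMPLE_HTML is a made-up, heavily trimmed stand-in for the box score markup (its rows are placeholders, not real stats), and find_commented_table is a hypothetical helper name. It first shows why a direct find_all returns "[]", then pulls the table out of the comment.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the page: the table exists only inside an HTML comment.
SAMPLE_HTML = """
<div>
<!--
<table id="ArizonaDiamondbackspitching">
  <tr><th>Pitching</th><th>IP</th></tr>
  <tr><td>Pitcher A</td><td>5.0</td></tr>
</table>
-->
</div>
"""


def find_commented_table(html, table_id):
    """Return the first <table> with the given id found inside an HTML comment, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Comment is a NavigableString subclass, so comments can be selected by type.
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    for comment in comments:
        # Re-parse the comment body as HTML so the tags inside it become searchable.
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


# A direct search misses the table: the parser sees one comment node, not a <table> tag.
direct = BeautifulSoup(SAMPLE_HTML, "html.parser").find_all(
    "table", id="ArizonaDiamondbackspitching"
)
print(direct)  # []

table = find_commented_table(SAMPLE_HTML, "ArizonaDiamondbackspitching")
rows = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in table.find_all("tr")
]
print(rows)  # [['Pitching', 'IP'], ['Pitcher A', '5.0']]
```

The returned tag can then be handed to pandas the same way the question's earlier attempt did, with pd.read_html(StringIO(str(table)))[0]. Compared with the global replace, this leaves genuine comments elsewhere in the page untouched, at the cost of a second parsing pass per comment.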