python · html · pandas · web-scraping · beautifulsoup

How to scrape data from within a comment block and create a dataframe?


I am trying to pull HTML data from baseball-reference.com. I assumed that if I went to their website and viewed the page source, the HTML tags I need would be in the HTML itself. After further investigation, however, the set of tags I care about is inside comment blocks.

Example: https://www.baseball-reference.com/leagues/AL/2021-standard-batting.shtml. Viewing the page source, find the tag:

<div class="table_container" id="div_players_standard_batting">

The code I am looking for is below this line, and if you look above it, you will see that a comment block starts with <!-- and doesn't end until almost the end of the HTML file.
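To illustrate the structure (a minimal, made-up snippet, not the actual page source): because the table markup sits inside <!-- ... -->, BeautifulSoup treats the whole block as a single Comment string, so a normal find() for the div returns nothing.

from bs4 import BeautifulSoup

# Illustrative stand-in for the structure described above,
# not the real baseball-reference markup.
html = """
<!--
<div class="table_container" id="div_players_standard_batting">
  <table><tr><td>...</td></tr></table>
</div>
-->
"""
soup = BeautifulSoup(html, "html.parser")

# The div lives inside a comment, so it is invisible to normal tag searches:
print(soup.find("div", id="div_players_standard_batting"))  # None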

I can pull the HTML comments with the following code, but it comes with a few issues.

  1. It is in a list, and I only care about the element that has the data.
  2. It comes with newline characters.
  3. I am struggling with how to take the players' standard batting string and reparse it as HTML so that I can use BeautifulSoup to grab the data I want.

Code:

from bs4 import BeautifulSoup, Comment
import pandas as pd
import requests

r = requests.get("https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml")
soup = BeautifulSoup(r.content, "html.parser")  # try lxml

# Extract every comment node in the document into a list
Data = [x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment))]
Data
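A quick sketch of how that list could be narrowed down (assuming the id quoted above appears verbatim inside the comment text):

# Keep only the comment(s) whose text contains the table's id;
# on this page exactly one comment should match.
matches = [c for c in Data if 'id="div_players_standard_batting"' in c]
print(len(Data), len(matches))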

Current Environment Settings:

dependencies:
  - python=3.9.7
  - beautifulsoup4=4.11.1
  - jupyterlab=3.3.2
  - pandas=1.4.2
  - pyodbc=4.0.32

The end goal: a pandas DataFrame containing each player's data from this web page.

EDIT:

ANSWER:

Changes made to reach my goal: I installed the lxml package into my environment via the Anaconda Prompt, then used the following line of code to pull the HTML data into a dataframe (provided by HedgeHog - thank you!):

pd.read_html([x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0])[0]

Solution

  • You are on the right track; you just have to put the individual parts together.

    In the ResultSet there should be only one element containing id="div_players_standard_batting", so filter for it and pass that element to pandas.read_html() to transform it into a DataFrame:

    pd.read_html([x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0])[0]
    

    or, as an alternative, create a new BeautifulSoup object and iterate over the table rows:

    soup = BeautifulSoup([x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0], "html.parser")
    for row in soup.select('table tr'):
        ...
    

    Output:

    Rk Name Age Tm Lg G PA AB R H 2B 3B HR RBI SB CS BB SO BA OBP SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
    0 1 Fernando Abad* 35 BAL AL 2 0 0 0 0 0 0 0 0 0 0 0 0 nan nan nan nan nan 0 0 0 0 0 0 1
    1 2 Cory Abbott 25 CHC NL 8 3 3 0 1 0 0 0 0 0 0 0 1 0.333 0.333 0.333 0.667 81 1 0 0 0 0 0 /1H
    2 3 Albert Abreu 25 NYY AL 3 0 0 0 0 0 0 0 0 0 0 0 0 nan nan nan nan nan 0 0 0 0 0 0 1
    3 4 Bryan Abreu 24 HOU AL 1 0 0 0 0 0 0 0 0 0 0 0 0 nan nan nan nan nan 0 0 0 0 0 0 1
    4 5 José Abreu 34 CHW AL 152 659 566 86 148 30 2 30 117 1 0 61 143 0.261 0.351 0.481 0.831 125 272 28 22 0 10 3 *3D/5
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    1787 1720 Bruce Zimmermann* 26 BAL AL 2 4 4 0 0 0 0 0 0 0 0 0 3 0 0 0 0 -100 0 0 0 0 0 0 1
    1788 1721 Jordan Zimmermann 35 MIL NL 2 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 -100 0 0 0 0 0 0 /1
    1789 1722 Tyler Zuber 26 KCR AL 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 -100 0 0 0 0 0 0 1
    1790 1723 Mike Zunino 30 TBR AL 109 375 333 64 72 11 2 33 62 0 0 34 132 0.216 0.301 0.559 0.86 137 186 7 7 0 1 0 2/H
    1791 nan LgAvg per 600 PA nan nan nan 205 600 535 73 130 26 2 20 69 7 2 52 139 0.243 0.316 0.41 0.726 nan 219 11 7 2 4 2 nan

    EDIT

    To get rid of unwanted rows (the repeated header rows and the league-average footer), exclude rows whose Rk column is NaN or the literal string 'Rk':

    df1 = df1[(~df1.Rk.isna()) & (df1.Rk != 'Rk')]
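
    Putting everything together, here is a sketch of the full pipeline (same URL and element id as above; the dtype conversion at the end is an extra assumption, since the repeated header rows leave the numeric columns as object dtype):

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup, Comment

    url = "https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml"
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # Find the single comment that wraps the batting table and feed it to pandas.
    comment = next(c for c in soup.find_all(string=lambda t: isinstance(t, Comment))
                   if 'id="div_players_standard_batting"' in c)
    df1 = pd.read_html(str(comment))[0]  # read_html returns a list of tables; take the first

    # Drop the repeated header rows and the league-average footer row.
    df1 = df1[(~df1.Rk.isna()) & (df1.Rk != 'Rk')].reset_index(drop=True)

    # Optional (an assumption, not part of the answer above): coerce numeric
    # columns back to numbers, skipping the text columns by name
    # (column names taken from the output shown above).
    for col in df1.columns.drop(['Name', 'Tm', 'Lg', 'Pos Summary']):
        df1[col] = pd.to_numeric(df1[col], errors='coerce')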