python · html · pandas · web-scraping · beautifulsoup

How to scrape data from within a comment block and create a dataframe?


I am trying to pull HTML data from baseball-reference.com. I assumed that if I went to their website and viewed the page source, the HTML tags I need would be in the HTML itself. After further investigation, however, the set of tags I care about is inside comment blocks.

Example: https://www.baseball-reference.com/leagues/AL/2021-standard-batting.shtml. Viewing the page source, find the tag:

<div class="table_container" id="div_players_standard_batting">

The code I am looking for is below this line, and if you look above it, you will see that a comment block starts with <!-- and doesn't end until almost the end of the HTML file.
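To illustrate the structure (a minimal, made-up snippet, not the actual page source): because the table markup sits inside <!-- ... -->, BeautifulSoup treats the whole block as a single Comment string, so a normal find() for the div returns nothing.

from bs4 import BeautifulSoup

# Illustrative stand-in for the structure described above,
# not the real baseball-reference markup.
html = """
<!--
<div class="table_container" id="div_players_standard_batting">
  <table><tr><td>...</td></tr></table>
</div>
-->
"""
soup = BeautifulSoup(html, "html.parser")

# The div lives inside a comment, so it is invisible to normal tag searches:
print(soup.find("div", id="div_players_standard_batting"))  # None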

I can pull the HTML comments with the following code, but it comes with a few issues.

  1. It is in a list, and I only care about the element that has the data.
  2. It comes with newline characters.
  3. I am struggling with how to take the players' standard batting string and reparse it as HTML so that I can use BeautifulSoup to grab the data I want.

Code:

from bs4 import BeautifulSoup, Comment
import pandas as pd
import requests

r = requests.get("https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml")
soup = BeautifulSoup(r.content, "html.parser")  # try lxml

# Extract every comment node in the document into a list
Data = [x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment))]
Data
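A quick sketch of how that list could be narrowed down (assuming the id quoted above appears verbatim inside the comment text):

# Keep only the comment(s) whose text contains the table's id;
# on this page exactly one comment should match.
matches = [c for c in Data if 'id="div_players_standard_batting"' in c]
print(len(Data), len(matches))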

Current Environment Settings:

dependencies:
  - python=3.9.7
  - beautifulsoup4=4.11.1
  - jupyterlab=3.3.2
  - pandas=1.4.2
  - pyodbc=4.0.32

The end goal: a pandas DataFrame containing each player's data from this web page.

EDIT:

ANSWER:

Changes made to reach my goal: I installed the lxml package into my environment via the Anaconda Prompt, then used the following line of code to pull the HTML data into a dataframe (provided by HedgeHog - thank you!):

pd.read_html([x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0])[0]

Solution

  • You are on the right track; you just have to put the individual parts together.

    In the ResultSet there should be only one element containing id="div_players_standard_batting", so filter for it and pass that element to pandas.read_html() to transform it into a DataFrame:

    pd.read_html([x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0])[0]
    

    or, as an alternative, create a new BeautifulSoup object and iterate over the table rows:

    soup = BeautifulSoup([x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0], "html.parser")
    for row in soup.select('table tr'):
        ...
    

    Output:

    Rk Name Age Tm Lg G PA AB R H 2B 3B HR RBI SB CS BB SO BA OBP SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
    0 1 Fernando Abad* 35 BAL AL 2 0 0 0 0 0 0 0 0 0 0 0 0 nan nan nan nan nan 0 0 0 0 0 0 1
    1 2 Cory Abbott 25 CHC NL 8 3 3 0 1 0 0 0 0 0 0 0 1 0.333 0.333 0.333 0.667 81 1 0 0 0 0 0 /1H
    2 3 Albert Abreu 25 NYY AL 3 0 0 0 0 0 0 0 0 0 0 0 0 nan nan nan nan nan 0 0 0 0 0 0 1
    3 4 Bryan Abreu 24 HOU AL 1 0 0 0 0 0 0 0 0 0 0 0 0 nan nan nan nan nan 0 0 0 0 0 0 1
    4 5 José Abreu 34 CHW AL 152 659 566 86 148 30 2 30 117 1 0 61 143 0.261 0.351 0.481 0.831 125 272 28 22 0 10 3 *3D/5
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    1787 1720 Bruce Zimmermann* 26 BAL AL 2 4 4 0 0 0 0 0 0 0 0 0 3 0 0 0 0 -100 0 0 0 0 0 0 1
    1788 1721 Jordan Zimmermann 35 MIL NL 2 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 -100 0 0 0 0 0 0 /1
    1789 1722 Tyler Zuber 26 KCR AL 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 -100 0 0 0 0 0 0 1
    1790 1723 Mike Zunino 30 TBR AL 109 375 333 64 72 11 2 33 62 0 0 34 132 0.216 0.301 0.559 0.86 137 186 7 7 0 1 0 2/H
    1791 nan LgAvg per 600 PA nan nan nan 205 600 535 73 130 26 2 20 69 7 2 52 139 0.243 0.316 0.41 0.726 nan 219 11 7 2 4 2 nan

    EDIT

    To get rid of unwanted rows (the repeated header rows and the league-average footer), exclude rows whose Rk column is NaN or the literal string 'Rk':

    df1 = df1[(~df1.Rk.isna()) & (df1.Rk != 'Rk')]
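
    Putting everything together, here is a sketch of the full pipeline (same URL and element id as above; the dtype conversion at the end is an extra assumption, since the repeated header rows leave the numeric columns as object dtype):

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup, Comment

    url = "https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml"
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # Find the single comment that wraps the batting table and feed it to pandas.
    comment = next(c for c in soup.find_all(string=lambda t: isinstance(t, Comment))
                   if 'id="div_players_standard_batting"' in c)
    df1 = pd.read_html(str(comment))[0]  # read_html returns a list of tables; take the first

    # Drop the repeated header rows and the league-average footer row.
    df1 = df1[(~df1.Rk.isna()) & (df1.Rk != 'Rk')].reset_index(drop=True)

    # Optional (an assumption, not part of the answer above): coerce numeric
    # columns back to numbers, skipping the text columns by name
    # (column names taken from the output shown above).
    for col in df1.columns.drop(['Name', 'Tm', 'Lg', 'Pos Summary']):
        df1[col] = pd.to_numeric(df1[col], errors='coerce')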