I am having trouble scraping a certain site: the information I need is nested awkwardly, and the tables are not a consistent size.
Here is an example of the HTML:
<tbody>
<tr>
<td>
<a href="LINK">Player1</a>
</td>
<td>Position1</td>
<td>
<b>Player1 Injury</b>
<br>
"Date of injury1"
</td>
<td>
<a href="LINK" class="BUTTON"></a>
</td>
</tr>
<tr class="COLLAPSE"></tr>
<tr>
<td>
<a href="LINK">Player2</a>
</td>
<td>Position2</td>
<td>
<b>Player2 Injury</b>
<br>
"Date of injury2"
</td>
<td>
<a href="LINK" class="BUTTON"></a>
</td>
</tr>
<tr class="COLLAPSE"></tr>
</tbody>
Given this data, all I am trying to do is pull the <td> cells that contain each player's injury and the date of that injury.
If I call injury.find_all('td'), I of course get all the extra data that I am not looking for. The data I want will always be in the 3rd <td> tag of a row, so I need to find the 3rd <td> tag again in each new <tr>. Filtering out the class="COLLAPSE" rows should be easy, so hopefully that is not an issue.
The result I would like from scraping this data is:
['Player1 Injury', 'Date of injury1', 'Player2 Injury', 'Date of injury2']
All help is greatly appreciated.
Thanks for posting the HTML. Using that as an example, I think we need to iterate over each <tr> tag within the <tbody> tag, checking whether it has the "COLLAPSE" class.
If the <tr> tag doesn't have the "COLLAPSE" class, you can find all the <td> tags inside it and take the third one (index 2), which contains the player's injury and the date of the injury.
Code below:
from bs4 import BeautifulSoup
# HTML code
html = """
<tbody>
<tr>
<td>
<a href="LINK">Player1</a>
</td>
<td>Position1</td>
<td>
<b>Player1 Injury</b>
<br>
"Date of injury1"
</td>
<td>
<a href="LINK" class="BUTTON"></a>
</td>
</tr>
<tr class="COLLAPSE"></tr>
<tr>
<td>
<a href="LINK">Player2</a>
</td>
<td>Position2</td>
<td>
<b>Player2 Injury</b>
<br>
"Date of injury2"
</td>
<td>
<a href="LINK" class="BUTTON"></a>
</td>
</tr>
<tr class="COLLAPSE"></tr>
</tbody>
"""
# Parse the HTML
soup = BeautifulSoup(html, 'html.parser')
# Find all <tr> tags within the <tbody> tag
trs = soup.tbody.find_all('tr')
# Extract the player's injury and the date of their injury from each <tr> tag
injuries = []
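for tr in trs:
    # Only process rows that are not the empty class="COLLAPSE" spacer rows
    if not tr.has_attr('class') or 'COLLAPSE' not in tr['class']:
        tds = tr.find_all('td')
        # The third cell (index 2) holds the injury in <b> and the date after the <br>
        injury = tds[2].b.get_text().strip()
        # The sample text node contains literal double quotes, so strip those as well
        date = tds[2].find_all('br')[-1].next_sibling.strip().strip('"')
        injuries.append(injury)
        injuries.append(date)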
print(injuries)
# Output: ['Player1 Injury', 'Date of injury1', 'Player2 Injury', 'Date of injury2']
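Since you mentioned the table size is not consistent, here is an alternative sketch of the same idea using a CSS selector instead of indexing into tds[2]. It is only a sketch under a few assumptions: extract_injuries is a name I made up, the sample is a trimmed copy of your HTML wrapped in a <table>, and it relies on the select() support that ships with recent Beautiful Soup versions (via the soupsieve package). Rows with fewer than three cells simply produce no match, so there is no IndexError to guard against.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question: one player row and one
# COLLAPSE row, wrapped in <table> so the fragment is complete.
html = """
<table><tbody>
<tr>
<td><a href="LINK">Player1</a></td>
<td>Position1</td>
<td><b>Player1 Injury</b><br>"Date of injury1"</td>
<td><a href="LINK" class="BUTTON"></a></td>
</tr>
<tr class="COLLAPSE"></tr>
</tbody></table>
"""


def extract_injuries(markup):
    """Hypothetical helper: collect the text of the 3rd <td> of each data row."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    # Match the third <td> that is a direct child of any row without the
    # COLLAPSE class; rows that are too short yield no match at all.
    for cell in soup.select("tr:not(.COLLAPSE) > td:nth-of-type(3)"):
        # stripped_strings yields each text fragment in the cell with
        # surrounding whitespace removed; also drop the literal quotes.
        results.extend(text.strip('"') for text in cell.stripped_strings)
    return results


print(extract_injuries(html))
# ['Player1 Injury', 'Date of injury1']
```

The trade-off is that this collects every text fragment in the cell, so if the real page puts more than the injury and the date in that third cell, you would get extra entries and may prefer the explicit <b> / <br> lookup above.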