python web-scraping beautifulsoup findall

BeautifulSoup - Find <td> while filtering out <a> possibly?

I am having trouble scraping a site: most of the information is buried in awkward markup, and the table does not have a consistent size.

Here is an example of the HTML:

<tbody>
    <tr>
        <td>
            <a href="LINK">Player1</a>
        </td>
        <td>Position1</td>
        <td>
            <b>Player1 Injury</b>
            <br>
            "Date of injury1"
        </td>
        <td>
            <a href="LINK" class="BUTTON"></a>
        </td>
    </tr>
    <tr class="COLLAPSE"></tr>
    <tr>
        <td>
            <a href="LINK">Player2</a>
        </td>
        <td>Position2</td>
        <td>
            <b>Player2 Injury</b>
            <br>
            "Date of injury2"
        </td>
        <td>
            <a href="LINK" class="BUTTON"></a>
        </td>
    </tr>
    <tr class="COLLAPSE"></tr>
</tbody>

Given this data, all I am trying to do is pull the <td> elements that contain each player's injury and the date of that injury.

If I call

injury.find_all('td')

I will, of course, get all the extra data that I am not looking for. The data I want is always in the third <td> of a row, so I need to find the third <td> again in each new row. Filtering out the class="COLLAPSE" rows should be easy, so hopefully that is not an issue.

So, after scraping this data, I would like to end up with:

['Player1 Injury', 'Date of injury1', 'Player2 Injury', 'Date of injury2']

All help is greatly appreciated.

Solution

Thanks for posting the HTML. Using it as an example, I think we need to iterate over each <tr> tag within the <tbody> tag and check whether it has the "COLLAPSE" class.

If the <tr> tag doesn't have the "COLLAPSE" class, you can find all the <td> tags inside it and take the third one (index 2), which contains the player's injury (inside the <b> tag) and the date of the injury (the text after the <br>).

Code below:

from bs4 import BeautifulSoup

# HTML code
html = """
<tbody>
    <tr>
        <td>
            <a href="LINK">Player1</a>
        </td>
        <td>Position1</td>
        <td>
            <b>Player1 Injury</b>
            <br>
            "Date of injury1"
        </td>
        <td>
            <a href="LINK" class="BUTTON"></a>
        </td>
    </tr>
    <tr class="COLLAPSE"></tr>
    <tr>
        <td>
            <a href="LINK">Player2</a>
        </td>
        <td>Position2</td>
        <td>
            <b>Player2 Injury</b>
            <br>
            "Date of injury2"
        </td>
        <td>
            <a href="LINK" class="BUTTON"></a>
        </td>
    </tr>
    <tr class="COLLAPSE"></tr>
</tbody>
"""

# Parse the HTML
soup = BeautifulSoup(html, 'html.parser')

# Find all <tr> tags within the <tbody> tag
trs = soup.tbody.find_all('tr')

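# Extract the player's injury and the date of their injury from each <tr> tag
injuries = []
for tr in trs:
    # Skip the empty spacer rows marked with class="COLLAPSE"
    if 'COLLAPSE' in tr.get('class', []):
        continue
    tds = tr.find_all('td')
    # The injury name is the bold text in the third <td> (index 2)
    injury = tds[2].b.get_text().strip()
    # The date is the text node after the last <br>; in this sample it is
    # wrapped in literal double quotes, so strip those as well
    date = tds[2].find_all('br')[-1].next_sibling.strip().strip('"')
    injuries.append(injury)
    injuries.append(date)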

print(injuries)

# Output: ['Player1 Injury', 'Date of injury1', 'Player2 Injury', 'Date of injury2']
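
As an alternative to the explicit loop above, the same filtering can probably be expressed with a CSS selector. This is only a sketch: it assumes a Beautiful Soup version with select() support for :not() and :nth-of-type() (4.7 or later, which uses the soupsieve package), and it assumes the real page has the same layout as the sample, with the injury and date as the only text in the third <td>. The sample HTML is shortened here, and the literal double quotes around the date are stripped in case they are part of the page text.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample HTML from the question
html = """
<tbody>
    <tr>
        <td><a href="LINK">Player1</a></td>
        <td>Position1</td>
        <td>
            <b>Player1 Injury</b>
            <br>
            "Date of injury1"
        </td>
        <td><a href="LINK" class="BUTTON"></a></td>
    </tr>
    <tr class="COLLAPSE"></tr>
    <tr>
        <td><a href="LINK">Player2</a></td>
        <td>Position2</td>
        <td>
            <b>Player2 Injury</b>
            <br>
            "Date of injury2"
        </td>
        <td><a href="LINK" class="BUTTON"></a></td>
    </tr>
    <tr class="COLLAPSE"></tr>
</tbody>
"""

soup = BeautifulSoup(html, "html.parser")

injuries = []
# Select the third <td> of every row that does not have the COLLAPSE class
for td in soup.select("tr:not(.COLLAPSE) > td:nth-of-type(3)"):
    # stripped_strings yields each non-empty text fragment with the
    # surrounding whitespace removed; also drop the literal double quotes
    for text in td.stripped_strings:
        injuries.append(text.strip('"'))

print(injuries)
# Output: ['Player1 Injury', 'Date of injury1', 'Player2 Injury', 'Date of injury2']
```

Because the selector only matches rows that actually have a third <td>, rows with fewer cells are skipped rather than raising an IndexError, which may help if the table size really is inconsistent.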