python beautifulsoup

How can I get the contents of the first `td` in a `table`?


I want to get the content of the first td of a table in an HTML document. For example, I have this table:

<table class="bp_ergebnis_tab_info">
    <tr>
            <td>
                     This is a sample text
            </td>

            <td>
                     This is the second sample text
            </td>
    </tr>
</table>

How can I use BeautifulSoup to get the text "This is a sample text"? I use soup.findAll('table', attrs={'class': 'bp_ergebnis_tab_info'}) to get the whole table.

The target is http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=799.601437941842&SchulAdresseMapDO=142323

Note: since the HTML is somewhat invalid, I think we will have to do some cleaning first.


Solution

  • First find the table (as you are doing). Using find rather than findAll returns the first match directly instead of a list of all matches (with findAll we would have to add an extra [0] to take the first element of the list):

    table = soup.find('table', attrs={'class': 'bp_ergebnis_tab_info'})
    

    Then use find again to find the first td:

    first_td = table.find('td')
    

    Then use get_text() to extract the text (encode_contents() would instead return the cell's inner markup as an encoded byte string):

    text = first_td.get_text()
    

    ... and the job is done, though you may also want to use strip() to remove leading and trailing whitespace:

    trimmed_text = text.strip()
    

    This should give:

    >>> print(trimmed_text)
    This is a sample text
    >>>
    

    as desired.
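
Putting the steps together, here is a minimal self-contained sketch. It assumes the bs4 package (BeautifulSoup 4) and Python's built-in "html.parser"; the helper name first_cell_text is hypothetical and only for illustration, and the markup is the sample table from the question rather than the live page. It returns None when the table or cell is missing, which can happen if the real page's HTML is malformed.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (not fetched from the live page).
html = """
<table class="bp_ergebnis_tab_info">
    <tr>
        <td>
            This is a sample text
        </td>
        <td>
            This is the second sample text
        </td>
    </tr>
</table>
"""


def first_cell_text(markup):
    """Return the stripped text of the first td in the target table, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    # find() returns the first match, or None if nothing matches.
    table = soup.find("table", attrs={"class": "bp_ergebnis_tab_info"})
    if table is None:
        return None
    first_td = table.find("td")
    if first_td is None:
        return None
    # get_text() returns the cell's text; strip() trims the surrounding whitespace.
    return first_td.get_text().strip()


print(first_cell_text(html))
```

Checking for None before each step avoids an AttributeError when the page does not contain the expected table.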