
Working with broken HTML + BeautifulSoup


I have some wonderfully broken HTML that, long story short, is preventing me from using the normal nested <table>, <tr>, <td> structure that would make it easy to reconstruct tables.

Here's a snippet with line numbers for reference:

1      <td valign="top">   <!-- closing </td> should be on 6 -->
2      <font face="arial" size="1">
3       <center>
4        06-30-95
5       </center>
6       <tr valign="top">
7        <td>
8         <center>
9          <font ,="" arial,="" face="arial" sans="" serif"="" size="1">
10          1382
11          <p>
12           (23)
13          </p>
14         </font>
15        </center>
16       </td>
17       <td>
18        <font ,="" arial,="" face="arial" sans="" serif"="" size="1">
19         <center>
20          06-18-14
21         </center>
22        </font>
23       </td>
24      </tr>
25    </td>    <!-- this should be on 6 -->

The nesting of trs within tds within trs follows no scheme whatsoever, and is coupled with unclosed tags to boot. The HTML tree in no way resembles how the page is structurally rendered. (In this case, I suppose there are technically no missing closing tags, but the actual rendering of the page makes it clear there should be no nested tds.)

However, playing by the following set of rules would work in this case:

Desired result here would be something like:

['06-30-95', '1382\n(23)', '06-18-14']

How can this be addressed in BeautifulSoup? I would show an attempt, but have picked through the docs and some of the source and not found much at all.

Currently this would parse to:

html = """
<td valign="top">
 <font face="arial" size="1">
  <center>
   06-30-95
  </center>
  <tr valign="top">
   <td>
    <center>
     <font ,="" arial,="" face="arial" sans="" serif"="" size="1">
      1382
      <p>
       (23)
      </p>
     </font>
    </center>
   </td>
   <td>
    <font ,="" arial,="" face="arial" sans="" serif"="" size="1">
     <center>
      06-18-14
     </center>
    </font>
   </td>
  </tr>
</td>
"""

from bs4 import BeautifulSoup, SoupStrainer

strainer = SoupStrainer('td')
soup = BeautifulSoup(html, 'html.parser', parse_only=strainer)
[tag.text.replace('\n', '') for tag in soup.find_all('td')]

['   06-30-95        1382             (23)            06-18-14     ',
 '      1382             (23)      ',
 '      06-18-14     ']

And my issue with that result is not the whitespace; it's the repetition of substrings. It almost seems like I'd need to recursively work upwards from the innermost tags, popping off each and working outwards. But I have to guess there's more built-in functionality for dealing with missing closing tags (handle_endtag stands out from the BeautifulSoup constructor?).


Solution

  • For wonderfully broken HTML, there are two ways you can go about this. The first is to find the tag that is most consistently opened and closed at the innermost nesting level, and use only its first occurrence within each <td>. In the limited example provided, the <center> tags satisfy this. Consider the following:

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(html, 'html.parser')
    >>> [t.find('center').text.strip() for t in soup.find_all('td')]
    ['06-30-95', '1382\n      \n       (23)', '06-18-14']
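
The second entry above still carries the source indentation between `1382` and `(23)`. Here is a minimal sketch of one way to reach the exact target list, assuming BeautifulSoup's `stripped_strings` generator is acceptable. The `html` string is a condensed copy of the question's snippet with the malformed `<font>` attributes dropped for brevity (an assumption that they do not affect the nesting), and `cells` is just an illustrative name:

```python
from bs4 import BeautifulSoup

# Condensed copy of the question's snippet (assumption: the malformed
# <font> attributes are omitted, since only the tag nesting matters here).
html = """
<td valign="top">
 <font face="arial" size="1">
  <center>
   06-30-95
  </center>
  <tr valign="top">
   <td>
    <center>
     <font face="arial" size="1">
      1382
      <p>
       (23)
      </p>
     </font>
    </center>
   </td>
   <td>
    <font face="arial" size="1">
     <center>
      06-18-14
     </center>
    </font>
   </td>
  </tr>
</td>
"""

soup = BeautifulSoup(html, "html.parser")

# For each <td>, take its first <center> and join the whitespace-stripped
# text fragments with a single newline.
cells = ["\n".join(td.find("center").stripped_strings)
         for td in soup.find_all("td")]
print(cells)
```

On this input the list is `['06-30-95', '1382\n(23)', '06-18-14']`. Like the snippet above, it still depends on every `<td>` containing a `<center>`.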
    

    Alternatively, switching the parser to lxml (which the BeautifulSoup documentation lists as an alternative parser, and which must be installed separately) may actually work better overall:

    >>> soup2 = BeautifulSoup(html, 'lxml')
    >>> [t.text.strip() for t in soup2.find_all('td')]
    ['06-30-95', '1382\n      \n       (23)', '06-18-14']
    

    Other methods are covered in this thread: Fast and effective way to parse broken HTML?
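
The repetition described in the question comes from `.text` collecting every descendant string, so the outer `<td>` also reports the text of the `<td>`s nested inside it. The following sketch is not part of the original answer and does not rely on `<center>` being present: for each `<td>`, it keeps only the strings whose nearest enclosing `<td>` is that cell. The helper name `own_text` is hypothetical, and `html` is a condensed copy of the question's snippet with the malformed `<font>` attributes dropped (an assumption that they do not affect the nesting):

```python
from bs4 import BeautifulSoup

# Condensed copy of the question's snippet (assumption: the malformed
# <font> attributes are omitted, since only the tag nesting matters here).
html = """
<td valign="top">
 <font face="arial" size="1">
  <center>
   06-30-95
  </center>
  <tr valign="top">
   <td>
    <center>
     <font face="arial" size="1">
      1382
      <p>
       (23)
      </p>
     </font>
    </center>
   </td>
   <td>
    <font face="arial" size="1">
     <center>
      06-18-14
     </center>
    </font>
   </td>
  </tr>
</td>
"""

soup = BeautifulSoup(html, "html.parser")


def own_text(td):
    """Join the text fragments that belong to `td` itself, skipping nested <td>s."""
    parts = []
    for s in td.find_all(string=True):
        # find_parent("td") returns the nearest enclosing <td> of this string;
        # keep the string only if that <td> is the cell being processed.
        if s.find_parent("td") is td and s.strip():
            parts.append(s.strip())
    return "\n".join(parts)


cells = [own_text(td) for td in soup.find_all("td")]
print(cells)
```

On this input each piece of text is attributed to exactly one cell, giving `['06-30-95', '1382\n(23)', '06-18-14']`. How well this carries over to other pages depends on how the chosen parser rebuilds the tree, so treat it as a starting point.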