python html parsing beautifulsoup

Beautiful Soup - Get all text, but preserve link html?


I have to process a large archive of extremely messy HTML full of extraneous tables, spans and inline styles into markdown.

I am trying to use Beautiful Soup for this. My goal is essentially the output of get_text(), except that anchor tags should be preserved with their href intact.

As an example, I would like to convert:

<td>
    <font><span>Hello</span><span>World</span></font><br>
    <span>Foo Bar <span>Baz</span></span><br>
    <span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span>
</td>

Into:

Hello World
Foo Bar Baz
Example Link: <a href="https://google.com">Google</a>

My approach so far was to grab every tag and unwrap it unless it is an anchor. However, this repeats the text several times, because soup.find_all(True) returns each nested tag as a separate element and get_text() on a parent already includes the text of its descendants:

#!/usr/bin/env python

from bs4 import BeautifulSoup

example_html = '<td><font><span>Hello</span><span>World</span></font><br><span>Foo Bar <span>Baz</span></span><br><span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span></td>'

soup = BeautifulSoup(example_html, 'lxml')
tags = soup.find_all(True)

for tag in tags:
    if (tag.name == 'a'):
        print("<a href='{}'>{}</a>".format(tag['href'], tag.get_text()))
    else:
        print(tag.get_text())

Which returns multiple fragments/duplicates as the parser moves down the tree:

HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorld
Hello
World

Foo Bar Baz
Baz

Example Link: Google
<a href='https://google.com'>Google</a>
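
The duplication can be reproduced on a smaller fragment. The sketch below lists every tag that find_all(True) returns next to its get_text(); it uses the built-in html.parser instead of lxml, which is an assumption made only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# A reduced version of the example: one span nested inside another.
fragment = '<td><span>Foo Bar <span>Baz</span></span></td>'
soup = BeautifulSoup(fragment, 'html.parser')

# find_all(True) yields the outer tags *and* every tag nested inside them,
# and get_text() on an outer tag already contains the inner tag's text.
pairs = [(tag.name, tag.get_text()) for tag in soup.find_all(True)]

for name, text in pairs:
    print(name, repr(text))
```

Each ancestor repeats the text of everything below it, which is why "Baz" shows up once per enclosing tag.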

Solution

  • One way to tackle this problem is to special-case a elements when the text of an element is collected.

    You can do this by overriding the _all_strings() method so that it yields the string representation of each a descendant and skips the navigable string inside that a element. Note that _all_strings() is a private method, so its signature may differ between Beautiful Soup versions. Something along these lines:

    from bs4 import BeautifulSoup, NavigableString, CData, Tag
    
    
    class MyBeautifulSoup(BeautifulSoup):
        def _all_strings(self, strip=False, types=(NavigableString, CData)):
            for descendant in self.descendants:
                # emit the whole "a" element as markup when we encounter it
                if isinstance(descendant, Tag) and descendant.name == 'a':
                    yield str(descendant)
                    continue
    
                # skip a text node directly inside "a"; it was already emitted above
                if isinstance(descendant, NavigableString) and descendant.parent.name == 'a':
                    continue
    
                # default behavior
                if (
                    (types is None and not isinstance(descendant, NavigableString))
                    or
                    (types is not None and type(descendant) not in types)):
                    continue
    
                if strip:
                    descendant = descendant.strip()
                    if len(descendant) == 0:
                        continue
                yield descendant
    

    Demo:

    In [1]: data = """
       ...: <td>
       ...:     <font><span>Hello</span><span>World</span></font><br>
       ...:     <span>Foo Bar <span>Baz</span></span><br>
       ...:     <span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span>
       ...: </td>
       ...: """
    
    In [2]: soup = MyBeautifulSoup(data, "lxml")
    
    In [3]: print(soup.get_text())
    
    HelloWorld
    Foo Bar Baz
    Example Link: <a href="https://google.com" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;" target="_blank">Google</a>
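
The demo output above still carries the style and target attributes, while the question asks for the href only. A possible alternative, sketched here under a few assumptions, avoids the private _all_strings() hook: walk the tree recursively, emit plain text, and rebuild each a element with just its href. The helper name text_with_links is hypothetical, html.parser is used instead of lxml to avoid an extra dependency, and turning br into a newline is an assumption based on the desired output in the question.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def text_with_links(node):
    """Return the text under `node`, keeping <a> tags with only their href."""
    parts = []
    for child in node.children:
        if type(child) is NavigableString:
            # Plain text node; the exact type check leaves out comments,
            # doctypes and other NavigableString subclasses.
            parts.append(str(child))
        elif isinstance(child, Tag):
            if child.name == 'a':
                # Rebuild the anchor with href only, dropping style, target, etc.
                href = child.get('href', '')
                parts.append('<a href="{}">{}</a>'.format(href, child.get_text()))
            elif child.name == 'br':
                # Assumption: a line break in the HTML becomes a newline.
                parts.append('\n')
            else:
                # Any other tag is "unwrapped": only its contents are kept.
                parts.append(text_with_links(child))
    return ''.join(parts)


html = ('<td><font><span>Hello</span><span>World</span></font><br>'
        '<span>Foo Bar <span>Baz</span></span><br>'
        '<span>Example Link: <a href="https://google.com" target="_blank" '
        'style="color: #395c99;">Google</a></span></td>')

soup = BeautifulSoup(html, 'html.parser')
result = text_with_links(soup)
print(result)
```

As with get_text(), the first line comes out as HelloWorld rather than Hello World, because the source has no whitespace between the two spans. The href value and link text are inserted as-is, so escaping would need to be added if the links may contain quotes or angle brackets.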