pythonhtmlweb-scrapingtextbeautifulsoup

Converting html to text with Python


I am trying to convert an html block to text using Python.

Input:

<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>

Desired output:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa

Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

I tried the html2text module without much success:

#!/usr/bin/env python

import urllib2
import html2text
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://example.com/page.html').read())

txt = soup.find('div', {'class' : 'body'})

print(html2text.html2text(txt))

The txt object produces the html block above. I'd like to convert it to text and print it on the screen.


Solution

  • soup.get_text() outputs what you want:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html)
    print(soup.get_text())
    

    output:

    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
    Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
    Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
    Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
    

    To keep newlines:

    print(soup.get_text('\n'))
    

    To be identical to your example, you can replace a newline with two newlines:

    soup.get_text().replace('\n','\n\n')