python django beautifulsoup html5lib

BeautifulSoup - how should I obtain the body contents


I'm parsing HTML with BeautifulSoup. At the end, I would like to obtain the body contents, but without the body tags. However, BeautifulSoup adds html, head, and body tags. In this Google Groups discussion, one possible solution is proposed:

>>> from bs4 import BeautifulSoup as Soup
>>> soup = Soup('<p>Some paragraph</p>')
>>> soup.body.hidden = True
>>> soup.body.prettify()
u' <p>\n  Some paragraph\n </p>'

This solution is a hack. There should be a better, more obvious way to do it.


Solution

  • Do you mean getting everything in between the body tags?

    In that case, you can use:

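    # Python 2 code; in Python 3, urllib2.urlopen is urllib.request.urlopen.
    import urllib2
    from bs4 import BeautifulSoup

    page = urllib2.urlopen('some_site').read()  # 'some_site' is a placeholder URL
    soup = BeautifulSoup(page)
    body = soup.find('body')
    # Direct children of <body> that are tags; bare text nodes are not included.
    the_contents_of_body_without_body_tags = body.findChildren(recursive=False)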
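
If the goal is the body's inner markup as a single string, text nodes included, one possible approach is `decode_contents()`, sketched below. This is not from the original answer: it assumes Python 3 and bs4 with the built-in `html.parser`, and the sample HTML and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = (
    "<html><head><title>t</title></head>"
    "<body><p>Some paragraph</p> tail text</body></html>"
)

soup = BeautifulSoup(html, "html.parser")
body = soup.body

# Inner markup of <body> as one string, without the <body> tags themselves.
inner_html = body.decode_contents()

# Roughly equivalent for this input: serialize each direct child
# (tags and text nodes) and join them.
inner_html_joined = "".join(str(child) for child in body.contents)

# For comparison: findChildren(recursive=False) returns only tag children,
# so the trailing text node is not part of the result.
tag_children = body.findChildren(recursive=False)

print(inner_html)
```

Unlike the list returned by `findChildren`, this keeps text that sits directly inside the body, which may or may not be what you want.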