Tags: python, web-scraping, text, beautifulsoup, html-content-extraction

How to scrape only visible webpage text with BeautifulSoup?


Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. I mainly want the body text (the article) and maybe a few tab names here and there. I tried the suggestion in this SO question, but it returns lots of <script> tags and HTML comments, which I don't want. I can't figure out which arguments to pass to findAll() to get just the visible text on a webpage.

So, how should I find all visible text, excluding scripts, comments, CSS, etc.?


Solution

  • Try this:

    from bs4 import BeautifulSoup
    from bs4.element import Comment
    import urllib.request
    
    
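    def tag_visible(element):
        # Skip strings whose parent tag is never rendered as page text.
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        # Skip HTML comments, which are text nodes too.
        if isinstance(element, Comment):
            return False
        return True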
    
    
    def text_from_html(body):
        soup = BeautifulSoup(body, 'html.parser')
        # string=True matches every text node (bs4 >= 4.4.0; older versions spell it text=True).
        texts = soup.find_all(string=True)
        visible_texts = filter(tag_visible, texts)
        return " ".join(t.strip() for t in visible_texts)
    
    html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
    print(text_from_html(html))
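
To check what the filter keeps without hitting the network, the same idea can be run on an inline page. This is only a sketch: SAMPLE_HTML, HIDDEN_PARENTS and visible_text are made-up names for illustration, and it assumes bs4 4.4.0 or newer. It also differs from the answer in one respect: it splits each string into words so that whitespace-only nodes and runs of spaces don't end up in the output.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page, invented for this sketch.
SAMPLE_HTML = """
<html>
  <head>
    <title>Storm report</title>
    <style>p { color: black; }</style>
  </head>
  <body>
    <!-- navigation placeholder -->
    <script>var tracking = 1;</script>
    <ul><li>Home</li><li>U.S.</li></ul>
    <p>Snow fell   across
       the region.</p>
  </body>
</html>
"""

# Same parent names the answer filters on.
HIDDEN_PARENTS = {'style', 'script', 'head', 'title', 'meta', '[document]'}


def tag_visible(element):
    # A string is hidden if its parent tag is not rendered or if it is a comment.
    if element.parent.name in HIDDEN_PARENTS:
        return False
    return not isinstance(element, Comment)


def visible_text(html):
    soup = BeautifulSoup(html, 'html.parser')
    words = []
    for string in soup.find_all(string=True):
        if tag_visible(string):
            # split() with no arguments collapses whitespace runs and
            # yields nothing for whitespace-only nodes.
            words.extend(string.split())
    return " ".join(words)


result = visible_text(SAMPLE_HTML)
print(result)  # prints: Home U.S. Snow fell across the region.
```

Note that "visible" here means "not inside a script, style, head or comment node". Text hidden by CSS (for example display: none) still comes through, since BeautifulSoup does not evaluate stylesheets.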