python web-scraping beautifulsoup

Python - Easiest way to scrape text from list of URLs using BeautifulSoup


What's the easiest way to scrape just the text from a handful of webpages (using a list of URLs) using BeautifulSoup? Is it even possible?

Best, Georgina


Solution

  • import urllib2
    import BeautifulSoup
    import re
    
    Newlines = re.compile(r'[\r\n]\s+')
    
    def getPageText(url):
        # given a url, get page content
        data = urllib2.urlopen(url).read()
        # parse as html structured document
        bs = BeautifulSoup.BeautifulSoup(data, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
        # kill javascript content
        for s in bs.findAll('script'):
            s.replaceWith('')
        # find body and extract text
        txt = bs.find('body').getText('\n')
        # remove multiple linebreaks and whitespace
        return Newlines.sub('\n', txt)
    
    def main():
        urls = [
            'http://www.stackoverflow.com/questions/5331266/python-easiest-way-to-scrape-text-from-list-of-urls-using-beautifulsoup',
            'http://stackoverflow.com/questions/5330248/how-to-rewrite-a-recursive-function-to-use-a-loop-instead'
        ]
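        # collect one text string per URL, in the same order as urls
        txt = [getPageText(url) for url in urls]
        # hand the result back so callers can use it
        return txt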
    
    if __name__=="__main__":
        main()
    

    It now removes JavaScript and decodes HTML entities. Note that the code targets Python 2 (urllib2) and BeautifulSoup 3 (the BeautifulSoup module).
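
For anyone on Python 3, here is a rough sketch of the same approach using `urllib.request` and the `bs4` package (BeautifulSoup 4). It is not the original answer's code: the names `html_to_text` and `get_page_text` are made up for illustration, and stripping `style` elements and falling back to the whole document when there is no `body` are my own additions. It assumes `beautifulsoup4` is installed and uses the built-in `html.parser` backend.

```python
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

# A line break plus any whitespace that follows it collapses to one newline
NEWLINES = re.compile(r'[\r\n]\s+')


def html_to_text(html):
    """Return the text of an HTML document (hypothetical helper name)."""
    # bs4 decodes HTML entities while parsing, so no extra option is needed
    soup = BeautifulSoup(html, 'html.parser')
    # Drop elements whose contents are code or styling rather than page text
    for tag in soup.find_all(['script', 'style']):
        tag.decompose()
    # Assumption: fall back to the whole document if there is no <body>
    root = soup.body or soup
    text = root.get_text('\n')
    return NEWLINES.sub('\n', text).strip()


def get_page_text(url):
    """Download one page and return its text (hypothetical helper name)."""
    with urlopen(url) as response:
        return html_to_text(response.read())


if __name__ == '__main__':
    # Offline demo on an inline snippet, so nothing is fetched here
    sample = ('<html><body><h1>Fish &amp; Chips</h1>'
              '<script>var x = 1;</script><p>Hello</p></body></html>')
    print(html_to_text(sample))
```

With a list of URLs you would then do something like `texts = [get_page_text(url) for url in urls]`. Keeping the parsing in its own function means you can try it on a saved HTML string before hitting any real site.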