python, amazon-web-services, beautifulsoup, common-crawl

Beautiful Soup takes too much time for text extraction in Common Crawl data


I have to parse the HTML content in the Common Crawl dataset (warc.gz files). I decided to use the bs4 (BeautifulSoup) module, as most people suggest it. The following is the code snippet that gets the text:

from bs4 import BeautifulSoup

soup = BeautifulSoup(src, "lxml")
[x.extract() for x in soup.findAll(['script', 'style'])]
txt = soup.get_text().encode('utf8')

Without bs4, one file is completely processed in 9 minutes (test case), but if I use bs4 to parse the text, the job finishes in about 4 hours. Why is this happening? Is there a better solution than bs4? Note: bs4 is the package that contains classes such as BeautifulSoup.
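For reference, a minimal way to time just the parsing step on a single record (assuming src holds that record's HTML, as in the snippet above) would be:

import time
from bs4 import BeautifulSoup

start = time.time()
soup = BeautifulSoup(src, "lxml")               # parse one record
for tag in soup.find_all(['script', 'style']):
    tag.extract()                               # drop scripts and styles
txt = soup.get_text().encode('utf8')            # plain text, as in the snippet above
print("one record took %.3f s" % (time.time() - start))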


Solution

  • The main time-consuming part here is extracting the tags in the list comprehension. With lxml and Python regular expressions you can do it as follows:

    import re
    
    # DOTALL lets .*? match across the newlines inside multi-line scripts
    script_pat = re.compile(r'<script.*?</script>', re.DOTALL | re.IGNORECASE)
    
    # find all script tags
    script_pat.findall(src)
    
    # do your stuff: strip the script blocks out
    print(script_pat.sub('', src))
    

    Using lxml you can do it like this:

    from lxml import html
    et = html.fromstring(src)
    
    # remove the <script> elements together with their contents
    [x.drop_tree() for x in et.xpath('//script')]
    
    # do your stuff
    print(html.tostring(et))
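
    For the original goal of getting plain text, the same idea can be taken one step further with text_content(), lxml's counterpart to get_text(). A minimal sketch (the helper name record_to_text is only illustrative, and //style is included to mirror the original snippet):

    from lxml import html
    
    def record_to_text(src):
        """Return the visible text of one HTML record, with scripts and styles removed."""
        et = html.fromstring(src)
        # drop the elements together with their contents, like soup.extract() does
        for bad in et.xpath('//script | //style'):
            bad.drop_tree()
        # text_content() gathers the remaining text nodes, similar to get_text()
        return et.text_content()

    Since both the tree building and the text gathering run in lxml's C code, this tends to be considerably faster than going through BeautifulSoup's get_text() on large WARC records.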