I have to parse HTML content in the Common Crawl dataset (warc.gz files). I decided to use the bs4 (BeautifulSoup) module, as most people suggest it. The following is the code snippet I use to get the text:
from bs4 import BeautifulSoup
soup = BeautifulSoup(src, "lxml")
[x.extract() for x in soup.findAll(['script', 'style'])]
txt = soup.get_text().encode('utf8')
Without bs4, one file is completely processed in 9 minutes (test case), but if I use bs4 to parse the text, the job finishes in about 4 hours. Why is this happening? Is there a better solution than bs4?
Note: bs4 is the package that provides the BeautifulSoup class.
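For context, this is roughly how src reaches BeautifulSoup in my job (a minimal sketch assuming the warcio package; the file name is illustrative):

from bs4 import BeautifulSoup
from warcio.archiveiterator import ArchiveIterator  # assumed WARC reader, not part of the original code

with open('segment.warc.gz', 'rb') as stream:  # illustrative path
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            src = record.content_stream().read()
            soup = BeautifulSoup(src, 'lxml')
            for tag in soup.find_all(['script', 'style']):
                tag.extract()  # drop scripts/styles before get_text
            txt = soup.get_text()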
Here the main time-consuming part is extracting the tags in the list comprehension. With lxml and Python regular expressions you can do it like the following.
import re
# match <script>...</script> blocks, including ones that span multiple lines
script_pat = re.compile(r'<script.*?</script>', re.DOTALL | re.IGNORECASE)
# find all script tags
script_pat.findall(src)
# do your stuff
print(re.sub(script_pat, '', src))
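The same idea extends to stripping both script and style blocks in one pass (a sketch; src is assumed to be the decoded HTML string):

import re
# strip <script>...</script> and <style>...</style>, case-insensitively,
# allowing the blocks to span multiple lines
block_pat = re.compile(r'<(script|style)\b.*?</\1>', re.DOTALL | re.IGNORECASE)
cleaned = block_pat.sub('', src)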
Using lxml, you can do it like this:
from lxml import html
et = html.fromstring(src)
# remove the script elements (drop_tree removes the tag together with its contents)
for x in et.xpath('//script'):
    x.drop_tree()
# do your stuff
print(html.tostring(et))
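If the end goal is the same plain text that get_text() produces, lxml can return it directly once the unwanted elements are removed, avoiding the BeautifulSoup overhead entirely (again a sketch, assuming src holds the HTML):

from lxml import html
et = html.fromstring(src)
# drop script and style elements together with their contents
for bad in et.xpath('//script | //style'):
    bad.drop_tree()
txt = et.text_content()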