I would like to use the jusText implementation found at https://github.com/miso-belica/jusText to extract the clean content from an HTML page. Basically, it works like this:
import requests
import justext
response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    if not paragraph.is_boilerplate:
        print(paragraph.text)
I have already downloaded the pages that I want to parse with this tool (some of them are no longer available online) and extracted their HTML content. Since jusText appears to work only on the output of a request (a response object), I am wondering whether there is a way to set the content of a response object so that it contains the HTML text I would like to parse.
response.content is of <type 'str'>, i.e. a plain byte string in Python 2 (it is bytes in Python 3):
>>> from requests import get
>>> r = get("http://www.google.com/")
>>> type(r.content)
<type 'str'>
So no response object is needed; just pass your HTML string to jusText directly:
justext.justext(my_html_string, justext.get_stoplist("English"))
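For pages already saved on disk, the same idea can be sketched as below. This is only a minimal sketch: the helper names read_saved_html and extract_clean_text and the file name saved_page.html are hypothetical and are not part of jusText or requests. The file is read in binary mode so that jusText receives the same kind of raw bytes that response.content would have given it.

```python
def read_saved_html(path):
    """Return the raw bytes of an HTML file saved on disk (hypothetical helper)."""
    with open(path, "rb") as handle:
        return handle.read()


def extract_clean_text(html, language="English"):
    """Return the text of every non-boilerplate paragraph in `html` (hypothetical helper)."""
    # Imported here so that read_saved_html can be used even without jusText installed.
    import justext

    paragraphs = justext.justext(html, justext.get_stoplist(language))
    return [p.text for p in paragraphs if not p.is_boilerplate]
```

Assuming a file named saved_page.html exists, it could be used like this:

```python
for text in extract_clean_text(read_saved_html("saved_page.html")):
    print(text)
```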