I'm trying to extract data from numerous sites that don't have valid SSL certificates. I'm using the boilerpipe Python wrapper to extract the text without HTML and write it to a text file.
I know how to disable SSL certificate verification in the requests library, but I can't find an equivalent for boilerpipe. Boilerpipe is an amazing Java library for preparing scraped data for NLP, so I'd love to be able to use it in Python.
Here's the code I'm running:
from boilerpipe.extract import Extractor

for url in urls:
    extractor = Extractor(url='http://www.' + url)
    extracted_text = extractor.getText()
    with open('websitestext.txt', 'a') as webtextfile:
        webtextfile.write(extracted_text)
And here's the error, which I believe comes from SSL certificate verification:
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:748)>
It seems I found a solution by turning off certificate verification for the whole process:
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    # Legacy Python that doesn't verify HTTPS certificates by default
    pass
else:
    # Handle target environment that doesn't support HTTPS verification
    ssl._create_default_https_context = _create_unverified_https_context
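The patch above changes the default for every HTTPS request urllib makes in the process. A narrower variant, as a rough sketch: wrap the same swap in a context manager so the original default is restored afterwards. The name `unverified_https` is made up for illustration, and it relies on the same private `ssl` attributes as the snippet above, so it may not be available on every Python version.

```python
import contextlib
import ssl


@contextlib.contextmanager
def unverified_https():
    """Temporarily make urllib skip HTTPS certificate verification."""
    # Remember the current default so it can be restored afterwards
    original = ssl._create_default_https_context
    ssl._create_default_https_context = ssl._create_unverified_context
    try:
        yield
    finally:
        # Restore verification even if the body raised an exception
        ssl._create_default_https_context = original
```

The extraction calls would then go inside a `with unverified_https():` block, leaving verification enabled for the rest of the program.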
And by adding exception handling around the extraction:
for url in urls:
    try:
        extractor = Extractor(url='http://www.' + url)
        extracted_text = extractor.getText()
    except Exception:
        # Skip sites that fail so stale or undefined text is never written
        continue
    with open('websitestext.txt', 'a') as webtextfile:
        webtextfile.write(extracted_text)
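Silently skipping failures makes it hard to tell which sites were dropped. One possible refinement, sketched below, is to collect the failing URLs while writing the rest. The function `extract_all` and its `extract` parameter are hypothetical names for illustration; `extract` stands in for any callable that returns text for a URL, for example `lambda u: Extractor(url=u).getText()`.

```python
def extract_all(urls, extract, out_path):
    """Append extracted text for each URL to out_path; return the failures."""
    failed = []
    with open(out_path, 'a', encoding='utf-8') as out:
        for url in urls:
            try:
                text = extract('http://www.' + url)
            except Exception as exc:
                # Record the failure and move on to the next site
                failed.append((url, repr(exc)))
                continue
            out.write(text + '\n')
    return failed
```

The returned list of `(url, error)` pairs can be printed or logged after the run to see which sites need another look.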