python boilerpipe

Ignore SSL verification for boilerpipe python wrapper web extractor?


I'm trying to extract text from a large number of sites that don't have valid SSL certificates. I'm using the boilerpipe Python wrapper to extract the text without HTML and write it to a text file.

I know how to disable SSL certificate verification in the requests library, but I can't find a way to do it with boilerpipe. Boilerpipe is an amazing Java library for preparing scraped data for NLP, so I'd love to be able to use it in Python.

Here's the code I'm running:

from boilerpipe.extract import Extractor

for url in urls:
    extractor = Extractor(url='http://www.' + url)
    extracted_text = extractor.getText()
    with open('websitestext.txt', 'a') as webtextfile:
        webtextfile.write(extracted_text)

And here's the error I think is causing the problem (SSL certificate verification):

urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:748)>

Solution

  • It seems I found a solution: override Python's default HTTPS context so that certificates aren't verified:

    import ssl

    try:
        _create_unverified_https_context = ssl._create_unverified_context
    except AttributeError:
        # Legacy Python that doesn't verify HTTPS certificates by default
        pass
    else:
        # Handle target environment that doesn't support HTTPS verification
        ssl._create_default_https_context = _create_unverified_https_context
    

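    And by adding exception handling, so that a site that fails is skipped instead of stopping the loop:

    for url in urls:
        try:
            extractor = Extractor(url='http://www.' + url)
            extracted_text = extractor.getText()
        except Exception:
            # Skip sites that fail, so stale or undefined text is never written
            continue
        with open('websitestext.txt', 'a') as webtextfile:
            webtextfile.write(extracted_text)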
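
As a possible variation on the above, here is a minimal sketch that limits the override to the extraction loop and restores normal verification afterwards, since turning verification off for the whole process also removes the protection for every other HTTPS request the script makes. The names `unverified_https` and `extract_all` are hypothetical helpers made up for this example, not part of boilerpipe, and the sketch assumes the wrapper downloads pages through urllib (as the traceback suggests), so that it picks up the private `ssl._create_default_https_context` hook.

```python
import contextlib
import ssl


@contextlib.contextmanager
def unverified_https():
    """Temporarily make the default HTTPS context skip certificate checks.

    Hypothetical helper: it swaps the same private hook used above and
    puts the original back when the block exits.
    """
    original = ssl._create_default_https_context
    ssl._create_default_https_context = ssl._create_unverified_context
    try:
        yield
    finally:
        # Restore verification even if the body raises
        ssl._create_default_https_context = original


def extract_all(urls, extract, out_path):
    """Append extract(url) for every URL to out_path.

    `extract` is any callable taking a full URL and returning text.
    Returns the list of URLs whose extraction raised an exception.
    """
    failed = []
    with unverified_https(), open(out_path, 'a') as out:
        for url in urls:
            try:
                text = extract('http://www.' + url)
            except Exception:
                # Record the failure and move on to the next site
                failed.append(url)
                continue
            out.write(text)
    return failed
```

With boilerpipe this could be called as `extract_all(urls, lambda u: Extractor(url=u).getText(), 'websitestext.txt')`. Returning the failed URLs makes it easier to see which sites were skipped than silently passing over them.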