I want to scrape a Heritrix home page using Python's requests module. When I try to open this page in Chrome, I get the error:
This server could not prove that it is 10.100.121.41; its security
certificate is not trusted by your computer's operating system. This
may be caused by a misconfiguration or an attacker intercepting your
connection.
But I can proceed to the page. When I tried to scrape the same page using requests, I got an SSL error, and after a bit of digging I used the following code from an SO question: r = requests.get(url, auth=(username, password), verify=False)
That gives me the following warning: /usr/lib/python2.6/site-packages/requests/packages/urllib3/connectionpool.py:734: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html
and the response status code is 401. How can I solve this problem?
The 401 is an indication that you need to authenticate, but you're using the wrong method. The other very common authentication method that requests has built in is Digest Authentication. You can check whether the server expects Digest Authentication by inspecting:
r.headers.get('www-authenticate')
That header's value should contain Digest. (If it doesn't, then the server isn't expecting Digest Authentication.) You can use Digest Authentication in requests like so:
import requests
from requests.auth import HTTPDigestAuth

r = requests.get(url, auth=HTTPDigestAuth(username, password), verify=False)
The warning you see is not related to the 401; it is simply telling you that you are making an HTTPS request without certificate verification, so your connection could be intercepted by a man-in-the-middle attacker. If you want to silence that warning, you can do the following:
from requests.packages import urllib3
urllib3.disable_warnings()
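Putting the pieces together, here is a minimal sketch of the whole flow. The URL and credentials are placeholders for your Heritrix instance, and the actual request is shown commented out since it only succeeds against a live server:

```python
import requests
from requests.auth import HTTPDigestAuth
from requests.packages import urllib3

# Silence the InsecureRequestWarning that verify=False triggers.
urllib3.disable_warnings()

url = "https://10.100.121.41:8443"  # placeholder Heritrix address

# A Session lets you set auth and verification once for all requests.
session = requests.Session()
session.auth = HTTPDigestAuth("username", "password")  # placeholder credentials
session.verify = False  # self-signed certificate, so skip verification

# r = session.get(url)
# print(r.status_code)  # should no longer be 401 once Digest auth succeeds
```

Using a `Session` is optional; passing `auth=` and `verify=False` to each `requests.get` call, as above, works just as well.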