python, python-3.4, urllib, cookiejar

Broken Link Checker Fails HEAD Requests


I am building a broken link checker using Python 3.4 to help ensure the quality of a large collection of articles that I manage. Initially I was using GET requests to check whether a link was viable; however, I am trying to be as polite as possible when pinging the URLs I check, so I make sure that I never re-check a URL that has already tested as working, and I have switched to sending only HEAD requests.
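For the "check each working URL only once" part, this is roughly what I mean (the set and helper names below are just illustrative, not taken from my actual checker):

# Sketch: cache URLs that have already been verified so they are only requested once.
verified_urls = set()

def needs_check(url):
    """Return True if this URL has not yet been verified as working."""
    return url not in verified_urls

def mark_verified(url):
    verified_urls.add(url)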

However, I have found a site that causes the checker to simply stop. It neither throws an error nor opens:

https://www.icann.org/resources/pages/policy-2012-03-07-en

The link itself is fully functional, so ideally I'd like to find a way to process similar links. The following Python 3.4 code reproduces the issue:

import urllib.request
from http.cookiejar import CookieJar

URL = 'https://www.icann.org/resources/pages/policy-2012-03-07-en'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'gzip, deflate, sdch',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
req = urllib.request.Request(URL, None, headers, method='HEAD')
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
response = opener.open(req)  # hangs here: never returns and never raises

As it does not throw an error, I really do not know how to troubleshoot this further beyond narrowing it down to the link that halted the entire checker. How can I check if this link is valid?
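The only idea I have sketched so far (illustrative code, not part of my actual checker) is to pass a timeout to opener.open() so that a hanging request raises instead of blocking forever, and to retry a URL once with GET when HEAD fails:

import socket
import urllib.error
import urllib.request

def check_url(opener, url, headers, timeout=10):
    """Return the HTTP status code, or None if the URL could not be reached at all."""
    for method in ('HEAD', 'GET'):  # some servers mishandle HEAD, so fall back to GET
        req = urllib.request.Request(url, None, headers, method=method)
        try:
            with opener.open(req, timeout=timeout) as response:
                return response.getcode()
        except urllib.error.HTTPError as err:
            return err.code  # e.g. 404 means the link is broken
        except (urllib.error.URLError, socket.timeout):
            continue  # timed out or refused; try the next method
    return None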


Solution

    from bs4 import BeautifulSoup
    import urllib.request
    import re
    import requests
    import ssl

    # Skip certificate verification so misconfigured HTTPS sites do not abort the scan.
    ssl._create_default_https_context = ssl._create_unverified_context

    def getStatus(url):
        """Return the HTTP status code of a GET request as a string."""
        a = requests.get(url, verify=False)
        return str(a.status_code)


    alllinks = []
    passlinks = []
    faillinks = []
    html_page = urllib.request.urlopen("https://link")

    soup = BeautifulSoup(html_page, "html.parser")
    for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
        status = getStatus(link.get('href'))
        result = ('URL---->', link.get('href'), 'Status---->', status)
        alllinks.append(result)

        if status == '200':
            passlinks.append(result)
        else:
            faillinks.append(result)


    print(alllinks)
    print(passlinks)
    print(faillinks)
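
To stay closer to the question's goal of polite HEAD checks, getStatus could also try HEAD first and fall back to GET only when the server rejects or ignores it. This is just a sketch of that variation, not part of the answer above:

    def getStatus(url):
        """Try a lightweight HEAD first; fall back to GET if the server rejects or ignores it."""
        try:
            a = requests.head(url, allow_redirects=True, timeout=10, verify=False)
            if a.status_code < 400:
                return str(a.status_code)
        except requests.RequestException:
            pass
        a = requests.get(url, timeout=10, verify=False)
        return str(a.status_code)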