python django testing web-scraping external-links

How to test external URLs or links in a Django website?


Hi, I am building a blogging website in Django 1.8 with Python 3. On the blog, users will write posts and sometimes add external links. I want to crawl all the pages of this blog website and test whether every external link provided by the users is valid.

How can I do this? Should I use something like Python Scrapy?


Solution

    import urllib.request
    import fnmatch

    def site_checker(url):

        # Prepend http:// when the URL has no scheme of its own.
        if not fnmatch.fnmatch(url.split('/')[0], 'http*'):
            url = 'http://%s' % url
        print(url)

        try:
            # Fetch the page; any non-empty response counts as a live site.
            response = urllib.request.urlopen(url).read()
            if response:
                print('site is legit')
        except Exception:
            print('not a legit site yo!')

    site_checker('google')             ## not a complete url
    site_checker('http://google.com')  ## this works
    

    Hopefully this works. urllib will read the HTML of the site; if the response isn't empty, it's a legit site, otherwise it's not. I also added a check that prepends http:// when the URL doesn't include a scheme.
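
    The snippet above only validates a single URL, though. To cover the crawling half of the question, here is a minimal stdlib-only sketch that walks every internal page starting from one URL, collects the external links it finds, and runs each one through site_checker. Note the assumptions: the start URL http://localhost:8000/ is just the default Django dev-server address, the page limit is arbitrary, and find_external_links/LinkExtractor are hypothetical names. For a large site a dedicated tool like Scrapy would scale better, but this shows the idea.

    import urllib.request
    import urllib.parse
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collects the href of every <a> tag it is fed."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                href = dict(attrs).get('href')
                if href:
                    self.links.append(href)

    def find_external_links(start_url, max_pages=100):
        """Breadth-first crawl of one site; returns the external links found."""
        domain = urllib.parse.urlparse(start_url).netloc
        to_visit, seen, external = [start_url], set(), set()
        while to_visit and len(seen) < max_pages:
            page = to_visit.pop(0)
            if page in seen:
                continue
            seen.add(page)
            try:
                html = urllib.request.urlopen(page).read().decode('utf-8', 'replace')
            except Exception:
                continue  # skip pages that fail to load
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                # Resolve relative links and drop #fragments.
                absolute = urllib.parse.urldefrag(urllib.parse.urljoin(page, link))[0]
                if urllib.parse.urlparse(absolute).netloc == domain:
                    to_visit.append(absolute)   # internal page: keep crawling
                elif absolute.startswith('http'):
                    external.add(absolute)      # external link: collect it
        return external

    for link in find_external_links('http://localhost:8000/'):  # assumed dev-server URL
        site_checker(link)

    One design note: urlopen does a full GET on every link, which is heavy if users link to large pages. If you only care whether a link resolves, a HEAD request (urllib.request.Request(url, method='HEAD')) asks the server for the headers only.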