Hi, I am building a blogging website in Django 1.8 with Python 3. In the blog, users will write posts and sometimes add external links. I want to crawl all the pages of this blog website and test whether every external link provided by the users is valid or not.
How can I do this? Should I use something like Python Scrapy?
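You don't necessarily need Scrapy for the crawling half: the posts already live in your own database, so you can pull each post's HTML out of the ORM, extract the hrefs with the standard library, and feed each one to a checker like the one in the answer below. A minimal sketch, assuming the posts are stored as HTML; the Post model and its content field are made-up names, so swap in your own:

from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a chunk of HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def external_links(html):
    parser = LinkCollector()
    parser.feed(html)
    # Keep only absolute http(s) URLs; relative links point back into the blog itself
    return [link for link in parser.links if link.startswith(('http://', 'https://'))]

# Inside Django you would iterate over the posts, roughly like this
# (Post and content are assumed names, not part of your code):
#
# from blog.models import Post
# for post in Post.objects.all():
#     for link in external_links(post.content):
#         ...check the link here...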
import urllib.request   # urllib2 in Python 2; urllib.request in Python 3
import fnmatch

def site_checker(url):
    # Prepend http:// when the URL has no scheme
    url_chk = url.split('/')
    if not fnmatch.fnmatch(url_chk[0], 'http*'):
        url = 'http://%s' % url
    print(url)
    try:
        # If the page comes back with any content, treat the link as valid
        response = urllib.request.urlopen(url).read()
        if response:
            print('site is legit')
    except Exception:
        print("not a legit site yo!")

site_checker('google')             ## not a complete url
site_checker('http://google.com')  ## this works
Hopefully this works. urllib will read the HTML of the site; if it isn't empty, the site is treated as valid, otherwise it isn't. I also added a check that prepends http:// when the URL doesn't already start with it.
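If you would rather not download every page just to see whether the body is empty, a HEAD request asks the server for the status line and headers only. This is just a sketch of that variant in Python 3, with 400 as an arbitrary cutoff for "broken"; note that a few servers refuse HEAD, in which case falling back to a plain urlopen like the code above is the safer route:

import urllib.request
import urllib.error

def site_checker(url, timeout=5):
    # Prepend a scheme when the user typed a bare domain
    if not url.startswith(('http://', 'https://')):
        url = 'http://%s' % url
    # A HEAD request fetches only the headers, so no page body is downloaded
    request = urllib.request.Request(url, method='HEAD')
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Anything below 400 (2xx success, 3xx redirect) counts as valid
            return response.getcode() < 400
    except (urllib.error.URLError, ValueError):
        return False

print(site_checker('google'))             # False: the bare name does not resolve
print(site_checker('http://google.com'))  # True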