I am trying to access historical google page rankings or alexa rankings over time to add some weightings on a search engine I am making for fun. This would be a separate function that I would call in Python (ideally) and pass in the paramaters of the URL and how long I wanted to get the average over, measured in days and then I could just use that information to weight my results!
I think it could be fun to work on, but I also feel that this may be easy to do with some trick of the APIs some guru might be able to show me and save me a few sleepless weeks! Can anyone help?
Thanks a lot !
If you look at the Alexa page for stack overflow you can see that next to the global rank it offers change of the site's rank over the past three months. This may not be down to the level of granularity that you would like, but you could scrape out this information relatively easily and I doubt that you would gain much additional information from looking at changes of different lengths of time. The long term answer is to collect and store the rankings yourself so that you have a historical record going forward.
Update: Here is sample code.
import mechanize
import cookielib
from BeautifulSoup import BeautifulSoup
def changerankscrapper(site):
"""
Takes a site url, scrapes that site's Alexa page,
and returns the site's global Alexa rank and the
change in that rank over the past three months.
"""
#Create Alexa URL
url = "http://www.alexa.com/siteinfo/" + site
#Get HTML
cj = cookielib.CookieJar()
mech = mechanize.OpenerFactory().build_opener(mechanize.HTTPCookieProcessor(cj))
request = mechanize.Request(url)
response = mech.open(request)
html = response.read()
#Parse HTML with BeautifulSoup
soup = BeautifulSoup(html)
globalrank = int(soup.find("strong", { "class" : "metricsUrl font-big2 valign" }).text)
changerank = int(soup.find("span", { "class" : "change-wrapper change-up" }).text)
return globalrank, changerank
#Example
site = "http://stackoverflow.com/"
globalrank, changerank = changerankscrapper(site)
print(globalrank)
print(changerank)