
Wikipedia Pageviews of Target Page only


I have a dataframe of 500 endangered species that I want to gather digital data on. I am looping through their scientific names, as these are the most consistent naming convention, and retrieving the pageview count of each species over a span of 5 years.

However, when I query with, for example, Panthera tigris, I receive quite low view counts (300 to 4,000+), whereas when I query with Tiger the counts go over 700,000 views. I understand that the Wikipedia page for the tiger is titled Tiger, not Panthera tigris, so I assume I need to include redirects as a parameter. However, I do not know where to include redirects in the API request. And will including redirects as a parameter return the view count for the target page only, and not for the redirect page?

I cannot query with common names only, because it is impossible (without manual inspection, which I don't have time to do for 500+ species) to know which common name a species' page is titled with, or whether the page is titled with the scientific name.

An example of my request. My real User-Agent is replaced with a placeholder in this example.

import requests

def WikiPageView(name):
    # Calling monthly page views of each species
    address = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/user/" + name + "/monthly/2015010100/2020123100"

    # Build all headers in one dict; assigning `headers` a second time
    # (as in my first version) silently dropped the Accept header
    headers = {
        "Accept": "application/json",
        # Personal username for identification for the Wikipedia API
        "User-Agent": "username/0.00 (name@email.com)",
    }

    resp = requests.get(address, headers=headers)
    details = resp.json()

    return details

# Loop over the pd series of species names and call the wikipage function
for name in species['scientific_name']:
    # Spaces are replaced with an underscore for the wikipedia API
    name = name.replace(" ", "_")
    result = WikiPageView(name)
      
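A minimal sketch of two small helpers that could sit around the request above, in case it is useful. The names `build_pageviews_url` and `total_views` are hypothetical, not part of any library. The first percent-encodes the title so that names containing characters such as `/` or `'` still form a valid URL path; the second assumes a successful response carries an `items` list whose entries each have a `views` field, and treats anything else as zero views.

```python
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def build_pageviews_url(title, start="2015010100", end="2020123100"):
    """Build the per-article URL for one title (hypothetical helper)."""
    # Spaces become underscores, then anything unsafe in a URL path is
    # percent-encoded; safe="" makes sure "/" inside a title is encoded too
    encoded = quote(title.replace(" ", "_"), safe="")
    return f"{BASE}/en.wikipedia.org/all-access/user/{encoded}/monthly/{start}/{end}"


def total_views(details):
    """Sum the monthly counts in one decoded response (hypothetical helper)."""
    # An error body has no "items" key, so it contributes zero views
    return sum(item["views"] for item in details.get("items", []))
```

With these, the loop could store `total_views(...)` per species in a dict keyed by scientific name instead of overwriting `result` on every iteration.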

Searching for the monthly pageviews on the interactive Pageviews Analysis site gives some inconsistencies that I can't figure out either. I have enabled the include-redirects option in the settings and in the search query, yet searching with only the scientific name gives a much lower pageview count than searching with the common name.

https://pageviews.wmcloud.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&redirects=1&start=2016-01&end=2020-12&pages=Panthera_tigris

When I search for Tiger, there is a peak of 700,000 views in Mar 2020.

https://pageviews.wmcloud.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&redirects=1&start=2016-01&end=2020-12&pages=Tiger

These should show the same result, no? After all, Panthera tigris redirects to Tiger. Most significantly, however, when you search the two terms together, Panthera tigris suddenly returns nearly the same number of pageviews and peaks as Tiger, which it did not do on its own.

https://pageviews.wmcloud.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&redirects=1&start=2016-01&end=2020-12&pages=Tiger%7CPanthera_tigris

Does anyone understand the reason for these inconsistencies, or know how to include redirects in the API request?


Solution

  • The best way I found to solve this was to use a Python wrapper for Wikipedia's API and look up each page by the species' scientific name first, retrieving its URL. The URL returned contains the actual title of each species' page, which can then be used in the Pageviews API.

    import wikipediaapi
    # only english language pages
    # (newer releases of the wrapper also seem to expect a user agent string
    # as the first argument; check the docs of your installed version)
    api = wikipediaapi.Wikipedia('en')

    wikiurls = []
    # loop through the unique species names
    for name in species['scientific_name'].unique():
        # Spaces to be replaced with underscore for Wiki's API
        name = name.replace(" ", "_")
        p = api.page(name)

        # Store the url retrieved and name we called with
        data = {'url': p.fullurl, 'scientific_name': name}
        # Append dictionary to list
        wikiurls.append(data)

    for entry in wikiurls:
        # Strip the URL prefix to get the page title
        # (str.removeprefix needs Python 3.9+)
        title = entry['url'].removeprefix("https://en.wikipedia.org/wiki/")
        result = WikiPageView(title)
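As an alternative to the wrapper, here is a minimal sketch that resolves the redirects directly through the MediaWiki Action API, which accepts a `redirects` parameter on `action=query` and reports each hop under `query.redirects`. As far as I can tell, the per-article pageviews endpoint itself has no such parameter and counts views per requested title, which is why the resolution has to happen first. The helper names `parse_targets` and `resolve_redirects` are hypothetical, the User-Agent is a placeholder, and the batch size of 50 assumes the usual per-request title limit for ordinary users. It uses only the standard library.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"
# Placeholder User-Agent, same format as in the question
HEADERS = {"User-Agent": "username/0.00 (name@email.com)"}


def parse_targets(titles, query):
    """Map each requested title to its final title, given the "query" object."""
    # "normalized" records e.g. underscores turned into spaces,
    # "redirects" records each redirect hop from source to target
    normalized = {n["from"]: n["to"] for n in query.get("normalized", [])}
    redirects = {r["from"]: r["to"] for r in query.get("redirects", [])}
    targets = {}
    for title in titles:
        current = normalized.get(title, title)
        seen = set()
        # Follow the recorded hops; "seen" guards against a redirect loop
        while current in redirects and current not in seen:
            seen.add(current)
            current = redirects[current]
        targets[title] = current
    return targets


def resolve_redirects(titles, batch_size=50):
    """Resolve titles through the Action API, batch_size titles per request."""
    titles = list(titles)
    targets = {}
    for i in range(0, len(titles), batch_size):
        batch = titles[i:i + batch_size]
        params = {
            "action": "query",
            "format": "json",
            "redirects": 1,
            "titles": "|".join(batch),
        }
        url = API_URL + "?" + urlencode(params)
        with urlopen(Request(url, headers=HEADERS), timeout=30) as resp:
            query = json.load(resp).get("query", {})
        targets.update(parse_targets(batch, query))
    return targets
```

Calling `resolve_redirects(species['scientific_name'].unique())` should then give a dict from each scientific name to the title of the page it lands on, which can be passed to `WikiPageView` after replacing spaces with underscores. Batching keeps this to roughly ten requests for 500 species.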