python, python-3.x, pandas, wikipedia, mediawiki-api

Generate DF from attributes of tags in list


I have a list of revisions from a Wikipedia article that I queried like this:

import urllib.request
import re

def getRevisions(wikititle):
    url = "https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles="+wikititle 
    revisions = []                                        # list of all accumulated revisions
    cont_param = ''                                       # continuation parameter for the next request

    while True:
        response = urllib.request.urlopen(url + cont_param).read()   # web request

        response = str(response)

        revisions += re.findall('<rev [^>]*>', response)  # add all revisions from the current request

        cont = re.search('<continue rvcontinue="([^"]+)"', response)
        if not cont:                                      # stop when no 'continue' element is returned
            break

        cont_param = "&rvcontinue=" + cont.group(1)       # revision ID from which to start the next request
    return revisions

This results in a list where each element is a <rev> tag as a string:

['<rev revid="343143654" parentid="6546465" minor="" user="name" timestamp="2021-12-12T08:26:38Z" comment="abc" />',...]

How can I generate a DataFrame from this list?


Solution

  • An "easy" way, without using regex, is to split each string and parse the attributes:

    import pandas as pd

    for rev_string in revisions:
        rev_dict = {}

        # Skip the first and last tokens, which belong to the tag itself.
        attributes = rev_string.split(' ')[1:-1]

        # Split on the first '=' only, and strip the surrounding quotes from the value.
        for attribute in attributes:
            key, value = attribute.split("=", 1)
            rev_dict[key] = value.strip('"')

        df = pd.DataFrame([rev_dict])
    

    This sample creates one DataFrame per revision. If you would like to gather multiple revisions into one DataFrame, collect the attribute dictionaries first (attributes may vary between revisions, so some values can be missing) and convert the whole collection to a DataFrame at the end.
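
    A minimal sketch of that single-DataFrame variant, using made-up sample strings in the same shape as the scraped revisions (note that splitting on spaces assumes attribute values contain no spaces, so a comment like "fix typo" would break it). Passing a list of dicts to pandas fills attributes missing from a revision with NaN:

    ```python
    import pandas as pd

    # Hypothetical sample data shaped like the scraped <rev> strings.
    revisions = [
        '<rev revid="343143654" parentid="6546465" minor="" user="name" '
        'timestamp="2021-12-12T08:26:38Z" comment="abc" />',
        '<rev revid="343143655" parentid="343143654" user="other" '
        'timestamp="2021-12-13T09:00:00Z" comment="def" />',
    ]

    rows = []
    for rev_string in revisions:
        # Drop the leading '<rev' and trailing '/>' tokens.
        attributes = rev_string.split(' ')[1:-1]
        # Split each attribute on the first '=' and strip the quotes.
        rows.append({key: value.strip('"')
                     for key, value in (a.split('=', 1) for a in attributes)})

    # One row per revision; attributes absent from a revision become NaN.
    df = pd.DataFrame(rows)
    ```

    Here the second revision has no minor attribute, so its minor cell is NaN while the first revision's is an empty string.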