Python: retrieve text from multiple random Wikipedia pages


I am using Python 2.7 with the wikipedia package to retrieve the text of multiple random Wikipedia pages, as explained in the docs.

I use the following code:

def get_random_pages_summary(pages = 0):
    import wikipedia
    page_names = [wikipedia.random(1) for i in range(pages)]
    return [[p,wikipedia.page(p).summary] for p in page_names]

text =  get_random_pages_summary(50)

and get the following error:

File "/home/user/.local/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 393, in __load
    raise DisambiguationError(getattr(self, 'title', page['title']), may_refer_to)
wikipedia.exceptions.DisambiguationError: "Priuralsky" may refer to:
Priuralsky District
Priuralsky (rural locality)

What I am trying to do is get the text from random Wikipedia pages, and I need it to be plain text, without any markup.

I assume the problem is that a random title can match more than one Wikipedia page. When I use the code to get a single Wikipedia page, it works well.

Thanks


Solution

  • According to the documentation (http://wikipedia.readthedocs.io/en/latest/quickstart.html), a DisambiguationError carries the list of candidate page titles in its options attribute, so you need to look up each candidate again.

    try:
        wikipedia.summary("Priuralsky")
    except wikipedia.exceptions.DisambiguationError as e:
        for page_name in e.options:
            print(page_name)
            print(wikipedia.page(page_name).summary)
    

    You can improve your code like this.

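    import wikipedia

    def get_page_summaries(page_name):
        """Return [title, summary] pairs for page_name or, if it is ambiguous, for each option."""
        try:
            return [[page_name, wikipedia.page(page_name).summary]]
        except wikipedia.exceptions.DisambiguationError as e:
            summaries = []
            for option in e.options:
                try:
                    summaries.append([option, wikipedia.page(option).summary])
                except (wikipedia.exceptions.DisambiguationError,
                        wikipedia.exceptions.PageError):
                    # An option can itself be ambiguous or missing; skip it.
                    continue
            return summaries

    def get_random_pages_summary(pages=0):
        ret = []
        page_names = [wikipedia.random(1) for i in range(pages)]
        for p in page_names:
            # An ambiguous title contributes one entry per resolvable option.
            ret.extend(get_page_summaries(p))
        return ret

    text = get_random_pages_summary(50)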
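
Note that expanding every disambiguation option can return more entries than you asked for, and the extra entries are no longer random. If you want at most `pages` summaries, one alternative is to skip an ambiguous title and draw another random one. The following is only a sketch under a few assumptions: the `max_attempts` cap (defaulting to three tries per requested page) and the `wiki` parameter are hypothetical additions, the latter only so that a stand-in module can be passed in for testing without network access.

```python
def get_random_pages_summary(pages=0, max_attempts=None, wiki=None):
    """Return up to `pages` [title, summary] pairs from random articles.

    Titles that raise DisambiguationError or PageError are skipped and
    replaced by a new random draw.
    """
    if wiki is None:
        # Imported lazily; `wiki` is a hypothetical hook for passing a stub.
        import wikipedia as wiki
    if max_attempts is None:
        # Hypothetical safety cap so the loop always terminates.
        max_attempts = pages * 3

    skip = (wiki.exceptions.DisambiguationError, wiki.exceptions.PageError)
    results = []
    attempts = 0
    while len(results) < pages and attempts < max_attempts:
        attempts += 1
        title = wiki.random(1)  # a single random title (a string)
        try:
            results.append([title, wiki.page(title).summary])
        except skip:
            # Ambiguous or missing page: draw another random title.
            continue
    return results
```

Calling `get_random_pages_summary(50)` then returns at most 50 pairs. It can return fewer if too many draws are skipped before the attempt cap is reached.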