python web-scraping wikipedia pywikibot

Get all names from a Wikipedia category page?


I am trying to extract all names from this page:

https://en.wikipedia.org/wiki/Category:Masculine_given_names

(I want all the names listed on this page and its continuation pages, and also those in the subcategories listed at the top, such as Afghan masculine given names, African masculine given names, etc.)

I tried this with the following code:

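import pywikibot
from pywikibot import pagegenerators

# Uses the default site configured in user-config.py
site = pywikibot.Site()
cat = pywikibot.Category(site, 'Category:Masculine_given_names')
# Yields only the pages directly in this category (no subcategories)
gen = pagegenerators.CategorizedPageGenerator(cat)
for idx, page in enumerate(gen):
    text = page.text  # full wikitext of the page
    print(idx)
    print(text)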

This generally works and gives me at least the detail page of each name. But how can I get all the names, from all the continuation pages of this category and also from its subcategories?


Solution

  • How to find subcategories and subpages on wikipedia using pywikibot?

    This has already been answered here using Category methods, but you can also use the CategorizedPageGenerator function from pagegenerators. All you need to do is set the recurse option:

      >>> gen = pagegenerators.CategorizedPageGenerator(cat, recurse=True)
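
As a rough sketch, the recursive generator could be turned into a plain list of names along these lines. The helper names (clean_name, names_from_titles, iter_category_titles) are hypothetical, and treating a trailing parenthetical such as " (given name)" as a mere disambiguator is an assumption about the page titles, not something the category guarantees. pywikibot is imported inside the function so the title helpers can be tried without a configured bot.

```python
import re

# Matches a trailing parenthetical such as " (given name)" (assumed convention).
_DISAMBIGUATOR = re.compile(r"\s*\([^()]*\)\s*$")


def clean_name(title):
    """Strip a trailing parenthetical disambiguator from a page title."""
    return _DISAMBIGUATOR.sub("", title).strip()


def names_from_titles(titles):
    """Return the sorted, de-duplicated names derived from page titles."""
    cleaned = (clean_name(title) for title in titles)
    return sorted({name for name in cleaned if name})


def iter_category_titles(category_name, recurse=True):
    """Yield the titles of pages in a category and, optionally, its subcategories."""
    # Imported lazily so the helpers above work without pywikibot installed.
    import pywikibot
    from pywikibot import pagegenerators

    site = pywikibot.Site("en", "wikipedia")
    cat = pywikibot.Category(site, category_name)
    # recurse=True descends into all subcategories; an int limits the depth.
    for page in pagegenerators.CategorizedPageGenerator(cat, recurse=recurse):
        yield page.title()
```

Something like names_from_titles(iter_category_titles("Category:Masculine given names")) would then walk the whole tree. The same page can appear under several subcategories, which is why the names are collected in a set, and pages that are not names (lists, for instance) would still need to be filtered by hand.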
    

    Refer to the documentation for details. You can also handle pagegenerators options within your script, in the way described in this example, and call your script with the -catr option:

      pwb.py <yourscript> -catr:Masculine_given_names
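
For the -catr route, a script might look roughly like the sketch below. It assumes the usual pywikibot script pattern built on pagegenerators.GeneratorFactory; collect_titles is a hypothetical helper added here for illustration, and the error message text is made up.

```python
def collect_titles(pages):
    """Return the sorted, unique titles of an iterable of page-like objects."""
    return sorted({page.title() for page in pages})


def main(*args):
    """Build a page generator from options such as -catr:<category> and print titles."""
    # Imported lazily so collect_titles() can be used without pywikibot installed.
    import pywikibot
    from pywikibot import pagegenerators

    gen_factory = pagegenerators.GeneratorFactory()
    # handle_args() consumes global options (e.g. -lang) and returns the rest.
    for arg in pywikibot.handle_args(args):
        # Page-generator options such as -catr are consumed by the factory.
        gen_factory.handle_arg(arg)

    gen = gen_factory.getCombinedGenerator()
    if gen is None:
        # No generator option was passed on the command line.
        pywikibot.error("No generator given, e.g. -catr:Masculine_given_names")
        return
    for title in collect_titles(gen):
        print(title)
```

To run it through pwb.py, the file would also need the usual call to main() under an `if __name__ == "__main__":` guard at the bottom. The advantage of this design is that the same script accepts any other pagegenerators option (for example -cat for a non-recursive listing) without code changes.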