Tags: nlp, mediawiki, wikipedia, wikimedia-dumps

Extract wikipedia articles belonging to a category from offline dumps


I have Wikipedia article dumps in several languages. I want to filter them down to the articles that belong to a given category (specifically Category:WikiProject_Biography).

I found several similar questions, for example:

  1. Wikipedia API to get articles belonging to a category
  2. How do I get all articles about people from Wikipedia?

However, I would like to do it entirely offline, using only the dumps, and for several languages.

I have also looked at the category and categorylinks tables (see MediaWiki_1.28.0_database_schema).


Solution

  • Load the page and categorylinks tables from the dump into a database, then run

    SELECT
        page_namespace,
        page_title
    FROM
        page
        JOIN categorylinks ON page_id = cl_from
    WHERE
        cl_to = 'WikiProject_Biography'
    ;
    

    to get the list of pages.