I have a Wikipedia dump and I want to filter out the articles that belong to a certain category (e.g., 'Religion'). I know that each article has a list of categories at the bottom of the text field, but the problem is that only the immediate categories are listed (e.g., an article may belong to a subcategory of 'Religion' without 'Religion' itself being listed as one of its categories at the bottom).
My approach
My first approach was to use the categorylinks table: given a high-level category such as 'Religion', traverse the category graph, collecting each category or page that lists 'Religion' (or a category already collected) at the bottom of its page.
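A minimal sketch of that traversal, assuming the categorylinks rows have already been extracted into simplified (member title, parent category, member type) tuples; the real table stores page IDs rather than titles, so a join against the page table would be needed first. The function names and the sample rows are hypothetical.

```python
from collections import defaultdict, deque

def build_children(rows):
    """Index (member, parent_category, member_type) rows by parent category."""
    children = defaultdict(list)
    for member, parent, member_type in rows:
        children[parent].append((member, member_type))
    return children

def collect_articles(children, root):
    """Breadth-first walk from root; returns every article reachable through subcategories."""
    seen = {root}          # categories already queued, guards against cycles
    articles = set()
    queue = deque([root])
    while queue:
        category = queue.popleft()
        for member, member_type in children.get(category, []):
            if member_type == "page":
                articles.add(member)
            elif member_type == "subcat" and member not in seen:
                seen.add(member)
                queue.append(member)
    return articles

# Hypothetical sample rows, for illustration only.
rows = [
    ("Jesus", "Religion", "page"),
    ("Religion and society", "Religion", "subcat"),
    ("Religion and science", "Religion and society", "subcat"),
    ("Religion in science fiction", "Religion and science", "subcat"),
    ("Jedi", "Religion in science fiction", "subcat"),
    ("Return of the Jedi", "Jedi", "page"),
]
religion_articles = collect_articles(build_children(rows), "Religion")
print(sorted(religion_articles))  # ['Jesus', 'Return of the Jedi']
```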
The problem
This worked well until I stumbled upon this scenario:
Religion -> Religion and society -> Religion and science -> Religion in science fiction -> Jedi -> categories and articles about Star Wars.
So, by my algorithm, the article 'Return of the Jedi' belongs to the high-level category 'Religion' (and, well, that is true)... but I don't really want to filter out 'Return of the Jedi' because it does belong more to other categories (I guess this is the main problem: how to discern the weight of the different categories of a given article).
Other solutions
One possible solution is, for each category or article I find while traversing the graph, to check that it does not also belong to any other category I have already visited. The problem is that this doesn't quite work, because a second-level category directly below, e.g., 'Religion' could have another high-level parent distinct from 'Religion'.
Another possible solution is to cut the traversal off at a certain depth, e.g., 3. This would fix the previous example, but then the question is which depth to choose (a heuristic?). It is also not an optimal solution: some articles that do belong to the specified high-level category will be missed. Using PetScan and cutting the traversal at depth 2 gives around 12000 articles: very few articles, and still some 'false positives' like 'Bertrand Russell'.
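As a rough sketch of the depth cut-off, the traversal can record the depth of the shallowest category in which each article is found, so the same number can double as a crude weight. The `children` mapping, the function name and the sample data are hypothetical, and depth counts edges from the root category.

```python
from collections import deque

def articles_within_depth(children, root, max_depth):
    """Map each article to the depth of the shallowest category that contains it."""
    depth_of = {root: 0}   # category -> number of edges from root
    found = {}
    queue = deque([root])
    while queue:
        category = queue.popleft()
        depth = depth_of[category]
        for member, member_type in children.get(category, []):
            if member_type == "page":
                # BFS visits categories in order of depth, so the first hit is the shallowest.
                found.setdefault(member, depth)
            elif member_type == "subcat" and member not in depth_of and depth < max_depth:
                depth_of[member] = depth + 1
                queue.append(member)
    return found

# Hypothetical sample data: category -> list of (member, member_type).
children = {
    "Religion": [("Jesus", "page"), ("Religion and society", "subcat")],
    "Religion and society": [("Religion and science", "subcat")],
    "Religion and science": [("Religion in science fiction", "subcat")],
    "Religion in science fiction": [("Jedi", "subcat")],
    "Jedi": [("Return of the Jedi", "page")],
}
shallow = articles_within_depth(children, "Religion", max_depth=3)
deep = articles_within_depth(children, "Religion", max_depth=4)
```

With this toy data, `shallow` contains only 'Jesus', while `deep` also picks up 'Return of the Jedi' at depth 4. The cut-off value itself remains a heuristic.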
Any ideas?
Edit: using the Wikipedia API doesn't seem like an option (I need to filter out many categories).
I think you need to go back to your initial requirement and clarify it. In your question, you've started by stating "I want to filter out the articles that belong to a certain category". You're achieving this result already, but you're not satisfied that some specific article (i.e. Return of the Jedi) is being returned, even though it fits your stated criteria.
You've correctly identified the source of the problem with your wording of "does belong more to other categories", but this is expressed as a very arbitrary rule, and I think you'll need something less subjective to solve the problem.
In other words, "Return of the Jedi" is a member of the Religion category according to Wikipedia, so you'll need to clarify why you don't want it as a result before you can exclude it via some algorithm. If you can define the additional criteria, you can most likely refine your filter to exclude things you don't want. This might give you "find all articles with the category Religion excluding those with the category Films", for example.
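For illustration, once the article set for each high-level category has been collected by whatever traversal you use, an include/exclude rule of that kind reduces to a set difference. This is only a sketch; the function name and the sample sets are made up.

```python
def include_exclude(articles_by_root, include, exclude):
    """Articles under the `include` category, minus those under any `exclude` category."""
    kept = set(articles_by_root[include])
    for root in exclude:
        kept -= articles_by_root[root]
    return kept

# Hypothetical, pre-collected article sets per high-level category.
articles_by_root = {
    "Religion": {"Jesus", "Return of the Jedi"},
    "Films": {"Return of the Jedi", "Casablanca"},
}
result = include_exclude(articles_by_root, "Religion", ["Films"])
print(result)  # {'Jesus'}
```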
If you can rephrase your English-language requirement in a more precise way, I'm sure it'll lead toward a solution.