First of all I'm sorry for the overly vague title, however I'm unfamiliar with the proper terminology for a problem like this.
I'm attempting to retrieve a list of page titles from Wiktionary (Wikimedia wiki-based dictionary) where the page must be categorized under English_lemmas
, but must not be categorized under English_phrases
, English_slang
, and English_%_forms
(where % is the wildcard).
The two necessary tables are page
which contains page information (page_id
, page_title
), and categorylinks
which contains the categories that each page is categorized under.
The relevant structure of page
is like so:
+---------+------------------+
| page_id | page_title |
+---------+------------------+
| 0 | race |
| 1 | racing |
| 2 | that's all folks |
| 3 | fire |
| 4 | psychédélique |
+---------+------------------+
and the relevant structure of categorylinks
is like so: (I've added line breaks for easier reading)
+---------+-------------------------+
| cl_from | cl_to |
+---------+-------------------------+
| 0 | English_lemmas |
| 0 | English_verbs |
| 1 | English_lemmas |
| 1 | English_verbs |
| 1 | English_non-lemma_forms |
| 2 | English_lemmas |
| 2 | English_phrases |
| 3 | English_lemmas |
| 3 | English_nouns |
| 4 | French_lemmas |
| 4 | French_adjectives |
+---------+-------------------------+
where categorylinks.cl_from
is a direct reference to page.page_id
and categorylinks.cl_to
is the title of the category.
I need to select race
and fire
as they are both categorized under English_lemmas
, but not racing
or that's all folks
as in addition to both being categorized under English_lemmas
they are also categorized under English_%_forms
and English_phrases
respectively, and not psychédélique
as it is not categorized under English_lemmas
.
Therefore the ideal result would be
+---------+------------------+
| page_id | page_title |
+---------+------------------+
| 0 | race |
| 3 | fire |
+---------+------------------+
Is this something that is feasible to achieve efficiently? And if so how can I achieve this?
I have a fairly basic understanding of SQL (basic SELECT, UPDATE, etc statements), so something like this is far beyond my comprehension.
If I understood your requirements correctly:
select *
from page
where page_id not in (select cl_from from categorylinks
where cl_to like 'English_%_forms'
or cl_to like 'English_phrases')
and page_id in (select cl_from from categorylinks
where cl_to like 'English_lemmas')