I have a gigantic indexed (104 million row) table of docs, pages and words:
CREATE TABLE my_table (
doc_id integer,
page_id integer,
word_id integer
);
CREATE INDEX my_table_word ON my_table (word_id);
CREATE INDEX my_table_doc ON my_table (doc_id);
CREATE INDEX my_table_page my_table (page_id);
I want to find pages within the same documents that have both word A and word B. My current attempts are as follows:
Attempt 1 - aggregate things:
SELECT doc_id, page_id
FROM my_table
WHERE word_id in (123, 456)
group by 1,2
having count(distinct word_id) = 2
-- ~39k row result, took 20 seconds
Attempt 2) with CTEs, marginally faster
with foo as (
select doc_id, page_id
from my_table
where word_id = 123 -- foo -- 44k rows
),
bar as (
select doc_id, page_id
from my_table
where word_id = 456 -- bar -- 439k rows
)
select f.doc_id, f.page_id
from foo f
inner join bar b on f.doc_id = b.doc_id and f.page_id = b.page_id
-- same results, takes 15 seconds
Attempt 3) - doing an INTERSECT
between the two CTEs is exactly the same 15 seconds, probably same query plan.
Is there a faster way to do this? I'm hoping to get this down to < 1 second for a web app with somewhat impatient users.
Try a self join:
SELECT DISTINCT doc_id, page_id
FROM my_table a
JOIN my_table b ON b.page_id = a.page_id
AND b.word_id = 456
WHERE a.word_id = 123
With an index
CREATE INDEX ON my_table (word_id, page_id, doc_id);
which is a covering index, allowing an index-only query.