postgresqlindexingquery-optimization

Fastest way to find SQL intersection between 2 subsections of a huge dataset?


I have a gigantic indexed (104 million row) table of docs, pages and words:

CREATE TABLE my_table (
    doc_id integer,
    page_id integer,
    word_id integer
);

CREATE INDEX my_table_word ON my_table (word_id);
CREATE INDEX my_table_doc ON my_table (doc_id);
CREATE INDEX my_table_page my_table (page_id);

I want to find pages within the same documents that have both word A and word B. My current attempts are as follows:

Attempt 1 - aggregate things:

SELECT doc_id, page_id
FROM my_table
WHERE word_id in (123, 456)
group by 1,2 
having count(distinct word_id) = 2

-- ~39k row result, took 20 seconds

Attempt 2) with CTEs, marginally faster

with foo as (
    select doc_id, page_id
    from my_table
    where word_id = 123 -- foo -- 44k rows
),

bar as (
    select doc_id, page_id
    from my_table
    where word_id = 456 -- bar -- 439k rows
)

select f.doc_id, f.page_id
from foo f
inner join bar b on f.doc_id = b.doc_id and f.page_id = b.page_id

-- same results, takes 15 seconds

Attempt 3) - doing an INTERSECT between the two CTEs is exactly the same 15 seconds, probably same query plan.

Is there a faster way to do this? I'm hoping to get this down to < 1 second for a web app with somewhat impatient users.


Solution

  • Try a self join:

    SELECT DISTINCT doc_id, page_id
    FROM my_table a
    JOIN my_table b ON b.page_id = a.page_id
      AND b.word_id = 456
    WHERE a.word_id = 123
    

    With an index

    CREATE INDEX ON my_table (word_id, page_id, doc_id);
    

    which is a covering index, allowing an index-only query.