sqlpostgresqlquery-optimization

PostgreSQL - Performance Issue - Self-join on big table to find record on consecutive days needs table materialize and it is slow


I have a huge table in a PostgresSQL db, called big_table.

The key is composed of multiple columns: a date column, called date_id, and multiple string columns (for simplicity let's say there is only 1 string column called str_id).

The query I want to execute is:

SELECT count (distinct a.str_id)
FROM
    big_table a JOIN big_table b
    ON a.date_id = (b.date_id + interval '1 day') AND a.str_id = b.str_id

The (expensive) explain plan I get is:

Aggregate  (cost=247348767.01..247348767.04 rows=1 width=8)
  ->  Merge Join  (cost=67997554.72..236925087.00 rows=4169472007 width=12)
        Merge Cond: ((a.date_id = ((b.date_id + '1 day'::interval))) AND ((a.str_id)::text = (b.str_id)::text))
        ->  Sort  (cost=33998777.36..34428507.76 rows=171892160 width=16)
              Sort Key: a.date_id, a.str_id
              ->  Seq Scan on big_table a  (cost=0.00..8765967.80 rows=171892160 width=16)
        ->  Materialize  (cost=33998777.36..34858238.16 rows=171892160 width=16)
              ->  Sort  (cost=33998777.36..34428507.76 rows=171892160 width=16)
                    Sort Key: ((b.date_id + '1 day'::interval)), b.str_id
                    ->  Seq Scan on big_table b  (cost=0.00..8765967.80 rows=171892160 width=16)
JIT:
  Functions: 16
  Options: Inlining true, Optimization true, Expressions true, Deforming true

I've already created two indexes (with no changes to the plan):

CREATE INDEX index_1 ON big_table(date_id, str_id);
CREATE INDEX index_2 ON big_table((date_id + interval '1 day'), str_id);

How can I optimize this query?


Solution

  • Creating one of the two index I wrote aboive and one new index with str_id and both date_id and date_id + interval solved the issue, the DBMS understood what I wanted to do and actually started using the new indexes.