I have a huge table in a PostgreSQL database, called big_table.
The key is composed of multiple columns: a date column called date_id, and multiple string columns (for simplicity, let's say there is only one string column, called str_id).
The query I want to execute is:
SELECT count(DISTINCT a.str_id)
FROM big_table a
JOIN big_table b
  ON a.date_id = (b.date_id + interval '1 day')
 AND a.str_id = b.str_id;
The (expensive) explain plan I get is:
Aggregate  (cost=247348767.01..247348767.04 rows=1 width=8)
  ->  Merge Join  (cost=67997554.72..236925087.00 rows=4169472007 width=12)
        Merge Cond: ((a.date_id = ((b.date_id + '1 day'::interval))) AND ((a.str_id)::text = (b.str_id)::text))
        ->  Sort  (cost=33998777.36..34428507.76 rows=171892160 width=16)
              Sort Key: a.date_id, a.str_id
              ->  Seq Scan on big_table a  (cost=0.00..8765967.80 rows=171892160 width=16)
        ->  Materialize  (cost=33998777.36..34858238.16 rows=171892160 width=16)
              ->  Sort  (cost=33998777.36..34428507.76 rows=171892160 width=16)
                    Sort Key: ((b.date_id + '1 day'::interval)), b.str_id
                    ->  Seq Scan on big_table b  (cost=0.00..8765967.80 rows=171892160 width=16)
JIT:
  Functions: 16
  Options: Inlining true, Optimization true, Expressions true, Deforming true
I've already created two indexes (with no changes to the plan):
CREATE INDEX index_1 ON big_table(date_id, str_id);
CREATE INDEX index_2 ON big_table((date_id + interval '1 day'), str_id);
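(I'm comparing plans with a plain EXPLAIN, which only plans the query without running it; EXPLAIN (ANALYZE, BUFFERS) would also execute it, which is expensive on a table this size.)

EXPLAIN
SELECT count(DISTINCT a.str_id)
FROM big_table a
JOIN big_table b
  ON a.date_id = (b.date_id + interval '1 day')
 AND a.str_id = b.str_id;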
How can I optimize this query?
Creating one of the two indexes I wrote above, together with one new index covering str_id and both date_id and date_id + interval '1 day', solved the issue: the planner understood what I wanted to do and actually started using the new indexes.
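A sketch of what the new index can look like (the combination of columns is what the fix above describes; the column order shown here and the name index_3 are illustrative):

CREATE INDEX index_3 ON big_table (str_id, date_id, (date_id + interval '1 day'));

Covering both the raw date_id and the shifted expression in one index is presumably what lets the planner avoid the two seq-scan-plus-sort branches shown in the original plan.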