databasepostgresqlindexingquery-optimization

Postgre join query not using expected index


I have a very simple two-table set in Postgre 16.0:

CREATE TABLE fk_table (
  fk_id INT PRIMARY KEY
, fk_name VARCHAR(50)
);
CREATE TABLE main_table (
  id INT PRIMARY KEY
, date TIMESTAMP
, fk INT REFERENCES fk_table(fk_id)
--, tens more columns
);

The main_table has a lot of data (some tens of columns, millions of rows), fk_table just a few values.

I have a query I'm trying to speed up:

SELECT MIN(date) FROM main_table
INNER JOIN fk_table ON fk = fk_id
WHERE fk_name = 'pending';

I would expect that adding an index by fk and date would speed things up, but it's not working:

CREATE INDEX idx__main_table__fk__date ON main_table(fk, date);

The EXPLAIN ANALYZE results:

QUERY PLAN
Finalize Aggregate  (cost=71379.22..71379.23 rows=1 width=8) (actual time=252.046..257.345 rows=1 loops=1)
  ->  Gather  (cost=71379.01..71379.22 rows=2 width=8) (actual time=251.859..257.338 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=70379.01..70379.02 rows=1 width=8) (actual time=192.993..192.995 rows=1 loops=3)
              ->  Hash Join  (cost=8.18..70376.35 rows=1063 width=8) (actual time=0.330..191.925 rows=16564 loops=3)
                    Hash Cond: (fk = fk_id)
                    ->  Parallel Seq Scan on main_table  (cost=0.00..68849.05 rows=574205 width=12) (actual time=0.235..166.496 rows=459364 loops=3)
                    ->  Hash  (cost=8.17..8.17 rows=1 width=4) (actual time=0.030..0.031 rows=1 loops=3)
                          Buckets: 1024  Batches: 1  Memory Usage: 9kB
                          ->  Index Scan using fk_table_fk_name_key on fk_table  (cost=0.15..8.17 rows=1 width=4) (actual time=0.025..0.025 rows=1 loops=3)
                                Index Cond: ((fk_name)::text = 'pending'::text)
Planning Time: 1.630 ms
Execution Time: 257.392 ms

However, making a nested query, the index is used:

SELECT MIN(date) FROM main_table
WHERE fk = (SELECT fk_id 
            FROM fk_table 
            WHERE fk_name = 'pending')

EXPLAIN ANALYZE RETURNS:

QUERY PLAN
Result  (cost=8.80..8.81 rows=1 width=8) (actual time=0.047..0.047 rows=1 loops=1)
  InitPlan 1 (returns $0)
    ->  Index Scan using fk_table_fk_name_key on fk_table  (cost=0.15..8.17 rows=1 width=4) (actual time=0.028..0.028 rows=1 loops=1)
          Index Cond: ((fk_name)::text = 'pending'::text)
  InitPlan 2 (returns $1)
    ->  Limit  (cost=0.43..0.63 rows=1 width=8) (actual time=0.045..0.045 rows=1 loops=1)
          ->  Index Only Scan using idx__main_table__fk__date on main_table  (cost=0.43..92122.34 rows=459364 width=8) (actual time=0.045..0.045 rows=1 loops=1)
                Index Cond: ((fk_name = $0) AND (date IS NOT NULL))
                Heap Fetches: 0
Planning Time: 0.240 ms
Execution Time: 0.070 ms

Why does the first query don't use the index? Is there something wrong in there, or in the index, or am I missing anything else?


Solution

  • In the first query, it cannot be certain only one row of fk_table will match fk_name = 'pending', so it can't use the ordered index scan method to stop early.

    In the second query, it can be certain as it would be an error if there was more than one row for the subquery, which opens up the benefit of the early-stopping index scan.

    Note that even if there were a unique index on fk_name, the planner still would not use that information.