json, postgresql, database-indexes

Why is Postgres not using my index on a JSON column?


I am trying to figure out why Postgres stops using my index on a JSONB field when I have a lot of data. Up to 1 million rows it works fine and uses the index, but when I run the query on a table with 10 million rows, it switches to a full table scan, which of course is very slow.

I am on Postgres 17 (Docker image postgres:17.1-alpine3.20).

My table is defined as

CREATE TABLE auction_jsonb_indexed (
  id SERIAL NOT NULL, 
  title VARCHAR(255) NOT NULL, 
  start_date TIMESTAMP(0) WITHOUT TIME ZONE NOT NULL,
  end_date TIMESTAMP(0) WITHOUT TIME ZONE NOT NULL, 
  current_price INT DEFAULT NULL, 
  item JSONB NOT NULL,
  PRIMARY KEY(id)
);
CREATE INDEX jsonb_author ON auction_jsonb_indexed ((item->>'author'));
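(For context: an expression index like this is only considered when the query uses the exact same expression it was built on. A minimal sketch against the table above:

-- Matches the indexed expression (item->>'author'), so the index is usable:
SELECT COUNT(*) FROM auction_jsonb_indexed WHERE item->>'author' = 'Author 1';

-- Uses a different expression (-> returns jsonb, not text), so this
-- cannot use the jsonb_author index:
SELECT COUNT(*) FROM auction_jsonb_indexed WHERE item->'author' = '"Author 1"';

That part is fine in my queries below; the expression matches.)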

I then generate test data with

TRUNCATE TABLE auction_jsonb_indexed;
INSERT INTO auction_jsonb_indexed (title, start_date, end_date, item)
SELECT
    'Title ' || cnt,
    '2024-01-01'::timestamp + (INTERVAL '1 minute' * (cnt-1)),
    '2024-01-01'::timestamp + (INTERVAL '1 minute' * (cnt+6)),
    JSON_BUILD_OBJECT(
            'type', 'book',
            'genre', 'Genre '||cnt,
            'author', 'Author ' || cnt % 1000,
            'title', 'Title ' || cnt,
            'description', 'Description ' || cnt,
            'startDate', '2024-01-01'::timestamp + (INTERVAL '1 minute' * (cnt-1)),
            'endDate', '2024-01-01'::timestamp + (INTERVAL '1 minute' * (cnt+6))
    )::json
FROM generate_series(1, 10000000) AS t(cnt);

Now when I query, I end up with a full scan.

EXPLAIN ANALYZE SELECT COUNT(*) FROM auction_jsonb_indexed WHERE item->>'author' = 'Author 1';

Finalize Aggregate  (cost=420329.21..420329.22 rows=1 width=8) (actual time=4248.623..4253.143 rows=1 loops=1)
  ->  Gather  (cost=420329.00..420329.21 rows=2 width=8) (actual time=4248.436..4253.131 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=419329.00..419329.01 rows=1 width=8) (actual time=4237.667..4237.668 rows=1 loops=3)
              ->  Parallel Seq Scan on auction_jsonb_indexed  (cost=0.00..418285.94 rows=417222 width=0) (actual time=4.612..4228.083 rows=333333 loops=3)
                    Filter: ((item ->> 'author'::text) = 'Author 1'::text)
                    Rows Removed by Filter: 3000000
Planning Time: 0.108 ms
JIT:
  Functions: 14
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 0.837 ms (Deform 0.195 ms), Inlining 0.000 ms, Optimization 1.722 ms, Emission 12.087 ms, Total 14.646 ms
Execution Time: 4253.505 ms

If I build the table with 1 million rows instead of 10 million, the EXPLAIN shows that the right plan is chosen:

Aggregate  (cost=3453.72..3453.73 rows=1 width=8) (actual time=0.921..0.922 rows=1 loops=1)
  ->  Bitmap Heap Scan on auction_jsonb_indexed  (cost=16.12..3451.24 rows=993 width=0) (actual time=0.698..0.865 rows=1000 loops=1)
        Recheck Cond: ((item ->> 'author'::text) = 'Author 1'::text)
        Heap Blocks: exact=1000
        ->  Bitmap Index Scan on jsonb_author  (cost=0.00..15.87 rows=993 width=0) (actual time=0.446..0.446 rows=1000 loops=1)
              Index Cond: ((item ->> 'author'::text) = 'Author 1'::text)
Planning Time: 0.115 ms
Execution Time: 1.045 ms

If I hack around with the configuration, I can get Postgres to use the index:

SET enable_seqscan = OFF;
SET enable_bitmapscan = OFF;

But of course that is not the solution - it just shows the index would be usable for the query.

I tried increasing the statistics target, but that did not help:

ALTER TABLE auction_jsonb_indexed ALTER COLUMN item SET STATISTICS 1000;
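(For what it's worth, the statistics Postgres gathers for the indexed expression can be inspected directly; stats for an expression index appear in the pg_stats view under the index's name. A sketch:

-- Refresh statistics, then inspect what the planner knows about item->>'author':
ANALYZE auction_jsonb_indexed;
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'jsonb_author';

In my case the row estimates looked reasonable, so the estimates themselves did not seem to be the problem.)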

I made a db-fiddle, but it runs out of memory.

What do I need to change so postgres understands it can use the index?


Solution

  • Thanks to the input from @FrankHeikens, I found the random_page_cost parameter. Indeed, lowering that value (to tell Postgres it is not that expensive to access random parts of the disk) makes Postgres use the index.

    I found that in my example on my system, for 10M rows, the value needs to be 1.15 or lower.

    ALTER TABLESPACE pg_default SET (random_page_cost=1.1);
    
    EXPLAIN ANALYZE SELECT COUNT(*) FROM auction_jsonb_indexed WHERE item->>'author' = 'Author 1';
    
    Aggregate  (cost=10906.93..10906.94 rows=1 width=8) (actual time=4.152..4.154 rows=1 loops=1)
      ->  Bitmap Heap Scan on auction_jsonb_indexed  (cost=87.60..10882.00 rows=9970 width=0) (actual time=2.803..3.858 rows=10000 loops=1)
            Recheck Cond: ((item ->> 'author'::text) = 'Author 1'::text)
            Heap Blocks: exact=10000
            ->  Bitmap Index Scan on jsonb_author  (cost=0.00..85.11 rows=9970 width=0) (actual time=1.465..1.466 rows=10000 loops=1)
                  Index Cond: ((item ->> 'author'::text) = 'Author 1'::text)
    Planning Time: 0.076 ms
    Execution Time: 4.271 ms
    

    (Btw: note that the number of distinct values in the indexed field also changes the plan. If I generate the data with cnt % 10 instead of cnt % 1000, a slightly different plan is used. It still benefits from the index, but is considerably slower.)
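    (If changing the setting for the whole tablespace is too broad, the same parameter can also be set per session or per database. A sketch, assuming a database named mydb, which is a placeholder:

    -- Per session, handy for experimenting:
    SET random_page_cost = 1.1;

    -- Per database, persisted across sessions:
    ALTER DATABASE mydb SET random_page_cost = 1.1;

    On SSD-backed storage a value near 1.1 is a common choice, since random reads cost about the same as sequential ones there.)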