I am trying to figure out why PostgreSQL stops using my expression index on a JSONB field once the table gets large. Up to 1 million rows, the query uses the index. But on a table with 10 million rows, the planner switches to a full table scan, which of course is very slow.
I am on PostgreSQL 17 (Docker image postgres:17.1-alpine3.20).
My table is defined as:
CREATE TABLE auction_jsonb_indexed (
    id SERIAL NOT NULL,
    title VARCHAR(255) NOT NULL,
    start_date TIMESTAMP(0) WITHOUT TIME ZONE NOT NULL,
    end_date TIMESTAMP(0) WITHOUT TIME ZONE NOT NULL,
    current_price INT DEFAULT NULL,
    item JSONB NOT NULL,
    PRIMARY KEY(id)
);
CREATE INDEX jsonb_author ON auction_jsonb_indexed ((item->>'author'));
I then generate test data with:
TRUNCATE TABLE auction_jsonb_indexed;

INSERT INTO auction_jsonb_indexed (title, start_date, end_date, item)
SELECT
    'Title ' || cnt,
    '2024-01-01'::timestamp + (INTERVAL '1 minute' * (cnt-1)),
    '2024-01-01'::timestamp + (INTERVAL '1 minute' * (cnt+6)),
    JSON_BUILD_OBJECT(
        'type', 'book',
        'genre', 'Genre ' || cnt,
        'author', 'Author ' || cnt % 1000,
        'title', 'Title ' || cnt,
        'description', 'Description ' || cnt,
        'startDate', '2024-01-01'::timestamp + (INTERVAL '1 minute' * (cnt-1)),
        'endDate', '2024-01-01'::timestamp + (INTERVAL '1 minute' * (cnt+6))
    )::json
FROM generate_series(1, 10000000) AS t(cnt);
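After a bulk load this size it is worth making sure the planner statistics are fresh before comparing plans; a manual ANALYZE rules out stale estimates:
-- make sure the planner has up-to-date statistics for the new data;
-- autovacuum will do this eventually, but not necessarily immediately
ANALYZE auction_jsonb_indexed;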
Now when I query, I end up with a full table scan:
EXPLAIN ANALYZE SELECT COUNT(*) FROM auction_jsonb_indexed WHERE item->>'author' = 'Author 1';
Finalize Aggregate  (cost=420329.21..420329.22 rows=1 width=8) (actual time=4248.623..4253.143 rows=1 loops=1)
  ->  Gather  (cost=420329.00..420329.21 rows=2 width=8) (actual time=4248.436..4253.131 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=419329.00..419329.01 rows=1 width=8) (actual time=4237.667..4237.668 rows=1 loops=3)
              ->  Parallel Seq Scan on auction_jsonb_indexed  (cost=0.00..418285.94 rows=417222 width=0) (actual time=4.612..4228.083 rows=333333 loops=3)
                    Filter: ((item ->> 'author'::text) = 'Author 1'::text)
                    Rows Removed by Filter: 3000000
Planning Time: 0.108 ms
JIT:
  Functions: 14
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 0.837 ms (Deform 0.195 ms), Inlining 0.000 ms, Optimization 1.722 ms, Emission 12.087 ms, Total 14.646 ms
Execution Time: 4253.505 ms
If I fill the table with 1 million rows instead of 10 million, EXPLAIN shows that the right plan is chosen:
Aggregate  (cost=3453.72..3453.73 rows=1 width=8) (actual time=0.921..0.922 rows=1 loops=1)
  ->  Bitmap Heap Scan on auction_jsonb_indexed  (cost=16.12..3451.24 rows=993 width=0) (actual time=0.698..0.865 rows=1000 loops=1)
        Recheck Cond: ((item ->> 'author'::text) = 'Author 1'::text)
        Heap Blocks: exact=1000
        ->  Bitmap Index Scan on jsonb_author  (cost=0.00..15.87 rows=993 width=0) (actual time=0.446..0.446 rows=1000 loops=1)
              Index Cond: ((item ->> 'author'::text) = 'Author 1'::text)
Planning Time: 0.115 ms
Execution Time: 1.045 ms
If I hack around with the configuration, I can get PostgreSQL to use the index:
SET enable_seqscan = OFF;
SET enable_bitmapscan = OFF;
But of course that is not the solution; it just shows that the index would be usable for the query.
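Those switches are session-local planner settings, so the experiment can be undone with a RESET:
-- enable_seqscan/enable_bitmapscan only affect the current session;
-- reset them after the experiment so later plans are not skewed
RESET enable_seqscan;
RESET enable_bitmapscan;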
I tried raising the statistics target, but that did not help: ALTER TABLE auction_jsonb_indexed ALTER COLUMN item SET STATISTICS 1000;
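Since the index is on an expression, the statistics target on the item column itself presumably does not influence the expression statistics; if anything, the target would need to be set on the index column, along these lines:
-- the statistics for an expression index live on the index itself;
-- column 1 refers to the first (and only) indexed expression
ALTER INDEX jsonb_author ALTER COLUMN 1 SET STATISTICS 1000;
ANALYZE auction_jsonb_indexed;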
I made a db-fiddle, but it runs out of memory.
What do I need to change so PostgreSQL understands it can use the index?
Thanks to the input from @FrankHeikens, I found the random_page_cost parameter. Indeed, lowering that value (to tell PostgreSQL that it is not that expensive to access random parts of the disk) makes it use the index.
In my example on my system, for 10M rows, the value needs to be 1.15 or lower.
ALTER TABLESPACE pg_default SET (random_page_cost=1.1);
EXPLAIN ANALYZE SELECT COUNT(*) FROM auction_jsonb_indexed WHERE item->>'author' = 'Author 1';
Aggregate  (cost=10906.93..10906.94 rows=1 width=8) (actual time=4.152..4.154 rows=1 loops=1)
  ->  Bitmap Heap Scan on auction_jsonb_indexed  (cost=87.60..10882.00 rows=9970 width=0) (actual time=2.803..3.858 rows=10000 loops=1)
        Recheck Cond: ((item ->> 'author'::text) = 'Author 1'::text)
        Heap Blocks: exact=10000
        ->  Bitmap Index Scan on jsonb_author  (cost=0.00..85.11 rows=9970 width=0) (actual time=1.465..1.466 rows=10000 loops=1)
              Index Cond: ((item ->> 'author'::text) = 'Author 1'::text)
Planning Time: 0.076 ms
Execution Time: 4.271 ms
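Setting it on the tablespace is just the scope I happened to pick; for experimenting, the same parameter can also be set per session, or persisted server-wide:
-- per-session, handy for comparing plans side by side
SET random_page_cost = 1.1;

-- or server-wide (requires superuser), then reload the configuration
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();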
(Btw: note that the number of distinct values in the indexed field also changes the plan. If I generate the data with cnt % 10 instead of cnt % 1000, a slightly different plan is used. It still benefits from the index, but is considerably slower.)
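To see how this feeds into the planner's estimates: the statistics PostgreSQL gathers for an expression index show up in pg_stats under the index name, so the assumed number of distinct authors can be checked directly:
-- expression-index statistics are listed under the index name;
-- n_distinct drives the selectivity estimate for the author filter
SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'jsonb_author';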