Tags: postgresql, vacuum, autovacuum

Did Postgres VACUUM improve my query plan?


I have a table store_record with about 45 million rows, and I want to run a simple count query for the database_id that has the most rows. Note that I have an index on database_id.
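
Based on the index name and the "Output: database_id, updated_at" lines in the plans below, the index appears to be a composite index on (database_id, updated_at); its definition is presumably something like this sketch (the exact DDL is an assumption):

-- Assumed definition, inferred from the index name and output
-- columns shown in the plans below:
CREATE INDEX store_record_database_updated_at_a4646b_idx
    ON store_record (database_id, updated_at);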

SELECT COUNT(*) FROM store_record WHERE database_id = '7e28da88-ea52-451a-8611-eb9a60dbc15e';
-- returns ~17.2 million

The query took 3 minutes! See the query plan below, which I generated by prepending EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS) to the query:

Finalize Aggregate  (cost=3063219.25..3063219.25 rows=1 width=8) (actual time=178805.800..178899.302 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=174202 read=2786089
  I/O Timings: read=336637.165
  ->  Gather  (cost=3063219.15..3063219.25 rows=1 width=8) (actual time=178805.612..178899.288 rows=2 loops=1)
        Output: (PARTIAL count(*))
        Workers Planned: 1
        Workers Launched: 1
        JIT for worker 0:
          Functions: 4
"          Options: Inlining true, Optimization true, Expressions true, Deforming true"
"          Timing: Generation 0.688 ms, Inlining 68.060 ms, Optimization 20.002 ms, Emission 17.390 ms, Total 106.140 ms"
        Buffers: shared hit=174202 read=2786089
        I/O Timings: read=336637.165
        ->  Partial Aggregate  (cost=3062219.15..3062219.15 rows=1 width=8) (actual time=178781.061..178781.062 rows=1 loops=2)
              Output: PARTIAL count(*)
              Buffers: shared hit=174202 read=2786089
              I/O Timings: read=336637.165
              Worker 0: actual time=178756.791..178756.793 rows=1 loops=1
                Buffers: shared hit=86992 read=1397345
                I/O Timings: read=168337.781
              ->  Parallel Seq Scan on public.store_record  (cost=0.00..3056983.48 rows=10471335 width=0) (actual time=140.886..178023.778 rows=8784825 loops=2)
"                    Output: id, key, data, created_at, updated_at, database_id, organization_id, user_id"
                    Filter: (store_record.database_id = '7e28da88-ea52-451a-8611-eb9a60dbc15e'::uuid)
                    Rows Removed by Filter: 14472533
                    Buffers: shared hit=174202 read=2786089
                    I/O Timings: read=336637.165
                    Worker 0: actual time=110.506..177990.918 rows=8816662 loops=1
                      Buffers: shared hit=86992 read=1397345
                      I/O Timings: read=168337.781
"Settings: cpu_index_tuple_cost = '0.001', cpu_operator_cost = '0.0005', cpu_tuple_cost = '0.003', effective_cache_size = '10980000kB', max_parallel_workers_per_gather = '1', random_page_cost = '2', search_path = '""$user"", public, heroku_ext', work_mem = '100MB'"
Planning Time: 0.087 ms
JIT:
  Functions: 10
"  Options: Inlining true, Optimization true, Expressions true, Deforming true"
"  Timing: Generation 1.295 ms, Inlining 152.272 ms, Optimization 86.675 ms, Emission 36.935 ms, Total 277.177 ms"
Execution Time: 178900.033 ms

To test whether Postgres caching would help, I repeated the same query two more times; the timing was essentially unchanged.

Then, just messing around, I ran VACUUM ANALYZE store_record, which took about 15 minutes, and repeated the same query. It took only 2.7 seconds, and the query plan looks very different.

Finalize Aggregate  (cost=234344.55..234344.55 rows=1 width=8) (actual time=2538.619..2559.099 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=270505
  ->  Gather  (cost=234344.44..234344.55 rows=1 width=8) (actual time=2538.472..2559.087 rows=2 loops=1)
        Output: (PARTIAL count(*))
        Workers Planned: 1
        Workers Launched: 1
        JIT for worker 0:
          Functions: 3
"          Options: Inlining false, Optimization false, Expressions true, Deforming true"
"          Timing: Generation 0.499 ms, Inlining 0.000 ms, Optimization 0.193 ms, Emission 3.403 ms, Total 4.094 ms"
        Buffers: shared hit=270505
        ->  Partial Aggregate  (cost=233344.44..233344.45 rows=1 width=8) (actual time=2516.493..2516.494 rows=1 loops=2)
              Output: PARTIAL count(*)
              Buffers: shared hit=270505
              Worker 0: actual time=2494.746..2494.747 rows=1 loops=1
                Buffers: shared hit=131826
              ->  Parallel Index Only Scan using store_record_database_updated_at_a4646b_idx on public.store_record  (cost=0.11..228252.85 rows=10183195 width=0) (actual time=0.045..1749.091 rows=8637277 loops=2)
"                    Output: database_id, updated_at"
                    Index Cond: (store_record.database_id = '7e28da88-ea52-451a-8611-eb9a60dbc15e'::uuid)
                    Heap Fetches: 0
                    Buffers: shared hit=270505
                    Worker 0: actual time=0.068..1732.100 rows=8420237 loops=1
                      Buffers: shared hit=131826
"Settings: cpu_index_tuple_cost = '0.001', cpu_operator_cost = '0.0005', cpu_tuple_cost = '0.003', effective_cache_size = '10980000kB', max_parallel_workers_per_gather = '1', random_page_cost = '2', search_path = '""$user"", public, heroku_ext', work_mem = '100MB'"
Planning Time: 0.092 ms
JIT:
  Functions: 8
"  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"  Timing: Generation 0.981 ms, Inlining 0.000 ms, Optimization 0.326 ms, Emission 6.527 ms, Total 7.835 ms"
Execution Time: 2559.655 ms

The latter plan looks much better: it uses an Index Only Scan instead of a Sequential Scan.

So my question: did running VACUUM actually improve my query plan, and if so, why?

To check when autovacuum last ran on the table, I queried pg_stat_all_tables:

SELECT last_autovacuum FROM pg_stat_all_tables
WHERE schemaname = 'public' AND relname = 'store_record';

[screenshot of the query result]
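
For completeness, the same view also exposes the other maintenance timestamps and the dead-tuple counters (a broader variant of the query above; all columns are standard in pg_stat_all_tables):

SELECT last_vacuum, last_autovacuum, last_analyze, last_autoanalyze,
       n_live_tup, n_dead_tup
FROM pg_stat_all_tables
WHERE schemaname = 'public' AND relname = 'store_record';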


Solution

  • Surprising as it looks at first glance, the speed-up is really mostly due to VACUUM.

    The reason lies in PostgreSQL's implementation of multi-versioning (MVCC): each table row stores visibility information (in the form of the system columns xmin and xmax), because not every transaction can see every row version. This is how PostgreSQL maintains a consistent view of the data in the face of concurrent modifications, the property of the transaction system known as isolation.
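
    This visibility information can be inspected directly, since xmin and xmax are system columns that exist on every table (a quick illustrative query, not from the original question):

    -- xmin: ID of the transaction that created the row version;
    -- xmax: ID of the transaction that deleted or locked it (0 if none).
    SELECT xmin, xmax, id
    FROM store_record
    LIMIT 5;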

    As your second execution plan shows, the fastest way to count rows is an index-only scan that doesn't even read the table. But owing to the above, that is not always possible in PostgreSQL: you can only count rows that are visible to your transaction, and that visibility information is stored only in the table, not in the index (this avoids redundancy and, more importantly, keeps the index as small, and consequently as efficient, as possible).

    That sounds like an index-only scan should be impossible, but PostgreSQL has a way around the problem: each table has a visibility map, a tiny data structure that contains two bits per 8kB block of the table. One of these bits, the “all-visible” bit, indicates whether all active transactions can see all entries in that table block. If the block containing a row is all-visible, an index scan can skip fetching the row from the table to test its visibility and can become a true index-only scan.
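
    You can check how much of the table the visibility map currently covers: pg_class.relallvisible is a per-table counter maintained by VACUUM, and the pg_visibility contrib extension exposes the per-block bits (illustrative queries, not from the original answer):

    -- Pages marked all-visible vs. total pages:
    SELECT relpages, relallvisible
    FROM pg_class
    WHERE relname = 'store_record';

    -- Per-block detail (requires the contrib extension pg_visibility):
    CREATE EXTENSION IF NOT EXISTS pg_visibility;
    SELECT * FROM pg_visibility_map('store_record') LIMIT 10;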

    Now, that visibility map is maintained by the process that cleans up “garbage” in a table, and that process is VACUUM. So you can only get an efficient index-only scan if the table has been vacuumed recently. The query optimizer is aware of that and chooses its plan accordingly.

    The upshot of all this is that if index-only scans are important to you, you should reduce autovacuum_vacuum_scale_factor for that table so that it is vacuumed more frequently, as sketched below. On the other hand, a sensible database workload won't count the rows of a large table frequently, so you might choose not to do that and conserve resources.
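
    A per-table storage parameter is the usual way to do that; the value below is only an illustration:

    -- The default autovacuum_vacuum_scale_factor is 0.2 (vacuum once dead
    -- tuples reach ~20% of the table); lowering it for this table makes
    -- autovacuum run more often and keeps the visibility map fresher.
    ALTER TABLE store_record
        SET (autovacuum_vacuum_scale_factor = 0.01);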


    There are two other aspects that conspire to exacerbate the surprising performance difference: