Tags: postgresql, indexing

How can I get Postgres to use my functional index?


I have the following table:

CREATE TABLE items
(
    id NUMERIC(20, 0) NOT NULL DEFAULT NEXTVAL('items_sequence') PRIMARY KEY,
    item_price NUMERIC(19, 2) DEFAULT NULL NULL,
    status NUMERIC(2, 0) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
);

with the following index:

CREATE INDEX items_dash_idx ON items (status, DATE(created_at));

I want to group my items per status and per day for the last ~30 days. That is, I want the count and total item price per day per status, including days where the count/amount is 0. There are 5 statuses, one of which (50) is irrelevant and dominates the table: status 50 has ~400k rows for the last 30 days, while statuses 10, 20, 30 and 40 have ~1k each.

I have the following query:

SELECT COUNT(i.id)                    AS count,
       COALESCE(SUM(i.item_price), 0) AS amount,
       dates_table.status,
       dates_table.created_at
FROM (SELECT created_at::DATE AS created_at, 10 AS status
      FROM GENERATE_SERIES('2024-09-18'::DATE, '2024-10-18'::DATE, INTERVAL '1 DAY') AS created_at
      UNION
      SELECT created_at::DATE AS created_at, 20 AS status
      FROM GENERATE_SERIES('2024-09-18'::DATE, '2024-10-18'::DATE, INTERVAL '1 DAY') AS created_at
      UNION
      SELECT created_at::DATE AS created_at, 30 AS status
      FROM GENERATE_SERIES('2024-09-18'::DATE, '2024-10-18'::DATE, INTERVAL '1 DAY') AS created_at
      UNION
      SELECT created_at::DATE AS created_at, 40 AS status
      FROM GENERATE_SERIES('2024-09-18'::DATE, '2024-10-18'::DATE, INTERVAL '1 DAY') AS created_at
     ) AS dates_table
LEFT JOIN items i 
       ON i.status = dates_table.status
      AND DATE(i.created_at) = dates_table.created_at
GROUP BY dates_table.created_at, dates_table.status
ORDER BY dates_table.created_at, dates_table.status;

and this query seems to take 10+ seconds, with the following EXPLAIN (ANALYZE, BUFFERS) output:

QUERY PLAN
Sort  (cost=2242005.05..2242006.05 rows=400 width=48) (actual time=21950.589..21950.601 rows=72 loops=1)
  Sort Key: dates_table.created_at, dates_table.status
  Sort Method: quicksort  Memory: 29kB
  Buffers: shared hit=676950 read=747852 dirtied=755, temp read=28515 written=28531
  ->  HashAggregate  (cost=2241982.76..2241987.76 rows=400 width=48) (actual time=21950.436..21950.492 rows=72 loops=1)
        Group Key: dates_table.created_at, dates_table.status
        Batches: 1  Memory Usage: 61kB
        Buffers: shared hit=676947 read=747852 dirtied=755, temp read=28515 written=28531
        ->  Merge Left Join  (cost=2161026.21..2239512.33 rows=247043 width=20) (actual time=21834.112..21948.382 rows=11066 loops=1)
              Merge Cond: ((dates_table.created_at = (date(i.created_at))) AND (((dates_table.status)::numeric) = i.status))
              Buffers: shared hit=676947 read=747852 dirtied=755, temp read=28515 written=28531
              ->  Sort  (cost=449.35..459.35 rows=4000 width=8) (actual time=895.905..895.933 rows=72 loops=1)
                    Sort Key: dates_table.created_at, ((dates_table.status)::numeric)
                    Sort Method: quicksort  Memory: 28kB
                    Buffers: shared hit=4
                    ->  Subquery Scan on dates_table  (cost=130.03..210.03 rows=4000 width=8) (actual time=895.792..895.846 rows=72 loops=1)
                          ->  HashAggregate  (cost=130.03..170.03 rows=4000 width=8) (actual time=895.788..895.831 rows=72 loops=1)
                                Group Key: ((created_at.created_at)::date), (10)
                                Batches: 1  Memory Usage: 217kB
                                ->  Append  (cost=0.01..110.03 rows=4000 width=8) (actual time=895.697..895.749 rows=72 loops=1)
                                      ->  Function Scan on generate_series created_at  (cost=0.01..12.51 rows=1000 width=8) (actual time=895.694..895.697 rows=18 loops=1)
                                      ->  Function Scan on generate_series created_at_1  (cost=0.01..12.51 rows=1000 width=8) (actual time=0.012..0.014 rows=18 loops=1)
                                      ->  Function Scan on generate_series created_at_2  (cost=0.01..12.51 rows=1000 width=8) (actual time=0.010..0.012 rows=18 loops=1)
                                      ->  Function Scan on generate_series created_at_3  (cost=0.01..12.51 rows=1000 width=8) (actual time=0.010..0.012 rows=18 loops=1)
              ->  Materialize  (cost=2160576.87..2185898.76 rows=5064379 width=25) (actual time=19123.895..20601.926 rows=5066445 loops=1)
                    Buffers: shared hit=676943 read=747852 dirtied=755, temp read=28515 written=28531
                    ->  Sort  (cost=2160576.87..2173237.82 rows=5064379 width=25) (actual time=19123.888..20125.620 rows=5066445 loops=1)
                          Sort Key: (date(i.created_at)), i.status
                          Sort Method: external merge  Disk: 228120kB
                          Buffers: shared hit=676943 read=747852 dirtied=755, temp read=28515 written=28531
                          ->  Seq Scan on items i  (cost=0.00..1475438.79 rows=5064379 width=25) (actual time=0.064..16526.846 rows=5066445 loops=1)
                                Buffers: shared hit=676943 read=747852 dirtied=755
Planning Time: 0.399 ms
JIT:
  Functions: 44
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 2.096 ms, Inlining 293.474 ms, Optimization 383.558 ms, Emission 218.758 ms, Total 897.885 ms
Execution Time: 21989.150 ms

My cache hit ratio drops from 99.9% to 50% when I run this query. The equivalent index and query in Oracle (with TRUNC(created_at) in place of DATE(created_at)) complete in around 500ms.
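For reference, this is roughly how the cache hit ratio quoted above can be computed from the standard pg_stat_database statistics view (the database name 'mydb' is a placeholder):

    -- Approximate buffer-cache hit ratio per database.
    -- blks_hit = reads satisfied from shared buffers, blks_read = reads from disk/OS cache.
    SELECT datname,
           ROUND(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 1) AS cache_hit_pct
    FROM pg_stat_database
    WHERE datname = 'mydb';

The drop to 50% is consistent with the plan above: the sequential scan reads ~747k blocks from outside shared buffers.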

Update:

SELECT COUNT(i.id) AS count,
       SUM(i.item_price) AS amount,
       DATE(i.created_at) AS created_at,
       i.status AS status
FROM items i
WHERE i.status = 10 
AND DATE(i.created_at) >= '2024-10-01'
GROUP BY i.status, DATE(i.created_at)
ORDER BY i.status, DATE(i.created_at)

is fast with the following EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS) output:

QUERY PLAN
GroupAggregate  (cost=0.43..27.72 rows=9 width=49) (actual time=0.044..0.117 rows=9 loops=1)
  Output: count(id), sum(item_price), (date(created_at)), status
  Group Key: i.status, date(i.created_at)
  Buffers: shared hit=26
  ->  Index Scan using items_dash_idx  on tenant_xxx.items i  (cost=0.43..27.49 rows=9 width=21) (actual time=0.025..0.097 rows=24 loops=1)
        Output: date(created_at), status, id, item_price
        Index Cond: ((i.status = '10'::numeric) AND (date(i.created_at) >= '2024-10-01'::date))
        Buffers: shared hit=26
Settings: effective_cache_size = '16GB', search_path = 'tenant_xxx', work_mem = '32MB'
Planning:
  Buffers: shared hit=3
Planning Time: 0.255 ms
Execution Time: 0.140 ms

Update #2:

If I filter the items beforehand, and then perform the join:

WITH items_table AS (SELECT COUNT(i.id)        AS count,
                            SUM(i.item_price)  AS amount,
                            DATE(i.created_at) AS created_at,
                            i.status           AS status
                     FROM items i
                     WHERE i.status IN (10, 20, 30, 40)
                       AND DATE(i.created_at) >= '2024-10-01'
                       AND DATE(i.created_at) < '2024-10-18'::DATE + INTERVAL '1 DAY'
                     GROUP BY i.status, DATE(i.created_at)
                     ORDER BY i.status, DATE(i.created_at))
SELECT dates_table.created_at          AS created_at,
       dates_table.status              AS status,
       COALESCE(items_table.count, 0)  AS count,
       COALESCE(items_table.amount, 0) AS amount
FROM (SELECT created_at::DATE               AS created_at
           , UNNEST(ARRAY [10, 20, 30, 40]) AS status
      FROM GENERATE_SERIES('2024-10-01'::DATE, '2024-10-18'::DATE, '1 DAY') AS created_at) AS dates_table
         LEFT JOIN items_table ON items_table.status = dates_table.status AND
                                  items_table.created_at = dates_table.created_at
ORDER BY dates_table.created_at, dates_table.status

it returns in around 100ms.
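An alternative formulation (an untested sketch, not from the original post) is a LEFT JOIN LATERAL, which probes items_dash_idx once per (day, status) pair, much like the fast single-status query in the first update:

    SELECT d.created_at,
           d.status,
           COALESCE(agg.count, 0)  AS count,
           COALESCE(agg.amount, 0) AS amount
    FROM (SELECT created_at::DATE AS created_at,
                 UNNEST(ARRAY [10, 20, 30, 40]) AS status
          FROM GENERATE_SERIES('2024-10-01'::DATE, '2024-10-18'::DATE, '1 DAY') AS created_at) AS d
    LEFT JOIN LATERAL
         (SELECT COUNT(i.id)       AS count,   -- aggregate without GROUP BY:
                 SUM(i.item_price) AS amount   -- always exactly one row per (day, status)
          FROM items i
          WHERE i.status = d.status
            AND DATE(i.created_at) = d.created_at) AS agg ON TRUE
    ORDER BY d.created_at, d.status;

Because the lateral subquery is an ungrouped aggregate, it always returns exactly one row, so empty days still appear (COUNT gives 0, SUM gives NULL, hence the COALESCE).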


Solution

  • You could filter and aggregate before joining to the calendar, then coalesce() the mismatches to 0, as suggested by @Dunes (in which case the index is used). To simplify the query, generate the calendar once and unnest() the target statuses for each day:
    demo at db<>fiddle

    WITH calendar AS
       (SELECT created_at::date
             , unnest(array[10,20,30,40]) AS status
        FROM GENERATE_SERIES('2024-09-18'::date,'2024-10-18','1 DAY') AS created_at)
    ,pre_agg AS
       (SELECT COUNT(i.id)                    AS count,
               COALESCE(SUM(i.item_price), 0) AS amount,
               i.status,
               date(i.created_at) as created_at
        FROM items i 
        WHERE i.status=any(array[10,20,30,40])
          AND date(i.created_at) between '2024-09-18' and '2024-10-18'
        GROUP BY date(i.created_at), i.status )
    SELECT created_at
         , status
         , coalesce(count,0) as count
         , coalesce(amount,0) as amount
    FROM calendar 
    LEFT JOIN pre_agg 
        USING(created_at,status)
    ORDER BY created_at,status;
    
    QUERY PLAN
    Sort  (cost=6419.03..6429.03 rows=4000 width=76) (actual time=48.105..48.121 rows=124 loops=1)
      Output: ((created_at.created_at)::date), (((unnest('{10,20,30,40}'::integer[])))::numeric), (COALESCE(pre_agg.count, '0'::bigint)), (COALESCE(pre_agg.amount, '0'::numeric))
      Sort Key: ((created_at.created_at)::date), (((unnest('{10,20,30,40}'::integer[])))::numeric)
      Sort Method: quicksort  Memory: 34kB
      ->  Hash Left Join  (cost=6068.21..6179.71 rows=4000 width=76) (actual time=47.844..48.054 rows=124 loops=1)
            Output: ((created_at.created_at)::date), (unnest('{10,20,30,40}'::integer[])), COALESCE(pre_agg.count, '0'::bigint), COALESCE(pre_agg.amount, '0'::numeric)
            Inner Unique: true
            Hash Cond: ((((created_at.created_at)::date) = pre_agg.created_at) AND (((unnest('{10,20,30,40}'::integer[])))::numeric = pre_agg.status))
            ->  ProjectSet  (cost=0.01..40.01 rows=4000 width=8) (actual time=0.028..0.120 rows=124 loops=1)
                  Output: (created_at.created_at)::date, unnest('{10,20,30,40}'::integer[])
                  ->  Function Scan on pg_catalog.generate_series created_at  (cost=0.01..10.01 rows=1000 width=8) (actual time=0.025..0.031 rows=31 loops=1)
                        Output: created_at.created_at
                        Function Call: generate_series(('2024-09-18'::date)::timestamp with time zone, '2024-10-18 00:00:00+00'::timestamp with time zone, '1 day'::interval)
            ->  Hash  (cost=5948.04..5948.04 rows=8011 width=48) (actual time=47.796..47.798 rows=124 loops=1)
                  Output: pre_agg.count, pre_agg.amount, pre_agg.created_at, pre_agg.status
                  Buckets: 8192  Batches: 1  Memory Usage: 72kB
                  ->  Subquery Scan on pre_agg  (cost=5747.76..5948.04 rows=8011 width=48) (actual time=47.532..47.730 rows=124 loops=1)
                        Output: pre_agg.count, pre_agg.amount, pre_agg.created_at, pre_agg.status
                        ->  HashAggregate  (cost=5747.76..5867.93 rows=8011 width=48) (actual time=47.530..47.703 rows=124 loops=1)
                              Output: count(i.id), COALESCE(sum(i.item_price), '0'::numeric), i.status, (date(i.created_at))
                              Group Key: date(i.created_at), i.status
                              Batches: 1  Memory Usage: 465kB
                              ->  Bitmap Heap Scan on public.items i  (cost=521.68..5562.67 rows=18509 width=28) (actual time=5.149..36.145 rows=18355 loops=1)
                                    Output: i.status, date(i.created_at), i.id, i.item_price
                                    Recheck Cond: ((i.status = ANY ('{10,20,30,40}'::numeric[])) AND (date(i.created_at) >= '2024-09-18'::date) AND (date(i.created_at) <= '2024-10-18'::date))
                                    Heap Blocks: exact=1074
                                    ->  Bitmap Index Scan on items_dash_idx  (cost=0.00..517.05 rows=18509 width=0) (actual time=4.917..4.918 rows=18355 loops=1)
                                          Index Cond: ((i.status = ANY ('{10,20,30,40}'::numeric[])) AND (date(i.created_at) >= '2024-09-18'::date) AND (date(i.created_at) <= '2024-10-18'::date))
    Planning Time: 0.554 ms
    Execution Time: 48.340 ms
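    If this dashboard query is hot, it may also be worth widening the index so the pre-aggregation can run as an index-only scan instead of the bitmap heap scan above. This is a suggestion, not something taken from the plan; INCLUDE requires PostgreSQL 11+:

        -- Covering variant of items_dash_idx: the INCLUDE columns let the
        -- aggregate read id and item_price straight from the index, enabling
        -- an index-only scan when the visibility map is up to date.
        CREATE INDEX items_dash_covering_idx
            ON items (status, DATE(created_at))
            INCLUDE (id, item_price);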