sqlpostgresqlsubqueryaggregate-functionslateral

Select multiple aggregation result from the same subquery


I find myself doing statistics like this more and more often - for every record in tbl1, use it as condition to scan tbl2 and do aggregation on tbl2, return the aggregated results. This is easy with sub-query.

But I need to do multiple aggregation in one pass in the same subquery instead of scanning multiple times.

For example, here is some sample data for analysis:

SELECT  to_timestamp((random()*999999)) evt_ts, 
    evt_id  
    INTO temp.test_so_44410354
FROM
    (
        SELECT *, (random()*2000)::int AS idx
            FROM generate_series( 1, 1000) AS evt_id
            ORDER BY idx 
    ) AS t

evt_ts              |evt_id |
--------------------|-------|
1970-01-01 08:13:15 |109    |
1970-01-01 08:17:22 |762    |
1970-01-01 08:19:28 |630    |
1970-01-01 08:29:34 |429    |
1970-01-01 08:48:28 |70     |

To see how events distribute over an given time frame, current approach & query plan & desired output :

With wcal AS (
    SELECT ts, ts::date as dt 
    FROM generate_series( '1970-01-01 0:0:0', '1970-01-12 0:0:0', INTERVAL '1 day' ) AS ts
)
SELECT ts, dt
,(
    SELECT count(evt_ts) 
    FROM temp.test_so_44410354 as el
    WHERE ( el.evt_ts < wcal.ts AND el.evt_ts > wcal.ts - INTERVAL '1 day'  )
) as ecnt
,(
    SELECT sum(evt_id) -- should be array_agg, sum is locate to the difference in the results.
    FROM temp.test_so_44410354 as el
    WHERE ( el.evt_ts < wcal.ts AND el.evt_ts > wcal.ts - INTERVAL '1 day'  )
) as eids
FROM wcal

QUERY PLAN                                                                                          |
----------------------------------------------------------------------------------------------------|
CTE Scan on wcal  (cost=12.50..21314.08 rows=1000 width=28)                                         |
  CTE wcal                                                                                          |
    ->  Function Scan on generate_series ts  (cost=0.00..12.50 rows=1000 width=12)                  |
  SubPlan 2                                                                                         |
    ->  Aggregate  (cost=10.63..10.64 rows=1 width=8)                                               |
          ->  Bitmap Heap Scan on test_so_44410354 el  (cost=4.33..10.62 rows=5 width=8)            |
                Recheck Cond: ((evt_ts < wcal.ts) AND (evt_ts > (wcal.ts - '1 day'::interval)))     |
                ->  Bitmap Index Scan on test_so_44410354_idx  (cost=0.00..4.33 rows=5 width=0)     |
                      Index Cond: ((evt_ts < wcal.ts) AND (evt_ts > (wcal.ts - '1 day'::interval))) |
  SubPlan 3                                                                                         |
    ->  Aggregate  (cost=10.63..10.64 rows=1 width=8)                                               |
    ...

ts                  |dt         |ecnt |eids  |
--------------------|-----------|-----|------|
1970-01-01 00:00:00 |1970-01-01 |0    |      |
1970-01-02 00:00:00 |1970-01-02 |68   |36156 |
1970-01-03 00:00:00 |1970-01-03 |85   |42103 |
1970-01-04 00:00:00 |1970-01-04 |94   |47092 |
1970-01-05 00:00:00 |1970-01-05 |74   |36276 |
....

There are 2 subplans, it seems pg will have to scan table evt_log twice to do the same aggregation twice. I'd would like avoid this. since the aggregation conditions are exactly the same and it maybe be much faster if it can do many statistical aggregation in one pass. Logs can grow very big, the processing will take quite some time.

Feedback on proposed solution I nested join is not working as expected:

With wcal AS (
SELECT ts, ts::date as dt 
FROM generate_series( '1970-01-01 0:0:0', '1970-01-12 0:0:0', INTERVAL '1 day' ) AS ts
)
, ag AS (
    SELECT 
        count(evt_ts) AS ecnt
      , sum(evt_id) AS eids
    FROM temp.test_so_44410354 as el
    JOIN wcal ON ( el.evt_ts < wcal.ts AND el.evt_ts > wcal.ts - INTERVAL '1 day'  )
)
SELECT ts, dt, ag.ecnt, eids
FROM wcal
JOIN ag ON true

QUERY PLAN                                                                                                            |
----------------------------------------------------------------------------------------------------------------------|
Nested Loop  (cost=4239.57..4269.59 rows=1000 width=28)                                                               |
  CTE wcal                                                                                                            |
    ->  Function Scan on generate_series ts  (cost=0.00..12.50 rows=1000 width=12)                                    |
  CTE ag                                                                                                              |
    ->  Aggregate  (cost=4227.06..4227.07 rows=1 width=16)                                                            |
          ->  Nested Loop  (cost=0.28..3671.50 rows=111111 width=12)                                                  |
                ->  CTE Scan on wcal wcal_1  (cost=0.00..20.00 rows=1000 width=8)                                     |
                ->  Index Scan using test_so_44410354_idx on test_so_44410354 el  (cost=0.28..2.54 rows=111 width=12) |
                      Index Cond: ((evt_ts < wcal_1.ts) AND (evt_ts > (wcal_1.ts - '1 day'::interval)))               |
  ->  CTE Scan on ag  (cost=0.00..0.02 rows=1 width=16)                                                               |
  ->  CTE Scan on wcal  (cost=0.00..20.00 rows=1000 width=12)                                                         |

ts                  |dt         |ecnt |eids   |
--------------------|-----------|-----|-------|
1970-01-01 00:00:00 |1970-01-01 |920  |464436 |
1970-01-04 00:00:00 |1970-01-04 |920  |464436 |
1970-01-05 00:00:00 |1970-01-05 |920  |464436 |
1970-01-06 00:00:00 |1970-01-06 |920  |464436 |
1970-01-07 00:00:00 |1970-01-07 |920  |464436 |
....

Feedack on proposed solution II: The writing are different but the query plan are basically the same with 2 sub plan.

DROP TYPE temp.my_agg ;
CREATE TYPE temp.my_agg AS ( ecnt int, eids int );
EXPLAIN
With wcal AS (
    SELECT ts, ts::date as dt 
    FROM generate_series( '1970-01-01 0:0:0', '1970-01-12 0:0:0', INTERVAL '1 day' ) AS ts
)
SELECT ts, dt
,(
    SELECT (count(evt_ts),sum(evt_id))::temp.my_agg
    FROM temp.test_so_44410354 as el
    WHERE ( el.evt_ts < wcal.ts AND el.evt_ts > wcal.ts - INTERVAL '1 day'  )
).*
FROM wcal

QUERY PLAN                                                                                                                                  |
--------------------------------------------------------------------------------------------------------------------------------------------|
CTE Scan on wcal  (cost=12.50..21349.08 rows=1000 width=20) (actual time=0.182..3.518 rows=12 loops=1)                                      |
  CTE wcal                                                                                                                                  |
    ->  Function Scan on generate_series ts  (cost=0.00..12.50 rows=1000 width=12) (actual time=0.057..0.083 rows=12 loops=1)               |
  SubPlan 2                                                                                                                                 |
    ->  Aggregate  (cost=10.64..10.66 rows=1 width=32) (actual time=0.140..0.140 rows=1 loops=12)                                           |
          ->  Bitmap Heap Scan on test_so_44410354 el  (cost=4.33..10.62 rows=5 width=12) (actual time=0.049..0.083 rows=77 loops=12)       |
                Recheck Cond: ((evt_ts < wcal.ts) AND (evt_ts > (wcal.ts - '1 day'::interval)))                                             |
                Heap Blocks: exact=66                                                                                                       |
                ->  Bitmap Index Scan on test_so_44410354_idx  (cost=0.00..4.33 rows=5 width=0) (actual time=0.037..0.037 rows=77 loops=12) |
                      Index Cond: ((evt_ts < wcal.ts) AND (evt_ts > (wcal.ts - '1 day'::interval)))                                         |
  SubPlan 3                                                                                                                                 |
    ...
Planning time: 0.548 ms                                                                                                                     |
Execution time: 3.748 ms                                                                                                                    |

For proposed solution II, if the aggregation is returned as record, the query plan will look like below: Which roughly doubled the speed c.f. previous query. (but record type element is not desired as final result.)

QUERY PLAN                                                                                                                                  |
--------------------------------------------------------------------------------------------------------------------------------------------|
CTE Scan on wcal  (cost=12.50..10690.79 rows=1000 width=44) (actual time=0.147..1.979 rows=12 loops=1)                                      |
  CTE wcal                                                                                                                                  |
    ->  Function Scan on generate_series ts  (cost=0.00..12.50 rows=1000 width=12) (actual time=0.057..0.082 rows=12 loops=1)               |
  SubPlan 2                                                                                                                                 |
    ->  Aggregate  (cost=10.64..10.66 rows=1 width=32) (actual time=0.151..0.151 rows=1 loops=12)                                           |
          ->  Bitmap Heap Scan on test_so_44410354 el  (cost=4.33..10.62 rows=5 width=12) (actual time=0.054..0.089 rows=77 loops=12)       |
                Recheck Cond: ((evt_ts < wcal.ts) AND (evt_ts > (wcal.ts - '1 day'::interval)))                                             |
                Heap Blocks: exact=66                                                                                                       |
                ->  Bitmap Index Scan on test_so_44410354_idx  (cost=0.00..4.33 rows=5 width=0) (actual time=0.040..0.040 rows=77 loops=12) |
                      Index Cond: ((evt_ts < wcal.ts) AND (evt_ts > (wcal.ts - '1 day'::interval)))                                         |
Planning time: 0.381 ms                                                                                                                     |
Execution time: 2.147 ms                                                                                                                    |

The LATERAL join solution works just as expected. Thanks.

EXPLAIN ANALYZE
With wcal AS (
    SELECT ts, ts::date as dt 
    FROM generate_series( '1970-01-01 0:0:0', '1970-01-12 0:0:0', INTERVAL '1 day' ) AS ts
)
SELECT ts, dt, p.*
FROM wcal
     CROSS JOIN LATERAL (
        SELECT count(evt_ts), sum(evt_id)
        FROM temp.test_so_44410354 as el
        WHERE ( el.evt_ts < wcal.ts AND el.evt_ts > wcal.ts - INTERVAL '1 day'  )
    ) p

QUERY PLAN                                                                                                                                |
------------------------------------------------------------------------------------------------------------------------------------------|
Nested Loop  (cost=23.15..10705.79 rows=1000 width=28) (actual time=0.130..1.611 rows=12 loops=1)                                         |
  CTE wcal                                                                                                                                |
    ->  Function Scan on generate_series ts  (cost=0.00..12.50 rows=1000 width=12) (actual time=0.054..0.075 rows=12 loops=1)             |
  ->  CTE Scan on wcal  (cost=0.00..20.00 rows=1000 width=12) (actual time=0.061..0.099 rows=12 loops=1)                                  |
  ->  Aggregate  (cost=10.64..10.65 rows=1 width=16) (actual time=0.123..0.123 rows=1 loops=12)                                           |
        ->  Bitmap Heap Scan on test_so_44410354 el  (cost=4.33..10.62 rows=5 width=12) (actual time=0.044..0.074 rows=77 loops=12)       |
              Recheck Cond: ((evt_ts < wcal.ts) AND (evt_ts > (wcal.ts - '1 day'::interval)))                                             |
              Heap Blocks: exact=66                                                                                                       |
              ->  Bitmap Index Scan on test_so_44410354_idx  (cost=0.00..4.33 rows=5 width=0) (actual time=0.033..0.033 rows=77 loops=12) |
                    Index Cond: ((evt_ts < wcal.ts) AND (evt_ts > (wcal.ts - '1 day'::interval)))                                         |
Planning time: 0.399 ms                                                                                                                   |
Execution time: 1.778 ms                                                                                                                  |

ts                  |dt         |count |sum   |
--------------------|-----------|------|------|
1970-01-01 00:00:00 |1970-01-01 |0     |      |
1970-01-02 00:00:00 |1970-01-02 |68    |36156 |
1970-01-03 00:00:00 |1970-01-03 |85    |42103 |
1970-01-04 00:00:00 |1970-01-04 |94    |47092 |
1970-01-05 00:00:00 |1970-01-05 |74    |36276 |
1970-01-06 00:00:00 |1970-01-06 |96    |43797 |

Solution

  • Use lateral join:

    With wcal AS (
        SELECT ts, ts::date as dt, 
            extract( week from ts) as wk, 
            extract( month from ts ) as mo,
            extract( quater from ts ) as qtr,
        FROM generate_series( t1, t2, tspan )
    )
    SELECT ts, dt, wk, mo, qtr, ecnt, eids
    from
      wcal cross join lateral (
        SELECT count(evt_ts) as ecnt, array_agg(evt_id) as eids
        FROM ev_log as el
        WHERE (el.ts < wcal.ts AND el.ts < wcal.ts - tspan)) as el;
    

    <table> cross join lateral (<query>) is a synonym to <table>, lateral (<query>).