postgresqlsortingbottleneck

What will cause the rows increase of the sort operation to in execution


I want to create some aggregations of table. But when I look at execution plan I find out that there is massive row increase in sort (from 5 121 to 1 453 235 569). Why this is happend? What can cause this increase?

Part of plan which show the increase of rows in sort

More info: Original table has this schema:

CREATE TABLE event (
    deviceid varchar NOT NULL,
    "time" timestamptz NOT NULL,
    signal_quality float8 NULL,
    packet_loss float8 NULL,
    ping float8 NULL,
    jitter float8 NULL,
    download float8 NULL,
    upload float8 NULL,
    latency float8 NULL,
    profinet float8 NULL
);

Next table has data about start tests:

CREATE TABLE start_test (
    deviceid varchar NOT NULL,
    "time" timestamptz NOT NULL
);

I want to agregate data for each test. For one test there will be more rows, each row contains one type of aggregation - for example one row for minimum, one row for median,... Here is subset of aggregation script:

WITH event_cycle AS (
    SELECT 
        e.*, 
        sc.time as start_time, 
        sc.end_time as end_time
    FROM (
        SELECT 
            *
        FROM event
    ) e 
    LEFT JOIN (
        SELECT *, COALESCE(LEAD(time) OVER(PARTITION BY deviceid ORDER BY time), NOW()) as end_time
        FROM start_test
    ) sc ON sc.time <= e.time AND e.time < sc.end_time AND e.deviceid = sc.deviceid
), bounds AS (
    SELECT 
        start_time start_time_b, 
        min(end_time) end_time_b,
        deviceid deviceid_b, 
        --low
        PERCENTILE_DISC(0.005) WITHIN GROUP(ORDER BY signal_quality) as signal_quality_low,
        --high
        PERCENTILE_DISC(0.995) WITHIN GROUP(ORDER BY signal_quality) as signal_quality_high
    FROM event_cycle
    GROUP BY start_time, deviceid
), data_with_boundaries AS (
    SELECT *
    FROM event_cycle ec
    LEFT JOIN bounds b ON b.start_time_b <= ec.time AND ec.time < b.end_time_b AND ec.deviceid = b.deviceid_b
)
SELECT *
FROM (
  SELECT 
      min(time) as start_time, 
      max(time) as end_time,
      deviceid, 
      'min' as agg_type,
      min(signal_quality) as signal_quality
  FROM event_cycle
  GROUP BY start_time, deviceid
) min
UNION ALL
SELECT *
FROM (
  SELECT 
      min(time) as start_time, 
      max(time) as end_time,
      deviceid, 
      'median' as agg_type,
      PERCENTILE_DISC(0.5) WITHIN GROUP(ORDER BY signal_quality) as signal_quality
  FROM event_cycle
  GROUP BY start_time, deviceid
) median
UNION ALL
SELECT *
FROM (
  SELECT 
      min(time) as start_time, 
      max(time) as end_time,
      deviceid, 
      'ipr_avg' as agg_type,
      avg(CASE WHEN signal_quality 
                        BETWEEN signal_quality_low
                        AND signal_quality_high
                  THEN signal_quality 
              ELSE NULL 
          END) as signal_quality
  FROM data_with_boundaries
  GROUP BY start_time, deviceid
) ipr_avg

This is whole query plan:

Append  (cost=65349092.63..2218756400.84 rows=120000 width=88) (actual time=108568.862..305159.053 rows=15363 loops=1)
  CTE event_cycle
    ->  Hash Left Join  (cost=652.57..23608823.05 rows=161244967 width=99) (actual time=11.336..107377.365 rows=855037 loops=1)
          Hash Cond: ((event.deviceid)::text = (sc.deviceid)::text)
          Join Filter: ((sc."time" <= event."time") AND (event."time" < sc.end_time))
          Rows Removed by Join Filter: 1451937372
          ->  Seq Scan on event  (cost=0.00..17802.37 rows=855037 width=83) (actual time=0.095..338.366 rows=855037 loops=1)
          ->  Hash  (cost=588.92..588.92 rows=5092 width=27) (actual time=11.190..11.193 rows=5118 loops=1)
                Buckets: 8192  Batches: 1  Memory Usage: 359kB
                ->  Subquery Scan on sc  (cost=423.43..588.92 rows=5092 width=27) (actual time=8.076..10.517 rows=5118 loops=1)
                      ->  WindowAgg  (cost=423.43..538.00 rows=5092 width=27) (actual time=8.075..10.152 rows=5118 loops=1)
                            ->  Sort  (cost=423.43..436.16 rows=5092 width=19) (actual time=8.060..8.260 rows=5118 loops=1)
                                  Sort Key: start_test.deviceid, start_test."time"
                                  Sort Method: quicksort  Memory: 592kB
                                  ->  Seq Scan on start_test  (cost=0.00..109.92 rows=5092 width=19) (actual time=0.017..0.557 rows=5118 loops=1)
  ->  Subquery Scan on min  (cost=41740269.58..44159744.08 rows=40000 width=88) (actual time=108568.860..108914.711 rows=5121 loops=1)
        ->  GroupAggregate  (cost=41740269.58..44159344.08 rows=40000 width=96) (actual time=108568.860..108914.282 rows=5121 loops=1)
              Group Key: event_cycle.start_time, event_cycle.deviceid
              ->  Sort  (cost=41740269.58..42143382.00 rows=161244967 width=56) (actual time=108568.845..108751.590 rows=855037 loops=1)
                    Sort Key: event_cycle.start_time, event_cycle.deviceid
                    Sort Method: external merge  Disk: 38608kB
                    ->  CTE Scan on event_cycle  (cost=0.00..3224899.34 rows=161244967 width=56) (actual time=11.338..108096.087 rows=855037 loops=1)
  ->  Subquery Scan on median  (cost=41740269.58..44159844.08 rows=40000 width=88) (actual time=812.829..1215.500 rows=5121 loops=1)
        ->  GroupAggregate  (cost=41740269.58..44159444.08 rows=40000 width=96) (actual time=812.828..1214.812 rows=5121 loops=1)
              Group Key: event_cycle_1.start_time, event_cycle_1.deviceid
              ->  Sort  (cost=41740269.58..42143382.00 rows=161244967 width=56) (actual time=812.797..1009.458 rows=855037 loops=1)
                    Sort Key: event_cycle_1.start_time, event_cycle_1.deviceid
                    Sort Method: external merge  Disk: 38608kB
                    ->  CTE Scan on event_cycle event_cycle_1  (cost=0.00..3224899.34 rows=161244967 width=56) (actual time=0.058..229.673 rows=855037 loops=1)
  ->  Subquery Scan on ipr_avg  (cost=2035162059.84..2106827389.62 rows=40000 width=88) (actual time=194608.278..195027.597 rows=5121 loops=1)
        ->  GroupAggregate  (cost=2035162059.84..2106826989.62 rows=40000 width=96) (actual time=194608.276..195027.074 rows=5121 loops=1)
              Group Key: ec.start_time, ec.deviceid
              ->  Sort  (cost=2035162059.84..2044120113.56 rows=3583221489 width=72) (actual time=194608.252..194833.440 rows=855037 loops=1)
                    Sort Key: ec.start_time, ec.deviceid
                    Sort Method: external merge  Disk: 45552kB
                    ->  Merge Left Join  (cost=85903271.20..731689364.04 rows=3583221489 width=72) (actual time=1673.883..194023.850 rows=855037 loops=1)
                          Merge Cond: ((ec.deviceid)::text = (b.deviceid_b)::text)
                          Join Filter: ((b.start_time_b <= ec."time") AND (ec."time" < b.end_time_b))
                          Rows Removed by Join Filter: 1452792384
                          ->  Sort  (cost=41740269.58..42143382.00 rows=161244967 width=56) (actual time=739.639..1020.075 rows=855037 loops=1)
                                Sort Key: ec.deviceid
                                Sort Method: external merge  Disk: 38608kB
                                ->  CTE Scan on event_cycle ec  (cost=0.00..3224899.34 rows=161244967 width=56) (actual time=0.027..182.877 rows=855037 loops=1)
                          ->  Sort  (cost=44163001.63..44163101.63 rows=40000 width=64) (actual time=934.232..51961.852 rows=1453235569 loops=1)
                                Sort Key: b.deviceid_b
                                Sort Method: quicksort  Memory: 593kB
                                ->  Subquery Scan on b  (cost=41740269.58..44159944.08 rows=40000 width=64) (actual time=638.853..932.024 rows=5121 loops=1)
                                      ->  GroupAggregate  (cost=41740269.58..44159544.08 rows=40000 width=64) (actual time=638.852..931.538 rows=5121 loops=1)
                                            Group Key: event_cycle_2.start_time, event_cycle_2.deviceid
                                            ->  Sort  (cost=41740269.58..42143382.00 rows=161244967 width=56) (actual time=638.826..788.470 rows=855037 loops=1)
                                                  Sort Key: event_cycle_2.start_time, event_cycle_2.deviceid
                                                  Sort Method: external merge  Disk: 35424kB
                                                  ->  CTE Scan on event_cycle event_cycle_2  (cost=0.00..3224899.34 rows=161244967 width=56) (actual time=0.027..184.881 rows=855037 loops=1)
Planning Time: 0.638 ms
Execution Time: 305182.448 ms

EDIT trying solution from @jjanes

Using GiST index for range of start and end time

First join (creating event_cycle): First join, index is not used

Merge Left Join  (cost=184294.16..571464.97 rows=855012 width=99) (actual time=477.090..252598.974 rows=855012 loops=1)
  Merge Cond: ((network_checker_event.deviceid)::text = (sc.deviceid)::text)
  Join Filter: (sc.start_end_time @> network_checker_event."time")
  Rows Removed by Join Filter: 1451937254
  ->  Sort  (cost=183874.68..186012.21 rows=855012 width=83) (actual time=475.940..854.321 rows=855012 loops=1)
        Sort Key: network_checker_event.deviceid
        Sort Method: external merge  Disk: 58568kB
        ->  Seq Scan on network_checker_event  (cost=0.00..17802.12 rows=855012 width=83) (actual time=0.022..45.523 rows=855012 loops=1)
  ->  Sort  (cost=419.48..432.28 rows=5118 width=80) (actual time=0.986..51704.010 rows=1452380432 loops=1)
        Sort Key: sc.deviceid
        Sort Method: quicksort  Memory: 912kB
        ->  Seq Scan on start_cycle_gist sc  (cost=0.00..104.18 rows=5118 width=80) (actual time=0.017..0.347 rows=5118 loops=1)
Planning Time: 1.095 ms
Execution Time: 253883.973 ms

Second join (last select): Second join, index is used

Nested Loop Left Join  (cost=0.15..382777.33 rows=557590 width=216) (actual time=0.065..7138.223 rows=855012 loops=1)
  ->  Seq Scan on event_cycle ec  (cost=0.00..15713.90 rows=557590 width=120) (actual time=0.036..184.663 rows=855012 loops=1)
  ->  Index Scan using bounds_idx on bounds b  (cost=0.15..0.65 rows=1 width=96) (actual time=0.008..0.008 rows=1 loops=855012)
        Index Cond: (start_end_time_b @> ec."time")
        Filter: ((ec.deviceid)::text = (deviceid_b)::text)
        Rows Removed by Filter: 1
Planning Time: 1.081 ms
Execution Time: 7166.532 ms

The SQL is:

DROP TABLE IF EXISTS start_cycle_gist;
EXPLAIN ANALYZE
CREATE TEMP TABLE start_cycle_gist AS
SELECT 
    *, 
    COALESCE(LEAD(time) OVER(PARTITION BY deviceid ORDER BY time), NOW()) as end_time,
    tstzrange(time, COALESCE(LEAD(time) OVER(PARTITION BY deviceid ORDER BY time), NOW()), '[)') AS start_end_time
FROM start_cycle
;
CREATE INDEX start_cycle_gist_idx ON start_cycle_gist USING gist(start_end_time)
;
DROP TABLE IF EXISTS event_cycle;
EXPLAIN ANALYZE
CREATE TEMP TABLE event_cycle AS
    SELECT 
        e.*, 
        sc.time as start_time, 
        sc.end_time as end_time
    FROM (
        SELECT 
            *
        FROM event
    ) e 
    LEFT JOIN start_cycle_gist sc ON sc.start_end_time @> e.time AND e.deviceid = sc.deviceid
;
DROP TABLE IF EXISTS bounds;
EXPLAIN ANALYZE
CREATE TEMP TABLE bounds AS
    SELECT 
        start_time start_time_b, 
        min(end_time) end_time_b,
        tstzrange(start_time, min(end_time), '[)') AS start_end_time_b,
        deviceid deviceid_b, 
        --low
        PERCENTILE_DISC(0.005) WITHIN GROUP(ORDER BY signal_quality) as signal_quality_low,
        --high
        PERCENTILE_DISC(0.995) WITHIN GROUP(ORDER BY signal_quality) as signal_quality_high
    FROM event_cycle
    WHERE start_time IS NOT null
    GROUP BY start_time, deviceid
;
CREATE INDEX bounds_idx ON bounds USING gist(start_end_time_b)
;
EXPLAIN ANALYZE
SELECT *
FROM event_cycle ec
LEFT JOIN bounds b ON b.start_end_time_b @> ec.time AND ec.deviceid = b.deviceid_b
;

Solution

  • On a merge join, every tie in the first child requires the 2nd child to be partially rescanned, and the number of actual rows reported by the plan includes each row each time it was scanned. The merge join actually produces a huge number of rows initially, but then the vast majority are thrown out based on

    Join Filter: ((b.start_time_b <= ec."time") AND (ec."time" < b.end_time_b))
    Rows Removed by Join Filter: 1452792384
    

    You might want to incorporate start_time_b and end_time_b into a single range type, then build a composite index using GiST (deviceid_b, range_time_b)