Tags: sql, postgresql, max, aggregate, sql-limit

Performance of max() vs ORDER BY DESC + LIMIT 1


I was troubleshooting a few slow SQL queries today and don't quite understand the performance difference below:

When trying to extract the max(timestamp) from a data table based on some condition, using MAX() is slower than ORDER BY timestamp DESC LIMIT 1 if a matching row exists, but considerably faster if no matching row is found.

SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 4
ORDER BY timestamp DESC
LIMIT 1;
(0 rows)  
Time: 1314.544 ms

SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 5
ORDER BY timestamp DESC
LIMIT 1;
(1 row)  
Time: 10.890 ms

SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 4;
(0 rows)
Time: 0.869 ms

SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 5;
(1 row)
Time: 84.087 ms 

There are indexes on (timestamp) and (sensor_id, timestamp), and I noticed that Postgres uses very different query plans and indexes for both cases:

QUERY PLAN (ORDER BY)                                              
--------------------------------------------------------------------------------------------------------
Limit  (cost=0.43..9.47 rows=1 width=8)
    ->  Nested Loop  (cost=0.43..396254.63 rows=43823 width=8)
          Join Filter: (data.sensor_id = sensors.id)
          ->  Index Scan using timestamp_ind on data  (cost=0.43..254918.66 rows=4710976 width=12)
          ->  Materialize  (cost=0.00..6.70 rows=2 width=4)
              ->  Seq Scan on sensors  (cost=0.00..6.69 rows=2 width=4)
                  Filter: (station_id = 4)
(7 rows)

QUERY PLAN (MAX)                                               
----------------------------------------------------------------------------------------------------------
Aggregate  (cost=3680.59..3680.60 rows=1 width=8)
    ->  Nested Loop  (cost=0.43..3571.03 rows=43823 width=8)
        ->  Seq Scan on sensors  (cost=0.00..6.69 rows=2 width=4)
              Filter: (station_id = 4)
        ->  Index Only Scan using sensor_ind_timestamp on data  (cost=0.43..1389.59 rows=39258 width=12)
              Index Cond: (sensor_id = sensors.id)
(6 rows)

So my two questions are:

  1. Where does this performance difference come from? I've seen the accepted answer to MIN/MAX vs ORDER BY and LIMIT, but it doesn't quite seem to apply here. Any good resources would be appreciated.
  2. Are there better ways to increase performance in all cases (matching row vs no matching row) than adding an EXISTS check?

EDIT to address the questions in the comments below. I kept the initial query plans above for future reference:

Table definitions:

                                  Table "public.sensors"
        Column        |          Type          |                            Modifiers                            
----------------------+------------------------+-----------------------------------------------------------------
id                    | integer                | not null default nextval('sensors_id_seq'::regclass)
station_id            | integer                | not null
....

Indexes:
    "sensor_primary" PRIMARY KEY, btree (id)
    "ind_station_id" btree (station_id, id)
    "ind_station" btree (station_id)

                                  Table "public.data"
  Column   |           Type           |                            Modifiers                             
-----------+--------------------------+------------------------------------------------------------------
 id        | integer                  | not null default nextval('data_id_seq'::regclass)
 timestamp | timestamp with time zone | not null
 sensor_id | integer                  | not null
 avg       | integer                  |

Indexes:
    "timestamp_ind" btree ("timestamp" DESC)
    "sensor_ind" btree (sensor_id)
    "sensor_ind_timestamp" btree (sensor_id, "timestamp")
    "sensor_ind_timestamp_desc" btree (sensor_id, "timestamp" DESC)

Note that I added ind_station_id on sensors just now after @Erwin's suggestion below. Timings haven't really changed drastically, still >1200ms in the ORDER BY DESC + LIMIT 1 case and ~0.9ms in the MAX case.
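
For reference, the DDL for that new index (reconstructed here to match the definition listed above) was along the lines of:

CREATE INDEX ind_station_id ON sensors (station_id, id);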

Query Plans:

QUERY PLAN (ORDER BY)
----------------------------------------------------------------------------------------------------------
Limit  (cost=0.58..9.62 rows=1 width=8) (actual time=2161.054..2161.054 rows=0 loops=1)
  Buffers: shared hit=3418066 read=47326
  ->  Nested Loop  (cost=0.58..396382.45 rows=43823 width=8) (actual time=2161.053..2161.053 rows=0 loops=1)
        Join Filter: (data.sensor_id = sensors.id)
        Buffers: shared hit=3418066 read=47326
        ->  Index Scan using timestamp_ind on data  (cost=0.43..255048.99 rows=4710976 width=12) (actual time=0.047..1410.715 rows=4710976 loops=1)
              Buffers: shared hit=3418065 read=47326
        ->  Materialize  (cost=0.14..4.19 rows=2 width=4) (actual time=0.000..0.000 rows=0 loops=4710976)
              Buffers: shared hit=1
              ->  Index Only Scan using ind_station_id on sensors  (cost=0.14..4.18 rows=2 width=4) (actual time=0.004..0.004 rows=0 loops=1)
                    Index Cond: (station_id = 4)
                    Heap Fetches: 0
                    Buffers: shared hit=1
Planning time: 0.478 ms
Execution time: 2161.090 ms
(15 rows)

QUERY PLAN (MAX)
----------------------------------------------------------------------------------------------------------
Aggregate  (cost=3678.08..3678.09 rows=1 width=8) (actual time=0.009..0.009 rows=1 loops=1)
   Buffers: shared hit=1
   ->  Nested Loop  (cost=0.58..3568.52 rows=43823 width=8) (actual time=0.006..0.006 rows=0 loops=1)
         Buffers: shared hit=1
         ->  Index Only Scan using ind_station_id on sensors  (cost=0.14..4.18 rows=2 width=4) (actual time=0.005..0.005 rows=0 loops=1)
               Index Cond: (station_id = 4)
               Heap Fetches: 0
               Buffers: shared hit=1
         ->  Index Only Scan using sensor_ind_timestamp on data  (cost=0.43..1389.59 rows=39258 width=12) (never executed)
               Index Cond: (sensor_id = sensors.id)
               Heap Fetches: 0
 Planning time: 0.435 ms
 Execution time: 0.048 ms
 (13 rows)

So, just like in the earlier query plans, the ORDER BY variant does an index scan using timestamp_ind on data, which is not done in the MAX case.

Postgres version: Postgres from the Ubuntu repos: PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 5.2.1-21ubuntu2) 5.2.1 20151003, 64-bit

Note that there is a NOT NULL constraint on timestamp, so ORDER BY won't have to sort over NULL values.

Note also that I'm largely interested in where the difference comes from. While not ideal, I can retrieve data relatively quickly using EXISTS (<1ms) and then SELECT (~11ms).
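
To make that concrete, the two-step workaround looks roughly like this (a sketch of the approach, not my exact application code):

SELECT EXISTS (
    SELECT 1
    FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
    WHERE sensors.station_id = 4
);
-- fast (<1 ms); run the second query only when this returns true

SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 4
ORDER BY timestamp DESC
LIMIT 1;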


Solution

  • For performance, a matching index on sensors.station_id makes all the difference.

    There is an actual difference between max() and ORDER BY DESC + LIMIT 1. In the default ascending sort order, NULL sorts last; consequently, it sorts first in descending order. ORDER BY timestamp DESC LIMIT 1 therefore returns a NULL timestamp if one exists, while the aggregate function max() ignores NULLs and returns the latest non-null timestamp. ORDER BY timestamp DESC NULLS LAST LIMIT 1 would be equivalent to max().
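
    A minimal illustration of that difference, with throwaway values:

    SELECT ts
    FROM  (VALUES (NULL::timestamptz), (now())) v(ts)
    ORDER  BY ts DESC
    LIMIT  1;                -- returns NULL: it sorts first in descending order

    SELECT max(ts)
    FROM  (VALUES (NULL::timestamptz), (now())) v(ts);  -- returns now(): max() ignores NULL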

    Since your column d.timestamp is defined NOT NULL (as your update reveals), there is no effective difference. An index with DESC NULLS LAST and the same clause in the ORDER BY for the LIMIT query should still serve best. I suggest indexes on ...

    sensors(station_id, id)
    data(sensor_id, timestamp DESC NULLS LAST)
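
    In DDL form (the index names here are just placeholders):

    CREATE INDEX sensors_station_id_idx ON sensors (station_id, id);
    CREATE INDEX data_sensor_id_ts_idx  ON data (sensor_id, "timestamp" DESC NULLS LAST);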

    My query below builds on the 2nd one. Drop the indexes sensor_ind_timestamp and sensor_ind_timestamp_desc unless they have other use.

    More importantly, the filter on the first table sensors returns few, but still (possibly) multiple rows. Postgres expects to find 2 rows (rows=2) according to your added query plan.
    The perfect technique would be an index skip scan (a.k.a. loose index scan) for the second table data, but that is not currently implemented (as of Postgres 16). There are workarounds.

    The best should be:

    SELECT d.timestamp
    FROM   sensors s
    CROSS  JOIN LATERAL  (
       SELECT timestamp
       FROM   data
       WHERE  sensor_id = s.id
       ORDER  BY timestamp DESC NULLS LAST
       LIMIT  1
       ) d
    WHERE  s.station_id = 4
    ORDER  BY d.timestamp DESC NULLS LAST
    LIMIT  1;
    

    The choice between max() and ORDER BY / LIMIT barely matters. You might as well:

    SELECT max(d.timestamp) AS timestamp
    FROM   sensors s
    CROSS  JOIN LATERAL (
       SELECT timestamp
       FROM   data
       WHERE  sensor_id = s.id
       ORDER  BY timestamp DESC NULLS LAST
       LIMIT  1
       ) d
    WHERE  s.station_id = 4;
    

    Or:

    SELECT max(d.timestamp) AS timestamp
    FROM   sensors s
    CROSS  JOIN LATERAL  (
       SELECT max(timestamp) AS timestamp
       FROM   data
       WHERE  sensor_id = s.id
       ) d
    WHERE  s.station_id = 4;
    

    Or even use a correlated subquery, shortest of all:

    SELECT max((SELECT max(timestamp) FROM data WHERE sensor_id = s.id)) AS timestamp
    FROM   sensors s
    WHERE  station_id = 4;
    

    Note the double parentheses!

    The advantage of the LATERAL subquery is that you can retrieve arbitrary columns of selected row(s), not just the latest timestamp (one value).
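
    For example, to fetch the whole latest row rather than just its timestamp (column list taken from the data definition above):

    SELECT d.*
    FROM   sensors s
    CROSS  JOIN LATERAL (
       SELECT id, timestamp, sensor_id, avg
       FROM   data
       WHERE  sensor_id = s.id
       ORDER  BY timestamp DESC NULLS LAST
       LIMIT  1
       ) d
    WHERE  s.station_id = 4
    ORDER  BY d.timestamp DESC NULLS LAST
    LIMIT  1;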
