postgresqlpostgresql-14

Select rows from table based on if serial is in multirange


Problem

Given a table with a BIGSERIAL I want to query for the rows which have an index contained in a multirange passed as an argument.

The table, trimmed down, looks like:

CREATE TABLE event
(
    id         UUID PRIMARY KEY,
    index      BIGSERIAL   NOT NULL,
    data       JSONB
);

CREATE UNIQUE INDEX ON event (index);

What I naively want to is select those events which have an index in the multirange I pass as an argument:

SELECT
    *
FROM
    event
WHERE
    INDEX <@ $1;

The table is decently large (30mil rows) and the query is run often so it's essential that it runs quickly. The above query results in a seq scan which is quite slow, compared at least to our old "hack" described below.

Typically the multirange doesn't contain many ranges and generally has a infinite end for the last range, i.e. {[100, 110], [160,]}. Additionally, we usually query towards the end, so the values are typically in the range of 30 million. A query would usually return just a few hundred rows, rarely more.

What I've tried

Our old solution handles this problem by in the backend-code splitting it up the multirange and doing a query for each range.

SELECT
    *
FROM
    event
WHERE
    INDEX >= LOWER($1) AND(UPPER_INF($1) OR INDEX < UPPER($1));

This results in an efficient index scan and takes a few milliseconds to complete. However, it's far from ideal especially because we also want to have LIMIT and OFFSET work for the entire multirange.

Queries and analyze

Running the naive query above against production using a multirange fetching just a few events.

EXPLAIN (ANALYZE, buffers)
SELECT
    id, index, data
FROM
    event
WHERE
    index <@ '{[30000000,30000005],[30000010,30000025)}'::int8multirange;

Gives the following:

Gather (cost=1000.00..1476747.50 rows=154593 width=157) (actual time=4326.117..5610.042 rows=21 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=1204253 read=95107
I/O Timings: read=9552.560
-> Parallel Seq Scan on event (cost=0.00..1460288.20 rows=64414 width=157) (actual time=5178.488..5605.465 rows=7 loops=3)
Filter: (index <@ '{[30000000,30000006),[30000010,30000025)}'::int8multirange)
Rows Removed by Filter: 10064869
Buffers: shared hit=1204253 read=95107
I/O Timings: read=9552.560
Planning Time: 0.066 ms
Execution Time: 5614.964 ms

And just as an example using <@ with a int8range fetching almost the same data, just a few more rows:

EXPLAIN (ANALYZE, buffers)
SELECT
    id, index, data
FROM
    event
WHERE
    index <@ '[30000000,30000025)'::int8range;

Gives the following:

Gather (cost=1000.00..1476751.99 rows=154593 width=157) (actual time=2510.701..4249.822 rows=25 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=1236553 read=62777
I/O Timings: read=5273.182
-> Parallel Seq Scan on event (cost=0.00..1460292.69 rows=64414 width=157) (actual time=3658.410..4235.849 rows=8 loops=3)
Filter: (index <@ '[30000000,30000025)'::int8range)
Rows Removed by Filter: 10064877
Buffers: shared hit=1236553 read=62777
I/O Timings: read=5273.182
Planning Time: 0.067 ms
Execution Time: 4249.849 ms

Finally, our old hack using upper and lower and splitting it into multiple queries for each range:

EXPLAIN (ANALYZE, buffers)
SELECT
    id, index, data
FROM
    event
WHERE
    index >= lower('[30000000,30000025)'::int8range) and(upper_inf('[30000000,30000025)'::int8range) OR index < upper('[30000000,30000025)'::int8range));

Has the following:

Index Scan using idx_event_index_unique on event (cost=0.44..9.37 rows=7 width=157) (actual time=0.016..0.037 rows=25 loops=1)
Index Cond: ((index >= '30000000'::bigint) AND (index < '30000025'::bigint))
Buffers: shared hit=28
Planning:
Buffers: shared hit=8
Planning Time: 0.305 ms
Execution Time: 0.054 ms

That is, compared to the first two that use ranges for comparison it is several orders of magnitue faster (0,054ms compared to 4s).


Solution

  • For the following, I'm using index_seq to avoid potential confusion due to using INDEX as an identifier instead of a keyword.

    Unfortunately, PostgreSQL doesn't use the fact that x <@ '[l,u)'::int8range is semantically equivalent to l <= x < u and thus doesn't take advantage of the available index. One option is to add an exclude constraint and modify the query to express index_seq as a single value range that is checked for inclusion in the multirange:

    CREATE EXTENSION IF NOT EXISTS btree_gist;
    
    ALTER TABLE events
      ADD CONSTRAINT event_index_seq_exc EXCLUDE USING gist (int8range(index_seq, index_seq, '[]') WITH &&);
    
    EXPLAIN (ANALYZE, buffers)
    SELECT
        id, index_seq, data
    FROM
        event
    WHERE
        int8range(index_seq, index_seq, '[]') <@ '{[30000000,30000005],[30000010,30000025)}'::int8multirange;