postgresql postgresql-10 postgresql-performance

How do I improve date-based query performance on a large table?


This is related to 2 other questions I posted (it sounded like I should post this as a new question) - the feedback helped, but I think the same issue will come back the next time I need to insert data. Things were still running slowly, which forced me to temporarily remove some of the older data so that only 2 months' worth remained in the table I'm querying.

Indexing strategy for different combinations of WHERE clauses incl. text patterns

How to get date_part query to hit index?

Giving further detail this time - hopefully it will help pinpoint the issue:

Schema of the largest table:

-- Table Definition ----------------------------------------------

CREATE TABLE reportimpression (
    datelocal timestamp without time zone,
    devicename text,
    network text,
    sitecode text,
    advertisername text,
    mediafilename text,
    gender text,
    agegroup text,
    views integer,
    impressions integer,
    dwelltime numeric
);

-- Indices -------------------------------------------------------

CREATE INDEX reportimpression_feb2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-02-01 00:00:00'::timestamp without time zone AND datelocal < '2019-03-01 00:00:00'::timestamp without time zone;
CREATE INDEX reportimpression_mar2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-03-01 00:00:00'::timestamp without time zone AND datelocal < '2019-04-01 00:00:00'::timestamp without time zone;
CREATE INDEX reportimpression_jan2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-01-01 00:00:00'::timestamp without time zone AND datelocal < '2019-02-01 00:00:00'::timestamp without time zone;

Slow query:

SELECT
    date_part('hour', datelocal) AS hour,
    SUM(CASE WHEN gender = 'male' THEN views ELSE 0 END) AS male,
    SUM(CASE WHEN gender = 'female' THEN views ELSE 0 END) AS female
FROM reportimpression
WHERE
    datelocal >= '2019-03-01' AND
    datelocal < '2019-04-01'
GROUP BY date_part('hour', datelocal)
ORDER BY date_part('hour', datelocal)

The date range in this query will generally be for an entire month (it accepts user input from a web-based report) - as you can see, I tried creating an index for each month's worth of data. That helped, but as far as I can tell, unless the query has recently been run (putting the results into the cache), it can still take up to a minute to run.

Explain analyze results:

Finalize GroupAggregate  (cost=1035890.38..1035897.86 rows=1361 width=24) (actual time=3536.089..3536.108 rows=24 loops=1)
  Group Key: (date_part('hour'::text, datelocal))
  ->  Sort  (cost=1035890.38..1035891.06 rows=1361 width=24) (actual time=3536.083..3536.087 rows=48 loops=1)
        Sort Key: (date_part('hour'::text, datelocal))
        Sort Method: quicksort  Memory: 28kB
        ->  Gather  (cost=1035735.34..1035876.21 rows=1361 width=24) (actual time=3535.926..3579.818 rows=48 loops=1)
              Workers Planned: 1
              Workers Launched: 1
              ->  Partial HashAggregate  (cost=1034735.34..1034740.11 rows=1361 width=24) (actual time=3532.917..3532.933 rows=24 loops=2)
                    Group Key: date_part('hour'::text, datelocal)
                    ->  Parallel Index Scan using reportimpression_mar2019_index on reportimpression  (cost=0.09..1026482.42 rows=3301168 width=17) (actual time=0.045..2132.174 rows=2801158 loops=2)
Planning time: 0.517 ms
Execution time: 3579.965 ms

I wouldn't think 10 million records would be too much to handle, especially given that I recently bumped up the PG plan that I'm on to try to throw resources at it, so I assume that the issue is still just either my indexes or my queries not being very efficient.


Solution

  • A materialized view is the way to go for what you outlined. Querying past months of read-only data works without refreshing the MV. You may want to special-case the current month to cover that, too.
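    For example, a materialized view that pre-aggregates views per hour and gender might look like this (a sketch; the view name and the choice of grouping columns are illustrative, not from the question):

        CREATE MATERIALIZED VIEW reportimpression_hourly AS
        SELECT date_trunc('hour', datelocal) AS hour_local
             , gender
             , sum(views) AS views
        FROM   reportimpression
        GROUP  BY 1, 2;

        -- refresh after bulk inserts (or on a schedule)
        REFRESH MATERIALIZED VIEW reportimpression_hourly;

    The report query then aggregates the much smaller MV instead of scanning millions of base rows each time.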

    The underlying query can still benefit from an index. There are two directions you might take:

    1. Partitioning

    Partial indexes like you have now won't buy much in your scenario. If you mostly query by month (and add / drop rows by month), table partitioning might be an idea. Then you have your indexes partitioned automatically, too.
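    A declarative range-partitioning sketch (available since Postgres 10, which matches the question's tag; note that in Postgres 10 indexes must still be created on each partition individually, while Postgres 11 propagates them from the parent):

        CREATE TABLE reportimpression_part (
            datelocal timestamp
          , gender    text
          , views     integer
          -- other columns as in the original table
        ) PARTITION BY RANGE (datelocal);

        CREATE TABLE reportimpression_2019_03
            PARTITION OF reportimpression_part
            FOR VALUES FROM ('2019-03-01') TO ('2019-04-01');

    Removing an old month then becomes a cheap DROP TABLE (or DETACH) of one partition instead of a bulk DELETE.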

    If your rows are wide, create an index that allows index-only scans. Ideally, use a "covering" index with INCLUDE (Postgres 11 or later):

    CREATE INDEX reportimpression_covering_idx ON reportimpression(datelocal) INCLUDE (views, gender);
    
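    Since the question is tagged Postgres 10, where INCLUDE is not yet available, a plain multicolumn B-tree index can serve the same purpose, as long as every column the query reads is an index column (the index name here is illustrative):

        -- Postgres 10: no INCLUDE clause; append the payload columns instead
        CREATE INDEX reportimpression_multi_idx
        ON reportimpression (datelocal, gender, views);

    It is slightly bigger and costlier to maintain than a true covering index, but it still enables index-only scans for the query above.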


    2. BRIN index

    If your rows are physically sorted by datelocal, consider a BRIN index. It's extremely small and probably about as fast as a B-tree index for your case. But being so small, it will stay cached more easily and not push out other data as much.

    CREATE INDEX reportimpression_brin_idx ON reportimpression USING BRIN (datelocal);
    

    You may be interested in CLUSTER or pg_repack to physically sort table rows. pg_repack can do it without exclusive locks on the table and even without a B-tree index (required by CLUSTER). But it's an additional module not shipped with the standard distribution of Postgres.
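    A sketch with CLUSTER (the index name is illustrative; CLUSTER needs a B-tree index on the sort column and holds an ACCESS EXCLUSIVE lock for the duration, so run it in a maintenance window):

        CREATE INDEX reportimpression_datelocal_idx ON reportimpression (datelocal);
        CLUSTER reportimpression USING reportimpression_datelocal_idx;
        ANALYZE reportimpression;

    Note that CLUSTER is a one-time sort; rows inserted afterwards are not kept in order, so it would need to be repeated periodically.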
