This is related to 2 other questions I posted (sounds like I should post this as a new question) - the feedback helped, but I think the same issue will come back the next time I need to insert data. Things were running slowly still which forced me to temporarily remove some of the older data so that only 2 months' worth remained in the table that I'm querying.
Indexing strategy for different combinations of WHERE clauses incl. text patterns
How to get date_part query to hit index?
Giving further detail this time - hopefully it will help pinpoint the issue:
Schema of the largest table:
-- Table Definition ----------------------------------------------
CREATE TABLE reportimpression (
datelocal timestamp without time zone,
devicename text,
network text,
sitecode text,
advertisername text,
mediafilename text,
gender text,
agegroup text,
views integer,
impressions integer,
dwelltime numeric
);
-- Indices -------------------------------------------------------
CREATE INDEX reportimpression_feb2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-02-01 00:00:00'::timestamp without time zone AND datelocal < '2019-03-01 00:00:00'::timestamp without time zone;
CREATE INDEX reportimpression_mar2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-03-01 00:00:00'::timestamp without time zone AND datelocal < '2019-04-01 00:00:00'::timestamp without time zone;
CREATE INDEX reportimpression_jan2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-01-01 00:00:00'::timestamp without time zone AND datelocal < '2019-02-01 00:00:00'::timestamp without time zone;
Slow query:
SELECT
date_part('hour', datelocal) AS hour,
SUM(CASE WHEN gender = 'male' THEN views ELSE 0 END) AS male,
SUM(CASE WHEN gender = 'female' THEN views ELSE 0 END) AS female
FROM reportimpression
WHERE
datelocal >= '3-1-2019' AND
datelocal < '4-1-2019'
GROUP BY date_part('hour', datelocal)
ORDER BY date_part('hour', datelocal)
The date range in this query will generally be for an entire month (it accepts user input from a web based report) - as you can see, I tried creating an index for each month's worth of data. That helped, but as far as I can tell, unless the query has recently been run (putting the results into the cache), it can still take up to a minute to run.
Explain analyze results:
Finalize GroupAggregate (cost=1035890.38..1035897.86 rows=1361 width=24) (actual time=3536.089..3536.108 rows=24 loops=1)
Group Key: (date_part('hour'::text, datelocal))
-> Sort (cost=1035890.38..1035891.06 rows=1361 width=24) (actual time=3536.083..3536.087 rows=48 loops=1)
Sort Key: (date_part('hour'::text, datelocal))
Sort Method: quicksort Memory: 28kB
-> Gather (cost=1035735.34..1035876.21 rows=1361 width=24) (actual time=3535.926..3579.818 rows=48 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Partial HashAggregate (cost=1034735.34..1034740.11 rows=1361 width=24) (actual time=3532.917..3532.933 rows=24 loops=2)
Group Key: date_part('hour'::text, datelocal)
-> Parallel Index Scan using reportimpression_mar2019_index on reportimpression (cost=0.09..1026482.42 rows=3301168 width=17) (actual time=0.045..2132.174 rows=2801158 loops=2)
Planning time: 0.517 ms
Execution time: 3579.965 ms
I wouldn't think 10 million records would be too much to handle, especially given that I recently bumped up the PG plan that I'm on to try to throw resources at it, so I assume that the issue is still just either my indexes or my queries not being very efficient.
A materialized view is the way to go for what you outlined. Querying past months of read-only data works without refreshing the MV. You may want to special-case the current month to cover that, too.
The underlying query can still benefit from an index. There are two directions you might take:
Partial indexes like you have now won't buy much in your scenario. If you mostly query by month (and add / drop rows by month), table partitioning might be an idea. Then you have your indexes partitioned automatically, too.
If your rows are wide, create an index that allows index-only scans. Ideally, use a "covering" index with INCLUDE
(Postgres 11 or later):
CREATE INDEX reportimpression_covering_idx ON reportimpression(datelocal) INCLUDE (views, gender);
See:
If your rows are physically sorted by datelocal
. It's extremely small and probably about as fast as a B-tree index for your case. But being so small it will stay cached more easily and not push out other data as much.
CREATE INDEX reportimpression_brin_idx ON reportimpression USING BRIN (datelocal);
You may be interested in CLUSTER
or pg_repack
to physically sort table rows. pg_repack
can do it without exclusive locks on the table and even without a B-tree index (required by CLUSTER
). But it's an additional module not shipped with the standard distribution of Postgres.
Related: