postgresqlindexingdatabase-optimization

Prevent massive sequential scan with datetime query


How can I filter queries by date to prevent massive sequential scan on a large database?

My survey app collects responses and each answer to a question in a survey is stored in a table response_answer.

When I query all response_answers for the month I filter by date; however postgres is running a sequential scan on all response_answers (which is in the millions), and it is slow.

The query:

explain analyse 
  select count(*)
    from response_answer
    left join response r on r.id = response_answer.response_id
    where r.date_recorded between '2019-08-01T00:00:00.000Z' and '2019-08-29T23:59:59.999Z';
QUERY PLAN
Aggregate  (cost=517661.09..517661.10 rows=1 width=8) (actual time=139362.882..139362.899 rows=1 loops=1)
  ->  Hash Join  (cost=8063.39..517565.30 rows=38316 width=0) (actual time=126512.031..136806.093 rows=316558 loops=1)
        Hash Cond: (response_answer.response_id = r.id)
        ->  Seq Scan on response_answer  (cost=0.00..480365.73 rows=7667473 width=4) (actual time=1.443..70216.817 rows=7667473 loops=1)
        ->  Hash  (cost=8053.35..8053.35 rows=803 width=4) (actual time=173.467..173.476 rows=7010 loops=1)
              Buckets: 8192 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 311kB
              ->  Seq Scan on response r  (cost=0.00..8053.35 rows=803 width=4) (actual time=0.489..107.417 rows=7010 loops=1)
                    Filter: ((date_recorded >= '2019-08-01'::date) AND (observed_at <= '2019-08-29'::date))
                    Rows Removed by Filter: 153682
Planning time: 21.310 ms
Execution time: 139373.365 ms

I do have indexes on response_answer(response_id), response_answer(id), and response(id).

As the system grows, this query will become so slow it will be unusable because the sequential scan will continue to take longer.

When dealing with large amounts of data, how should I design queries/tables so that the database doesn't have to run a sequential scan of every. single. row. Surely there's a way for Postgres to only consider responses in the date range before finding all the related response_answers?


Solution

  • You need indexes on

    response (date_recorded, id)
    

    and

    response_answer (response_id)
    

    VACUUM the tables for an index only scan.

    With a query like this, you don't need an outer join. PostgreSQL is smart enough to infer that from the fact that response.id cannot be NULL.