I've got a table pings with about 15 million rows in it. I'm on Postgres 9.2.4. The relevant columns are a foreign key monitor_id, a created_at timestamp, and a response_time integer representing milliseconds. Here is the exact structure:
     Column      |            Type             |                     Modifiers
-----------------+-----------------------------+----------------------------------------------------
 id              | integer                     | not null default nextval('pings_id_seq'::regclass)
 url             | character varying(255)      |
 monitor_id      | integer                     |
 response_status | integer                     |
 response_time   | integer                     |
 created_at      | timestamp without time zone |
 updated_at      | timestamp without time zone |
 response_body   | text                        |
Indexes:
"pings_pkey" PRIMARY KEY, btree (id)
"index_pings_on_created_at_and_monitor_id" btree (created_at DESC, monitor_id)
"index_pings_on_monitor_id" btree (monitor_id)
I want to query for all the response times that are not NULL (about 90% won't be NULL, about 10% will be), that have a specific monitor_id, and that were created in the last month. I'm doing the query with ActiveRecord, but the end result looks something like this:
SELECT "pings"."response_time"
FROM "pings"
WHERE "pings"."monitor_id" = 3
AND (created_at > '2014-03-03 20:23:07.254281'
AND response_time IS NOT NULL)
It's a pretty basic query, but it takes about 2000ms to run, which seems rather slow. I'm assuming an index would make it faster, but all the indexes I've tried aren't working, which I'm assuming means I'm not indexing properly.
When I run EXPLAIN ANALYZE, this is what I get:
Bitmap Heap Scan on pings  (cost=6643.25..183652.31 rows=83343 width=4) (actual time=58.997..1736.179 rows=42063 loops=1)
  Recheck Cond: (monitor_id = 3)
  Rows Removed by Index Recheck: 11643313
  Filter: ((response_time IS NOT NULL) AND (created_at > '2014-03-03 20:23:07.254281'::timestamp without time zone))
  Rows Removed by Filter: 324834
  ->  Bitmap Index Scan on index_pings_on_monitor_id  (cost=0.00..6622.41 rows=358471 width=0) (actual time=57.935..57.935 rows=366897 loops=1)
        Index Cond: (monitor_id = 3)
So there is an index on monitor_id that is being used towards the end, but nothing else. I've tried various permutations and orders of compound indexes using monitor_id, created_at, and response_time. I've tried ordering the index by created_at in descending order. I've tried a partial index with response_time IS NOT NULL.
Nothing I've tried makes the query any faster. How would you optimize and/or index it?
Create a partial multicolumn index with the right sequence of columns. You have one:
"index_pings_on_created_at_and_monitor_id" btree (created_at DESC, monitor_id)
But the sequence of columns is not serving you well. Reverse it:
CREATE INDEX idx_pings_monitor_created ON pings (monitor_id, created_at DESC)
WHERE response_time IS NOT NULL;
The rule of thumb here is: equality first, ranges later. With monitor_id first, the scan descends straight to the rows for one monitor and reads a single contiguous range of created_at values; with the range column first, the matching rows for one monitor are scattered across the whole time span. More about that:
Multicolumn index and performance
As discussed, the condition WHERE response_time IS NOT NULL does not buy you much. If you have other queries that could utilize this index including NULL values in response_time, drop it. Else, keep it.
You can probably also drop both other existing indexes. More about the sequence of columns in btree indexes:
Working of indexes in PostgreSQL
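If you do, that would look like this (pings_pkey stays, of course; only drop these if no other queries depend on them):

-- only if no other queries depend on these
DROP INDEX index_pings_on_monitor_id;
DROP INDEX index_pings_on_created_at_and_monitor_id;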
If all you need from the table is response_time, this can be much faster yet - if you don't have lots of write operations on the rows of your table. Include the column in the index at the last position to allow index-only scans (making it a "covering index"):
CREATE INDEX idx_pings_monitor_created
ON pings (monitor_id, created_at DESC, response_time)
WHERE response_time IS NOT NULL; -- maybe
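Index-only scans in 9.2 only kick in while the visibility map is reasonably current, which is why a write-heavy table benefits less. Autovacuum maintains the map, but you can refresh it manually:

-- refresh the visibility map so index-only scans can skip heap fetches
VACUUM ANALYZE pings;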
Or, take it one step further:
Create a tiny helper function. Effectively a "global constant" in your db:
CREATE OR REPLACE FUNCTION f_ping_event_horizon()
RETURNS timestamp LANGUAGE sql IMMUTABLE COST 1 AS
$$SELECT '2014-03-03 0:0'::timestamp$$; -- One month in the past
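A quick sanity check; the result is just the timestamp baked in above:

SELECT f_ping_event_horizon();  -- 2014-03-03 00:00:00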
Use it as condition in your index:
CREATE INDEX idx_pings_monitor_created_response_time
ON pings (monitor_id, created_at DESC, response_time)
WHERE response_time IS NOT NULL -- maybe
AND created_at > f_ping_event_horizon();
And your query looks like this now:
SELECT response_time
FROM pings
WHERE monitor_id = 3
AND response_time IS NOT NULL
AND created_at > '2014-03-03 20:23:07.254281'
AND created_at > f_ping_event_horizon();
Aside: I trimmed some noise.
The last condition seems logically redundant. Only include it if Postgres does not understand it can use the index without it; it might be necessary. The actual timestamp in the query must be later than the one in the function, but that's obviously the case according to your comments.
This way we cut all the irrelevant rows and make the index much smaller. The effect degrades slowly over time, so refit the event horizon and recreate the indexes from time to time to get rid of the added weight. A weekly cron job would do, for example.
When updating (recreating) the function, you need to recreate all indexes that use the function in any way, best in the same transaction, because the IMMUTABLE declaration for the helper function is a bit of a false promise. But Postgres only accepts immutable functions in index definitions, so we have to lie about it. More about that:
Does PostgreSQL support "accent insensitive" collations?
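A maintenance run could look like this sketch (the new timestamp is just an example; use whatever cutoff fits):

BEGIN;

-- move the event horizon forward (example date)
CREATE OR REPLACE FUNCTION f_ping_event_horizon()
  RETURNS timestamp LANGUAGE sql IMMUTABLE COST 1 AS
$$SELECT '2014-04-03 0:0'::timestamp$$;

-- recreate every index that references the function
DROP INDEX idx_pings_monitor_created_response_time;
CREATE INDEX idx_pings_monitor_created_response_time
ON pings (monitor_id, created_at DESC, response_time)
WHERE response_time IS NOT NULL  -- maybe
AND   created_at > f_ping_event_horizon();

COMMIT;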
Why the function at all? This way, all the queries using the index can remain unchanged.
With all of these changes the query should be faster by orders of magnitude now. A single, continuous index-only scan is all that's needed. Can you confirm that?
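For instance, by running the rewritten query under EXPLAIN and checking for an Index Only Scan node (with few or no heap fetches) in the plan:

EXPLAIN (ANALYZE, BUFFERS)
SELECT response_time
FROM   pings
WHERE  monitor_id = 3
AND    response_time IS NOT NULL
AND    created_at > '2014-03-03 20:23:07.254281'
AND    created_at > f_ping_event_horizon();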