I want to count occurrences of some values in 2 big tables and return a single result set. In theory, there is no difference between counting the values in each table and then joining the results, versus joining the tables on the values and then counting. In practice, joining the tables first makes the query less amenable to optimisation by the DBMS. In particular, the Postgres 16.3 planner is unable to push a condition of the form WHERE col IN (1,2,3,...) into the branches of an OUTER JOIN to take advantage of the appropriate indexes. This means that I have to apply the filter manually inside each joined table (subquery or view).
Is there a way to avoid manually applying the filter to each joined table? If not, what is the best practice or idiomatic way of doing this?
Here is a toy example in Postgres 16.3 that illustrates the problem.
DROP TABLE IF EXISTS t1 CASCADE;
DROP TABLE IF EXISTS t2 CASCADE;
CREATE TABLE t1 (id INT);
CREATE TABLE t2 (id INT);
-- t1 contains some negative ids, while t2 does not:
INSERT INTO t1
SELECT (SELECT (- random()*20)::int + i % 10000 LIMIT 1)
FROM generate_series(1,1000000) i;
INSERT INTO t2
SELECT (SELECT (random()*20)::int + i % 10000 LIMIT 1)
FROM generate_series(1,1000000) i;
CREATE INDEX "t1_idx" ON t1 USING btree ("id");
CREATE INDEX "t2_idx" ON t2 USING btree ("id");
-- views for counts:
CREATE VIEW c1 AS (
SELECT id, count(1) as cnt1
FROM t1
GROUP BY id
);
CREATE VIEW c2 AS (
SELECT id, count(1) as cnt2
FROM t2
GROUP BY id
);
Now I want counts of selected values from both tables in a single result set. The following query is extremely slow in a real database. Here we see why:
explain analyze
SELECT *
FROM c1
NATURAL FULL OUTER JOIN c2
WHERE id IN (-5, 11);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Hash Full Join (cost=27979.91..28107.22 rows=5011 width=20) (actual time=630.949..631.763 rows=2 loops=1)
Hash Cond: (t2.id = t1.id)
Filter: (COALESCE(t1.id, t2.id) = ANY ('{-5,11}'::integer[]))
Rows Removed by Filter: 10038
-> Finalize HashAggregate (cost=13881.60..13981.90 rows=10030 width=12) (actual time=306.364..307.877 rows=10020 loops=1)
Group Key: t2.id
Batches: 1 Memory Usage: 1169kB
-> Gather (cost=11675.00..13781.30 rows=20060 width=12) (actual time=287.054..294.920 rows=30056 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=10675.00..10775.30 rows=10030 width=12) (actual time=276.957..280.021 rows=10019 loops=3)
Group Key: t2.id
Batches: 1 Memory Usage: 1169kB
Worker 0: Batches: 1 Memory Usage: 1169kB
Worker 1: Batches: 1 Memory Usage: 1169kB
-> Parallel Seq Scan on t2 (cost=0.00..8591.67 rows=416667 width=4) (actual time=0.008..73.377 rows=333333 loops=3)
-> Hash (cost=13973.39..13973.39 rows=9993 width=12) (actual time=321.427..321.465 rows=10020 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 598kB
-> Finalize HashAggregate (cost=13873.46..13973.39 rows=9993 width=12) (actual time=318.759..320.219 rows=10020 loops=1)
Group Key: t1.id
Batches: 1 Memory Usage: 1169kB
-> Gather (cost=11675.00..13773.53 rows=19986 width=12) (actual time=300.414..306.962 rows=30057 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=10675.00..10774.93 rows=9993 width=12) (actual time=291.697..297.355 rows=10019 loops=3)
Group Key: t1.id
Batches: 1 Memory Usage: 1169kB
Worker 0: Batches: 1 Memory Usage: 1169kB
Worker 1: Batches: 1 Memory Usage: 1169kB
-> Parallel Seq Scan on t1 (cost=0.00..8591.67 rows=416667 width=4) (actual time=0.013..73.148 rows=333333 loops=3)
Planning Time: 0.298 ms
Execution Time: 632.066 ms
(32 rows)
We see that Filter: (COALESCE(t1.id, t2.id) = ANY ('{-5,11}'::integer[]))
is applied to the joined result after sequential scans on both tables, which is very suboptimal.
If I clone the WHERE ... clause into each argument of the JOIN, I get performance that is better by orders of magnitude. The planner pushes the filter condition inside each joined table and is able to replace the sequential scans with index-only scans:
explain analyze
SELECT *
FROM (
SELECT * FROM c1
WHERE id IN (-5, 11)
)
NATURAL FULL OUTER JOIN (
SELECT * FROM c2
WHERE id IN (-5, 11)
);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Merge Full Join (cost=0.85..33.58 rows=198 width=20) (actual time=0.051..0.068 rows=2 loops=1)
Merge Cond: (t1.id = t2.id)
-> GroupAggregate (cost=0.42..15.33 rows=198 width=12) (actual time=0.029..0.045 rows=2 loops=1)
Group Key: t1.id
-> Index Only Scan using t1_idx on t1 (cost=0.42..12.35 rows=200 width=4) (actual time=0.012..0.029 rows=178 loops=1)
Index Cond: (id = ANY ('{-5,11}'::integer[]))
Heap Fetches: 0
-> GroupAggregate (cost=0.42..15.29 rows=197 width=12) (actual time=0.020..0.020 rows=1 loops=1)
Group Key: t2.id
-> Index Only Scan using t2_idx on t2 (cost=0.42..12.32 rows=199 width=4) (actual time=0.009..0.015 rows=64 loops=1)
Index Cond: (id = ANY ('{-5,11}'::integer[]))
Heap Fetches: 0
Planning Time: 0.165 ms
Execution Time: 0.110 ms
(14 rows)
In the slow query, it is the condition WHERE id IN (-5, 11) that prevents the planner from producing a better execution plan. If I replace it with WHERE id IN (11) or with WHERE id = 11, then there is no difference between the queries.
In this toy example we only have to add a couple of lines of code. In a real database I have numerous views and more complex queries, and the inconvenience caused by manual optimisation is multiplied many times over. Instead of creating a single aggregate view inside the database and sending a simple query from a client, I have to expose several intermediate views and construct a rather complex query on the client side. Can I do better?
Update 2024-10-15. Here is how the planner is able to push the WHERE COALESCE(c1.id, c2.id) IN (11) condition into the JOIN branches. Note that IN (11) is equivalent to = 11.
explain analyze
SELECT *
FROM c1
FULL OUTER JOIN c2 ON c1.id = c2.id
WHERE COALESCE(c1.id, c2.id) IN (11);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Merge Full Join (cost=0.85..12.88 rows=1 width=24) (actual time=0.057..0.059 rows=1 loops=1)
-> GroupAggregate (cost=0.42..6.43 rows=1 width=12) (actual time=0.034..0.035 rows=1 loops=1)
-> Index Only Scan using t1_idx on t1 (cost=0.42..6.17 rows=100 width=4) (actual time=0.015..0.023 rows=107 loops=1)
Index Cond: (id = 11)
Heap Fetches: 0
-> Materialize (cost=0.42..6.44 rows=1 width=12) (actual time=0.021..0.021 rows=1 loops=1)
-> GroupAggregate (cost=0.42..6.43 rows=1 width=12) (actual time=0.018..0.018 rows=1 loops=1)
-> Index Only Scan using t2_idx on t2 (cost=0.42..6.17 rows=100 width=4) (actual time=0.009..0.014 rows=65 loops=1)
Index Cond: (id = 11)
Heap Fetches: 0
Planning Time: 0.174 ms
Execution Time: 0.105 ms
(12 rows)
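For contrast, putting a second value back into the list makes the same query fall back to the slow hash full join shown at the top of the question, with the COALESCE filter applied after the join (plan output omitted here; easy to verify with EXPLAIN ANALYZE):

```sql
-- Same query as above, but with a two-element list:
-- the COALESCE condition is no longer pushed into the join branches.
EXPLAIN ANALYZE
SELECT *
FROM c1
FULL OUTER JOIN c2 ON c1.id = c2.id
WHERE COALESCE(c1.id, c2.id) IN (-5, 11);
```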
I don't think there is a better way to handle that than rewriting the query and manually pushing the condition into the join branches.
You kind of answered your own question by noticing that you have to change the condition to coalesce(t1.id, t2.id) IN (-5,11) if you apply it after the outer join in order to get equivalent results.
The optimizer tries to be as smart as possible while staying fast, but the kind of logic that would be necessary to prove that the above condition can be modified and pushed down into the join branches is clearly out of reach here.
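If the query has to be rewritten anyway, one way to keep the rewrite maintainable is to wrap it in a server-side SQL function, so clients still send a single simple call instead of constructing the duplicated filters themselves. This is only a sketch against the toy schema above: the function name counts_for is made up, and whether Postgres inlines the function body (and therefore still uses the index-only scans) should be verified with EXPLAIN ANALYZE.

```sql
-- Hypothetical helper: applies the id filter inside both join branches once,
-- so clients only need to send: SELECT * FROM counts_for(ARRAY[-5, 11]);
CREATE FUNCTION counts_for(ids int[])
RETURNS TABLE (id int, cnt1 bigint, cnt2 bigint)
LANGUAGE sql STABLE AS $$
  SELECT *
  FROM (SELECT * FROM c1 WHERE c1.id = ANY (ids)) AS f1
  NATURAL FULL OUTER JOIN
       (SELECT * FROM c2 WHERE c2.id = ANY (ids)) AS f2
$$;
```

This does not avoid the manual pushdown, it just centralises it in one place inside the database.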