I have two simple tables:
Profile (20M entries)
----------------------------
pId | fail
123 | 0
234 | 2
345 | 0
Work (50M entries)
-----------------
pId
123
234
123
345
123
345
I just want to mark fail as 1 in the Profile table for entries whose pId occurs more than a threshold number of times in the Work table. pId in the Profile table is indexed, and I don't want to touch rows where fail is not 0.
The query I'm using right now is:
UPDATE Profile
SET fail = 1
WHERE pId IN
(
SELECT pId
FROM Work
GROUP BY pId
HAVING COUNT(*) > 5
)
AND Profile.fail = 0;
In pgAdmin, I get the explain plan below:
"Update on Profile a (cost=1134492.79..1559750.23 rows=5180 width=1014)"
" -> Hash Join (cost=1134492.79..1559750.23 rows=5180 width=1014)"
" Hash Cond: (a.pId = b.pId)"
" -> Seq Scan on Profile a (cost=0.00..425216.00 rows=15462 width=976)"
" Filter: (fail = 0)"
" -> Hash (cost=1134491.95..1134491.95 rows=67 width=32)"
" -> Subquery Scan on b (cost=1134488.78..1134491.95 rows=67 width=32)"
" -> HashAggregate (cost=1134488.78..1134491.28 rows=67 width=4)"
" Group Key: Work.pId"
" Filter: (count(*) > 5)"
" -> Seq Scan on Work (cost=0.00..894341.52 rows=48029452 width=4)"
It takes a couple of minutes to run.
Now when these two tables are created at runtime with the same data, the query plan changes to:
"Update on Profile (cost=1250747.42..1251317.47 rows=67 width=386)"
" -> Nested Loop (cost=1250747.42..1251317.47 rows=67 width=386)"
" -> Subquery Scan on "ANY_subquery" (cost=1250746.98..1250750.15 rows=67 width=32)"
" -> HashAggregate (cost=1250746.98..1250749.48 rows=67 width=4)"
" Group Key: Work.pId"
" Filter: (count(*) > 5)"
" -> Seq Scan on Work (cost=0.00..985990.32 rows=52951332 width=4)"
" -> Index Scan using Profile_idx on Profile (cost=0.44..8.46 rows=1 width=348)"
" Index Cond: (pId = "ANY_subquery".pId)"
" Filter: (fail = 0)"
This one takes an hour to run. I've even tried switching from the subquery to a join, but it still produces the same plan.
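For reference, the join rewrite was roughly of this shape (a sketch using UPDATE ... FROM; same names and threshold as above):

UPDATE Profile
SET fail = 1
FROM (
    SELECT pId
    FROM Work
    GROUP BY pId
    HAVING COUNT(*) > 5   -- same threshold as the IN variant
) w
WHERE Profile.pId = w.pId
  AND Profile.fail = 0;

Any help would be appreciated.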
The key to your problem is probably this:

Now when these two tables are created at runtime with the same data, the query plan changes [for the worse]
PostgreSQL collects table statistics automatically, but it takes a short while for autoanalyze to kick in. All queries that run between the bulk data modification and the time that autoanalyze finishes are likely to have bad execution plans.
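You can check whether autoanalyze has caught up by querying pg_stat_user_tables (a quick sanity check; the lower-case names assume unquoted identifiers):

-- NULL or stale timestamps here mean the planner is working with old statistics
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('profile', 'work');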
It is a good idea to explicitly run ANALYZE on the tables at the end of mass data modifications.
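For example, right after the runtime bulk load (a minimal sketch using the table names from the question):

-- gather fresh statistics so the planner sees realistic row counts
ANALYZE Profile;
ANALYZE Work;

After that, the row estimates for the subquery should be accurate again, and the planner should be able to go back to the faster hash join plan.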