postgresqlcommon-table-expressionquery-planner

PostgreSQL upgrade to 12+ changes hash join to slow nested loop


We're trying to upgrade from the version 9 series and have a deal-breaking slow query, which runs okay in 10 and 11 but is many times slower in 12 and 13. I've tested across minor versions in the 11 and 12 series, and the minor version doesn't affect it.

The problem lies in the planner choosing a nested loop join instead of the hash join it should use.

v11 hash join:

->  Nested Loop Left Join  (cost=276056.74..285056.52 rows=1714 width=230) (actual time=13519.865..15864.542 rows=57 loops=1)
      Join Filter: (placement_type.placement_type_id = job_order.placement_type_id)
      Rows Removed by Join Filter: 57
      ->  Nested Loop Left Join  (cost=276056.74..284967.78 rows=1714 width=224) (actual time=13519.837..15864.202 rows=57 loops=1)
            ->  Hash Join  (cost=276056.32..284215.53 rows=1714 width=217) (actual time=13519.803..15863.465 rows=57 loops=1)
                  Hash Cond: (ori.order_id = job_order_1.order_id)
                  ->  HashAggregate  (cost=246953.68..250381.92 rows=342824 width=487) (actual time=13397.503..15017.114 rows=1053462 loops=1)

v12 nested loop:

->  Nested Loop Left Join  (cost=304636.78..316645.32 rows=1716 width=230) (actual time=17799.135..152297.231 rows=57 loops=1)
      Join Filter: (placement_type.placement_type_id = job_order.placement_type_id)
      Rows Removed by Join Filter: 57
      ->  Nested Loop Left Join  (cost=304636.78..316556.50 rows=1716 width=224) (actual time=17799.111..152296.749 rows=57 loops=1)
            ->  Nested Loop  (cost=304636.37..315803.37 rows=1716 width=217) (actual time=17799.075..152295.098 rows=57 loops=1)
                  Join Filter: (job_order_1.order_id = ori.order_id)
                  Rows Removed by Join Filter: 60047277

The upgrade procedure in our test environment is done with pg_upgrade and a full analyze before testing this query.

So what changed in 12?


Solution

  • Thanks to the folks on the PG mailing lists who deserve the credit for this answer.

    Before PostgreSQL 12, using a CTE created an "optimization barrier." In some cases, just by luck, this led to good performance for queries which, otherwise, the planner can't succesfully estimate.

    Adding the MATERIALIZED keyword (WITH ... AS MATERIALIZED (...)) forces the old behavior. So that is one way to address the performance issue. Another way is to optimize the SQL that gives the planner trouble, which may surface when the planner decides to rearrange the processing of CTEs.