sqlpostgresqlquery-optimizationlateral-join

LATERAL JOINS Causing Performance Issues


This query is for a Tableau dashboard. The users want to see the active clients in a sector as well as the contacts at those firms and how the composition of the data changes from quarter to quarter. I created a base time subquery and basically layer in the fact data.

I use LATERAL JOINS to place the data into their respective quarters using timestamps of the historical changes. I tested the query on a subset of the data and it ran well. When opening to the whole data set, the query cost becomes huge. I've already done some optimization but nothing seems to get the cost to an acceptable level.

All the tables have adequate indexes. The final WITH subquery, HACPCQ , where I utilize LATERAL JOINS to layer in the contact data is the issue. The two join tables have 500k rows and 3.5mn rows respectively. I can't get this to complete.

DB engineers won't create a view or stored procedure because of the potential cost to the prod server. No true replication DB available right now.

SQL and query plan below. The only posts I've seen on performance improvement are for a prior Postgres version. How should I rewrite these LATERL JOINS?

-- Q creates a baseline of quarters to analyze the client and contact tables over time. Last full year plus current year.

WITH Q AS (
    SELECT Q_Start, ((Q_Start + INTERVAL '3 months') - INTERVAL '1 day') :: DATE Q_End, RANK() OVER (ORDER BY Q_Start DESC) Q_Rank
    FROM (
        SELECT GENERATE_SERIES( DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '1 year', DATE_TRUNC('quarter', CURRENT_DATE), INTERVAL '3 months') :: DATE Q_Start
        ) q1
    ),

-- HA gathers the historic client records within the timeframe of Q and filters to the last change per quarter.

HA AS (
    SELECT id CL_ID, history_id CL_Hist_ID, history_date CL_Hist_Date FROM (
        SELECT ha.id, history_id, history_date, ROW_NUMBER() OVER (PARTITION BY ha.id, EXTRACT(year FROM history_date),
               EXTRACT(quarter FROM history_date) ORDER BY history_date DESC) AS Row_Rank
        FROM historicalclient ha
        WHERE EXTRACT(year FROM history_date) >= (EXTRACT(year FROM CURRENT_DATE) - 1)
        ) ha1
    WHERE Row_Rank = 1
    ),

-- Tag clients to relevant quarters 

HAQ AS (
    SELECT *
    FROM Q
    JOIN LATERAL (SELECT CL_ID, MAX(CL_Hist_ID) CL_Hist_ID, MAX(CL_Hist_Date) CL_Hist_Date FROM HA WHERE CL_Hist_Date :: DATE <= Q.Q_End GROUP BY CL_ID) HA
    ON TRUE
),

-- Same deal as HA but for contactposition.

HCP AS (
    SELECT CP_ID, client_id CP_CL_ID, contact_id CP_Cont_ID, history_id CP_Hist_ID, history_date CP_Hist_Date, current FROM (
        SELECT hcp.id CP_ID, client_id, Cont_ID, history_id, history_date, current,
               ROW_NUMBER() OVER (PARTITION BY hcp.id, EXTRACT(year FROM history_date), EXTRACT(quarter FROM history_date) ORDER BY history_date DESC) AS Row_Rank
        FROM contactposition hcp
        WHERE historydate >= '10/1/2023'
        ) hcp1
    WHERE Row_Rank = 1
    AND current = TRUE
    ),

-- HC is HCP but for contacts

HC AS (
    SELECT id Cont_ID, history_id Con_Hist_ID, history_date Con_Hist_Date FROM (
        SELECT hc.id, history_id, history_date, ROW_NUMBER() OVER (PARTITION BY hc.id, EXTRACT(year FROM history_date), EXTRACT(quarter FROM history_date) ORDER BY history_date DESC) AS Row_Rank
        FROM historicalcontact hc
        LEFT JOIN contactposition cp
        ON hc.id = cp.Cont_ID
        WHERE EXTRACT(year FROM history_date) >= (EXTRACT(year FROM CURRENT_DATE) - 1)
        ) hc1
    WHERE Row_Rank = 1
),

-- Joining all of the base data sources together

HACPCQ AS (
    SELECT DISTINCT *
    FROM HAQ
    LEFT JOIN LATERAL (SELECT CP_CL_ID, CP_ID, CP_Cont_ID, MAX(CP_Hist_ID) CP_Hist_ID, MAX(CP_Hist_Date) CP_Hist_Date FROM HCP WHERE CP_CL_ID = HAQ.CL_ID AND CP_Hist_Date :: DATE <= HAQ.Q_End GROUP BY CP_CL_ID, CP_ID, CP_Cont_ID) HCP
    ON TRUE
    LEFT JOIN LATERAL (SELECT Cont_ID, MAX(Con_Hist_ID) Con_Hist_ID, MAX(Con_Hist_Date) Con_Hist_Date FROM HC WHERE Cont_ID = HCP.CP_Cont_ID AND Con_Hist_Date :: DATE <= HAQ.Q_End GROUP BY Cont_ID) HC
    ON TRUE
)


    SELECT *
    FROM HACPCQ a

Query Plan

Edit: Switched out the full EXPLAIN query plan for the EXPLAIN ANALYZE query plan of one client. Freed up characters in my post - link below.

https://explain.depesz.com/s/Cor5

Edit 2: Trimmed down query to make it more readable. Still filtered to one client. New EXPLAIN ANALYZE query plan below.

https://explain.depesz.com/s/Wxde


Solution

    1. An interval constant can have an equation in it. You can do interval '1 month -1 day', without having to separately subtract interval '1 day' from interval '1 month'.
    2. SRFs like generate_series() don't need wrapping in subqueries to be selected from.
    3. Default to row_number() rather than rank() if you're just numbering rows.
    4. If you simply add WITH ORDINALITY AS g(val,row_num) when taking rows from a Set-Returning Function, you get numbered values. That way you don't have to collect, sort and number them separately, at all, removing the need for either rank() or row_number().
    5. Adding to timestamp(tz) implies interval addend, saving you an explicit cast.

    Here's a separate demo for the five points above applied to the first CTE Q.

    1. As pointed out by @Laurenz Albe, most of your subqueries can be shortened to a distinct on. It selects the top/sample record per group without having to aggregate, number, then filter everything out except the Row_Rank = 1 that keeps appearing in your code.
    2. In HAQ, the two max() could return values from different rows. It's possible in your data they happen to always come from the same row, but in the query there's nothing enforcing that. Distinct on does enforce that, while also performing and optimising better.
    3. CROSS JOIN is equivalent to JOIN ON TRUE, which is the same as just a comma ,: demo.
    4. The last JOINs don't seem to need to be lateral and you don't need those subqueries at all: they re-do operations you already did and they already expose columns to match on.
    5. There are slight inconsistencies in your query: is it history_date or historydate in contactposition? Does the same table have Cont_ID/cont_id or contact_id column, or both?
    6. Unless you double-quote your identifiers, they get folded to lowercase, so Q_Start is just q_start. Unless you create and refer to it as "Q_Start".

    Applying all of the above, you can slightly clean up your query and speed it up 50x. Demo:

    WITH Q AS (
      SELECT  Q_Start::DATE, 
              (Q_Start + '3 months -1 day')::DATE Q_End, 
              Q_Rank
      FROM GENERATE_SERIES(DATE_TRUNC('quarter',CURRENT_DATE),
                           DATE_TRUNC('year',   CURRENT_DATE) -INTERVAL '1 year', 
                           INTERVAL '-3 months')WITH ORDINALITY g(Q_Start,Q_Rank)
      ),
    HA AS (-- HA gathers the historic client records within the timeframe of Q 
         -- and filters to the last change per quarter.
      SELECT DISTINCT ON (ha.id, 
                          EXTRACT(year FROM history_date),
                          EXTRACT(quarter FROM history_date) )
             ha.id        AS CL_ID, 
             history_id   AS CL_Hist_ID, 
             history_date AS CL_Hist_Date
      FROM historicalclient ha
      WHERE EXTRACT(year FROM history_date)>=(EXTRACT(year FROM CURRENT_DATE) -1)
      ORDER BY ha.id, 
               EXTRACT(year FROM history_date),
               EXTRACT(quarter FROM history_date),
               history_date DESC
      ),
    HAQ AS (-- Tag clients to relevant quarters 
      SELECT * FROM Q
      ,LATERAL(SELECT CL_ID, 
                      MAX(CL_Hist_ID) CL_Hist_ID, 
                      MAX(CL_Hist_Date) CL_Hist_Date 
               FROM HA 
               WHERE CL_Hist_Date::DATE <= Q.Q_End 
               GROUP BY CL_ID) HA
    ),
    HCP AS (-- Same deal as HA but for contactposition.
      SELECT DISTINCT ON (hcp.id, 
                          EXTRACT(year FROM history_date), 
                          EXTRACT(quarter FROM history_date)) 
              hcp.id AS CP_ID, 
              client_id AS CP_CL_ID, 
              contact_id AS CP_Cont_ID, 
              history_id AS CP_Hist_ID, 
              history_date AS CP_Hist_Date, 
              current 
      FROM contactposition hcp
      WHERE history_date >= '10/1/2023'
      AND current IS TRUE
      ),
    HC AS (-- HC is HCP but for contacts
      SELECT DISTINCT ON (hc.id, 
                      EXTRACT(year FROM history_date), 
                      EXTRACT(quarter FROM history_date)) 
         hc.id AS Cont_ID, 
         history_id AS Con_Hist_ID, 
         history_date AS Con_Hist_Date
      FROM historicalcontact hc
      LEFT JOIN contactposition cp
      ON hc.id = cp.contact_id
      WHERE EXTRACT(year FROM history_date)>=(EXTRACT(year FROM CURRENT_DATE) -1)
      ORDER BY hc.id, 
               EXTRACT(year FROM history_date), 
               EXTRACT(quarter FROM history_date),
               history_date DESC
    )
    -- Joining all of the base data sources together
    
        SELECT DISTINCT ON (HAQ.CL_ID, HAQ.Q_Start, HCP.CP_Cont_ID) *
        FROM HAQ, HCP, HC
        WHERE HCP.CP_CL_ID = HAQ.CL_ID
          AND HCP.CP_Hist_Date::DATE <= HAQ.Q_End
          AND HC.Cont_ID = HCP.CP_Cont_ID
          AND HC.Con_Hist_Date::DATE <= HAQ.Q_End
        ORDER BY HAQ.CL_ID, HAQ.Q_Start DESC, HCP.CP_Cont_ID;