This query is for a Tableau dashboard. The users want to see the active clients in a sector as well as the contacts at those firms and how the composition of the data changes from quarter to quarter. I created a base time subquery and basically layer in the fact data.
I use LATERAL JOINS
to place the data into their respective quarters using timestamps of the historical changes. I tested the query on a subset of the data and it ran well. When opening to the whole data set, the query cost becomes huge. I've already done some optimization but nothing seems to get the cost to an acceptable level.
All the tables have adequate indexes. The final WITH
subquery, HACPCQ , where I utilize LATERAL JOINS
to layer in the contact data is the issue. The two join tables have 500k rows and 3.5mn rows respectively. I can't get this to complete.
DB engineers won't create a view or stored procedure because of the potential cost to the prod server. No true replication DB available right now.
SQL and query plan below. The only posts I've seen on performance improvement are for a prior Postgres version. How should I rewrite these LATERL JOINS
?
-- Q creates a baseline of quarters to analyze the client and contact tables over time. Last full year plus current year.
WITH Q AS (
SELECT Q_Start, ((Q_Start + INTERVAL '3 months') - INTERVAL '1 day') :: DATE Q_End, RANK() OVER (ORDER BY Q_Start DESC) Q_Rank
FROM (
SELECT GENERATE_SERIES( DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '1 year', DATE_TRUNC('quarter', CURRENT_DATE), INTERVAL '3 months') :: DATE Q_Start
) q1
),
-- HA gathers the historic client records within the timeframe of Q and filters to the last change per quarter.
HA AS (
SELECT id CL_ID, history_id CL_Hist_ID, history_date CL_Hist_Date FROM (
SELECT ha.id, history_id, history_date, ROW_NUMBER() OVER (PARTITION BY ha.id, EXTRACT(year FROM history_date),
EXTRACT(quarter FROM history_date) ORDER BY history_date DESC) AS Row_Rank
FROM historicalclient ha
WHERE EXTRACT(year FROM history_date) >= (EXTRACT(year FROM CURRENT_DATE) - 1)
) ha1
WHERE Row_Rank = 1
),
-- Tag clients to relevant quarters
HAQ AS (
SELECT *
FROM Q
JOIN LATERAL (SELECT CL_ID, MAX(CL_Hist_ID) CL_Hist_ID, MAX(CL_Hist_Date) CL_Hist_Date FROM HA WHERE CL_Hist_Date :: DATE <= Q.Q_End GROUP BY CL_ID) HA
ON TRUE
),
-- Same deal as HA but for contactposition.
HCP AS (
SELECT CP_ID, client_id CP_CL_ID, contact_id CP_Cont_ID, history_id CP_Hist_ID, history_date CP_Hist_Date, current FROM (
SELECT hcp.id CP_ID, client_id, Cont_ID, history_id, history_date, current,
ROW_NUMBER() OVER (PARTITION BY hcp.id, EXTRACT(year FROM history_date), EXTRACT(quarter FROM history_date) ORDER BY history_date DESC) AS Row_Rank
FROM contactposition hcp
WHERE historydate >= '10/1/2023'
) hcp1
WHERE Row_Rank = 1
AND current = TRUE
),
-- HC is HCP but for contacts
HC AS (
SELECT id Cont_ID, history_id Con_Hist_ID, history_date Con_Hist_Date FROM (
SELECT hc.id, history_id, history_date, ROW_NUMBER() OVER (PARTITION BY hc.id, EXTRACT(year FROM history_date), EXTRACT(quarter FROM history_date) ORDER BY history_date DESC) AS Row_Rank
FROM historicalcontact hc
LEFT JOIN contactposition cp
ON hc.id = cp.Cont_ID
WHERE EXTRACT(year FROM history_date) >= (EXTRACT(year FROM CURRENT_DATE) - 1)
) hc1
WHERE Row_Rank = 1
),
-- Joining all of the base data sources together
HACPCQ AS (
SELECT DISTINCT *
FROM HAQ
LEFT JOIN LATERAL (SELECT CP_CL_ID, CP_ID, CP_Cont_ID, MAX(CP_Hist_ID) CP_Hist_ID, MAX(CP_Hist_Date) CP_Hist_Date FROM HCP WHERE CP_CL_ID = HAQ.CL_ID AND CP_Hist_Date :: DATE <= HAQ.Q_End GROUP BY CP_CL_ID, CP_ID, CP_Cont_ID) HCP
ON TRUE
LEFT JOIN LATERAL (SELECT Cont_ID, MAX(Con_Hist_ID) Con_Hist_ID, MAX(Con_Hist_Date) Con_Hist_Date FROM HC WHERE Cont_ID = HCP.CP_Cont_ID AND Con_Hist_Date :: DATE <= HAQ.Q_End GROUP BY Cont_ID) HC
ON TRUE
)
SELECT *
FROM HACPCQ a
Query Plan
Edit: Switched out the full EXPLAIN
query plan for the EXPLAIN ANALYZE
query plan of one client. Freed up characters in my post - link below.
https://explain.depesz.com/s/Cor5
Edit 2: Trimmed down query to make it more readable. Still filtered to one client. New EXPLAIN ANALYZE
query plan below.
interval
constant can have an equation in it. You can do interval '1 month -1 day'
, without having to separately subtract interval '1 day'
from interval '1 month'
.generate_series()
don't need wrapping in subqueries to be selected from.row_number()
rather than rank()
if you're just numbering rows.WITH ORDINALITY AS g(val,row_num)
when taking rows from a Set-Returning Function, you get numbered values. That way you don't have to collect, sort and number them separately, at all, removing the need for either rank()
or row_number()
.timestamp(tz)
implies interval
addend, saving you an explicit cast.Here's a separate demo for the five points above applied to the first CTE Q
.
distinct on
. It selects the top/sample record per group without having to aggregate, number, then filter everything out except the Row_Rank = 1
that keeps appearing in your code.HAQ
, the two max()
could return values from different rows. It's possible in your data they happen to always come from the same row, but in the query there's nothing enforcing that. Distinct on
does enforce that, while also performing and optimising better.CROSS JOIN
is equivalent to JOIN ON TRUE
, which is the same as just a comma ,
: demo.JOIN
s don't seem to need to be lateral
and you don't need those subqueries at all: they re-do operations you already did and they already expose columns to match on.history_date
or historydate
in contactposition
? Does the same table have Cont_ID
/cont_id
or contact_id
column, or both?Q_Start
is just q_start
. Unless you create and refer to it as "Q_Start"
.Applying all of the above, you can slightly clean up your query and speed it up 50x. Demo:
WITH Q AS (
SELECT Q_Start::DATE,
(Q_Start + '3 months -1 day')::DATE Q_End,
Q_Rank
FROM GENERATE_SERIES(DATE_TRUNC('quarter',CURRENT_DATE),
DATE_TRUNC('year', CURRENT_DATE) -INTERVAL '1 year',
INTERVAL '-3 months')WITH ORDINALITY g(Q_Start,Q_Rank)
),
HA AS (-- HA gathers the historic client records within the timeframe of Q
-- and filters to the last change per quarter.
SELECT DISTINCT ON (ha.id,
EXTRACT(year FROM history_date),
EXTRACT(quarter FROM history_date) )
ha.id AS CL_ID,
history_id AS CL_Hist_ID,
history_date AS CL_Hist_Date
FROM historicalclient ha
WHERE EXTRACT(year FROM history_date)>=(EXTRACT(year FROM CURRENT_DATE) -1)
ORDER BY ha.id,
EXTRACT(year FROM history_date),
EXTRACT(quarter FROM history_date),
history_date DESC
),
HAQ AS (-- Tag clients to relevant quarters
SELECT * FROM Q
,LATERAL(SELECT CL_ID,
MAX(CL_Hist_ID) CL_Hist_ID,
MAX(CL_Hist_Date) CL_Hist_Date
FROM HA
WHERE CL_Hist_Date::DATE <= Q.Q_End
GROUP BY CL_ID) HA
),
HCP AS (-- Same deal as HA but for contactposition.
SELECT DISTINCT ON (hcp.id,
EXTRACT(year FROM history_date),
EXTRACT(quarter FROM history_date))
hcp.id AS CP_ID,
client_id AS CP_CL_ID,
contact_id AS CP_Cont_ID,
history_id AS CP_Hist_ID,
history_date AS CP_Hist_Date,
current
FROM contactposition hcp
WHERE history_date >= '10/1/2023'
AND current IS TRUE
),
HC AS (-- HC is HCP but for contacts
SELECT DISTINCT ON (hc.id,
EXTRACT(year FROM history_date),
EXTRACT(quarter FROM history_date))
hc.id AS Cont_ID,
history_id AS Con_Hist_ID,
history_date AS Con_Hist_Date
FROM historicalcontact hc
LEFT JOIN contactposition cp
ON hc.id = cp.contact_id
WHERE EXTRACT(year FROM history_date)>=(EXTRACT(year FROM CURRENT_DATE) -1)
ORDER BY hc.id,
EXTRACT(year FROM history_date),
EXTRACT(quarter FROM history_date),
history_date DESC
)
-- Joining all of the base data sources together
SELECT DISTINCT ON (HAQ.CL_ID, HAQ.Q_Start, HCP.CP_Cont_ID) *
FROM HAQ, HCP, HC
WHERE HCP.CP_CL_ID = HAQ.CL_ID
AND HCP.CP_Hist_Date::DATE <= HAQ.Q_End
AND HC.Cont_ID = HCP.CP_Cont_ID
AND HC.Con_Hist_Date::DATE <= HAQ.Q_End
ORDER BY HAQ.CL_ID, HAQ.Q_Start DESC, HCP.CP_Cont_ID;