I'm playing around with PostgreSQL's lateral
join, specifically to perform a join on a group by/limit.
The query works really well when I'm looking up individual records, but the performance very quickly degrades as we query multiple records. This makes sense given that we have multiple subqueries running individual gather, filter aggregate, sorts.the question is, what Postgres strategy should we look at or how do we refactor the below query in order to make it performant at scale?
We have three main tables with a junction table between two of them:
|Managers| >- |Stores| >- |Store_Products| -< Products
We have all of the historical managers for a given store record, and we have a whole catalogue of products for stores (products may be carried by multiple stores).
Goal:given a Store ID, query the most recent Manager and the most recent Product sold.
It's an inner join from Store to Manager and to Product. Manager & Product must be sorted by date desc and limited to 1 (at least I believe that is the way to get the most recent).
SELECT
store.id as store_id,
manager.id as manager_id,
*
FROM
Stores as store,
LATERAL (
SELECT
*
FROM
Products as product
INNER JOIN Stores_Products store_product on store_product.product_id = product.id
WHERE
store_product.store_id = store.id
ORDER BY
store.date desc
LIMIT 1
) p,
LATERAL (
SELECT
*
FROM
Managers as manager
WHERE
manager.store_id = store.id
ORDER BY
manager.date desc
LIMIT 1
) m
WHERE store.name = 'ABC retail'
This works perfectly when you query a single store. However, if you try to batch query (e.g. WHERE store.name in [...]
), the query gets very slow and memory consuming very quickly.
Is there a better way to query the data that would scale well?
Thank you!
Note: the example given with stores/products is simply a device to communicate the problem. The actual schema is different - so I'd ask not to not put too much thought into whether this is the best way to normalize the schema! Thank you 🙏 !
Perhaps window functions will work faster. In the code below product ordering attribute is left as ...
because in your snippet they seem to be ordered by store.date which looks wrong (it is the property of the store, not of the product and not of the product sold in the store).
SELECT * FROM
-- Let's rank managers within each store, giving rank=1 to the most recent
(
SELECT id,
store_id,
RANK() OVER (PARTITION BY store_id ORDER BY date DESC) AS mgr_rank
FROM Manager
) AS MgrRank
JOIN
-- Let's rank products within each store, giving rank=1 to the most recent
(
SELECT store_id,
Products.*
RANK() OVER (PARTITION BY store_id ORDER BY .... DESC) AS product_rank
FROM Stores_Products JOIN Products ON product_id = Products.id
) AS ProductRank
USING(store_id)
-- Now let's join stores themselves
JOIN Stores ON store_id = Stores.id
-- Select most recent manager and product
WHERE mgr_rank=1 AND product_rank=1 AND Stores.name='ABC retail'
Keep in mind that this particular query will not output stores without managers or products. You need to use outer joins to include them as well.