I have two SQL queries like this for postgresql
:
The first query is fetching subscriptions that were never synced or that were updated recently and synced a long time ago:
SELECT *
FROM subscriptions
WHERE (last_sync_at IS NULL OR last_sync_at < updated_at) AND updated_at < $1
ORDER BY last_sync_at NULLS FIRST
LIMIT $2`,
The second query is fetching subscriptions, but the one synchronized the longest time ago first.
SELECT *
FROM subscriptions
ORDER BY last_sync_at NULLS FIRST
LIMIT $1`,
I want to aggregate these two queries like this: If the first query returns less results than the limit, then I want to execute the second query to fill the batch.
For example, if the limit is 100
and the first query returns 90
, I want to execute the second one with a limit of 10
.
I can do this easily with code, but I wonder I there's a way to do that with one SQL query and if it will be more perfomant.
I tried something like this, but obviously it doesn't work:
`(SELECT *, COUNT(*) as nb_rows_selected
FROM subscriptions
WHERE (last_chartmogul_sync_at IS NULL OR last_chartmogul_sync_at < updated_at) AND updated_at < $1
ORDER BY last_chartmogul_sync_at NULLS FIRST
GROUP BY id
LIMIT $2)
UNION
(SELECT *
FROM subscriptions
ORDER BY last_sy NULLS FIRST
LIMIT nb_rows_selected - $2)`,
Any help would be appreciated. Thanks!
I found the right answer by slightly modifying my first approach. I still use a UNION
and a COUNT
but this time I've broken the complexity by using the WITH clause:
WITH unsynced_or_recently_updated_subscriptions AS (
SELECT *
FROM subscriptions
WHERE (last_sync_at IS NULL OR last_sync_at < updated_at)
AND updated_at < $1
ORDER BY last_sync_at NULLS FIRST
LIMIT $2
),
oldest_synced_subscriptions AS (
SELECT *
FROM subscriptions
WHERE id NOT IN (SELECT id FROM unsynced_or_recently_updated_subscriptions)
ORDER BY last_sync_at NULLS FIRST
LIMIT $2 - (SELECT COUNT(*) FROM unsynced_or_recently_updated_subscriptions)
)
SELECT *
FROM unsynced_or_recently_updated_subscriptions
UNION ALL
SELECT *
FROM oldest_synced_subscriptions;
With this approach the second query is limited to [the number of rows of the first query] - [the limit]
. Which is way effective. Because in case the first query fullfiled the limit, the second query will have a limit of 0
and will return immediately.
From my understanding, this combined approach is generally more efficient than executing two separate queries, as it reduces the overhead associated with multiple database connections.
The only scenario where two separate queries might be more performant is when the first query consistently fulfills the limit, in which case the second query is unnecessary. However, this situation is unlikely to occur frequently in my case.
Also the NOT IN
clause should have minimal impact on performance since the id
column is indexed.
An even better solution could be to use NOT EXISTS
instead of NOT IN
for the second subquery:
oldest_synced_subscriptions AS (
SELECT *
FROM subscriptions
WHERE NOT EXISTS (
SELECT id FROM unsynced_or_recently_updated_subscriptions
WHERE unsynced_or_recently_updated_subscriptions.id = subscriptions.id
)
ORDER BY last_chartmogul_sync_at NULLS FIRST
LIMIT $2 - (SELECT COUNT(*) FROM unsynced_or_recently_updated_subscriptions)
)
The NOT EXISTS
clause is generally more efficient than NOT IN
, especially when dealing with NULL
values. In terms of performance, NOT EXISTS
is often faster than NOT IN
because it can stop evaluating as soon as it finds a match, whereas NOT IN
might have to scan the entire result set of the subquery.