sql-serversalesforce-marketing-cloud

Why are these two SQL queries returning different records?


I'm testing a few variations of a query to check which runs most efficiently. Most of the variations return the same number of records, but one doesn't, and I can't understand why. Can anyone help?

The purpose of the query is simply to identify the latest send records for each subscriber in a Business Unit.

Query 1:

SELECT top 250000
    sent.SubscriberKey,
    sent.EventDate as LatestSend
FROM
    ent._Sent as sent
WHERE 
    sent.OYBAccountID = '[id]'
    AND sent.EventDate = (
        SELECT MAX (latest.EventDate) 
        FROM ent._Sent as latest 
        WHERE latest.SubscriberKey = sent.SubscriberKey
    )
ORDER BY LatestSend ASC

Query 2:

SELECT top 250000
    sent.SubscriberKey,
    MAX(sent.EventDate) as LatestSend
FROM
    ent._Sent as sent
WHERE 
    sent.OYBAccountID = '[id]'
GROUP BY
    sent.SubscriberKey
ORDER BY LatestSend ASC

I expected these to return the same result set. However, query 2 returns approximately 3,000 more records than query 1. I can't figure out why those 3,000 records aren't being returned in query 1, as they seem to meet all the criteria.

EDIT: Thanks all for the suggestions. I've checked both query datasets and the source data, and there are no null values for EventDate or SubscriberKey, and no duplicates for SubscriberKey.

RESOLVED: Thank you all, the error was that I was missing the join on OYBAccountID in the subquery of query 1!


Solution

  • Query 1:

    SELECT top 250000
        sent.SubscriberKey,
        sent.EventDate as LatestSend
    FROM
        ent._Sent as sent
    WHERE 
        sent.OYBAccountID = '[id]'
        AND sent.EventDate = (
            SELECT MAX (latest.EventDate) 
            FROM ent._Sent as latest 
            WHERE latest.SubscriberKey = sent.SubscriberKey
        )
    ORDER BY LatestSend ASC
    

    This query is returning the top 250k rows from _Sent, where EventDate is the max EventDate for that SubscriberKey. Tied results per SubscriberKey will be returned as well, but only the top 25k of the whole set will be returned.

    Furthermore, any SubscriberKey which has all null EventDate will be excluded.

    You also appear to be missing an extra join condition in the subquery

            WHERE latest.SubscriberKey = sent.SubscriberKey
              AND latest.OYBAccountID = sent.OYBAccountID
    

    Query 2:

    SELECT top 250000
        sent.SubscriberKey,
        MAX(sent.EventDate) as LatestSend
    FROM
        ent._Sent as sent
    WHERE 
        sent.OYBAccountID = '[id]'
    GROUP BY
        sent.SubscriberKey
    ORDER BY LatestSend ASC
    

    This query aggregates by SubscriberKey, returning the max EventDate but only the first 250k. Tied results per SubscriberKey are ignored, as it's just aggregating.

    It will also not exclude nulls.


    If you wanted the full row then you can use a row-numbering solution.

    SELECT TOP (250000)
        sent.SubscriberKey,
        sent.EventDate as LatestSend
    FROM (
        SELECT sent.*,
          ROW_NUMBER() OVER (PARTITION BY sent.SubscriberKey ORDER BY sent.EventDate DESC) AS rn
        FROM
            ent._Sent as sent
        WHERE 
            sent.OYBAccountID = '[id]'
    ) AS sent
    WHERE sent.rn = 1;
    

    You can replace ROW_NUMBER with DENSE_RANK if you want tied results.