sqlsql-servergroup-bysql-server-2014

GROUP BY works but I don't think it should


From https://learn.microsoft.com/en-us/sql/t-sql/queries/select-group-by-transact-sql?view=sql-server-ver16 it says "However, each table or view column in any nonaggregate expression in the list must be included in the GROUP BY list". Exactly as you'd expect (there seem to be a couple of exceptions, probably MSSQL specific, regarding string concatenation but they don't seem to be relevant here).

At work I intended to write this:

SELECT 
    max(STATS_DATE(object_id, stats_id)), 
    object_name(object_id)
FROM sys.stats  
GROUP BY object_name(object_id)

Which works as expected. What I actually wrote by accident was this

SELECT 
    max(STATS_DATE(object_id, stats_id)), 
    object_name(object_id)
FROM sys.stats  
GROUP BY object_id -- should have been object_name(object_id)

The SELECT contains object_name(object_id) but the GROUP BY contains object_id, which is definitely not in the SELECT. ISTM this should have been rejected, but it runs fine.

I don't get it. Is there some support for functional dependencies that I'm not aware of? Is there even a functional dependency between object_id and object_name(object_id) (I don't think so because object_name(object_id)might not necessarily be unique, although my understanding of FDs is pretty poor).

What's going on?


Solution

  • Is there some support for functional dependencies that I'm not aware of?

    Sort of, but no.

    What you are allowed to do is use the GROUP BY value in any way you like in the SELECT, with any functions or calculations, as long as you are not using any columns or values that have not been grouped.

    So the below query works, because object_id has been grouped, and you are allowed to pass a grouping value to a function.

    SELECT
        max(STATS_DATE(object_id, stats_id)), 
        object_name(object_id)
        -- whatever
    FROM sys.stats  
    GROUP BY object_id
    

    You can add two GROUP BY values together, or pass them to a function, or cast them etc.

    What you cannot do is do the calculation, then group that, and then use the original value from before the calculation. The original value is not present in the GROUP BY so cannot be used, only the actual value that was used.

    SELECT 
        max(STATS_DATE(object_id, stats_id)), 
        object_id
    FROM sys.stats  
    GROUP BY object_name(object_id)
    

    Two further notes:

    Instead, use the following query:

    SELECT 
        MaxStatsDate, 
        t.name
    FROM sys.tables t
    CROSS APPLY (
        SELECT MAX(p.last_updated) AS MaxDate
        FROM sys.stats s
        CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) p
    ) s;