sqlsql-servergroup-bygrouping-sets

Median when using GROUPING SETS


Is there a good way to calculate a median when using GROUP BY GROUPING SETS in SQL server?

I'm trying to generate some analysis for different groupings. All the other aggregations (COUNT/ SUM/ AVG/ MIN/ MAX) work fine with GROUPING SETS.

Using PERCENTILE_CONT to calculate a median fails as [value] 'is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause'

I'm not wedded to PERCENTILE_CONT, but am struggling to think of a solution other than a very long UNION of this whole query repeated 4 times, but just using GROUP BY for one of the grouping sets at a time. Is there a better alternative I'm missing?

SELECT
    CASE 
        WHEN [type] IS NOT NULL AND [period] IS NOT NULL AND [area] IS NOT NULL THEN 'LA by event type by month'
        WHEN [type] IS NOT NULL AND [period] IS NULL AND [area] IS NOT NULL THEN 'LA by event type any time'
        WHEN [type] IS NOT NULL AND [period] IS NOT NULL AND [area] IS NULL THEN 'England event type by month'
        WHEN [type] IS NOT NULL AND [period] IS NULL AND [area] IS NULL THEN 'England event type any time'
        ELSE 'CHECK GROUPING' END AS [grouping_type]
    ,[type]
    ,[period]
    ,[area]
    ,COUNT(*) AS [number_projects]
    ,SUM([units]) AS [total_units]
    ,AVG([value]) as [mean_value]
    ,MIN([value]) AS [min_value]
    ,MAX([value]) AS [max_value]

    /*INCLUDING THIS CAUSES AN ERROR*/
    ,PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY [value]) OVER() AS median_value
FROM tb_source main
GROUP BY 
    GROUPING SETS (
        ([type])        
        ,([type], [period]) 
        ,([type], [area]) 
        ,([type], [period], [area])
        ,())

Source data sample

Expected output sample (for 2 of the grouping sets)


Solution

  • PERCENTILE_CONT is a window function, which means you need to put it into a derived table, you can't use it at the aggregation level unless you are only taking the median after aggregation.

    @AaronBertrand has a long article explaining all the different median options. But PERCENTILE_CONT is the simplest, especially in your case, although it's probably not very efficient.

    SELECT
        CASE WHEN type IS NOT NULL THEN
            CASE WHEN [period] IS NOT NULL THEN
                CASE WHEN area IS NOT NULL THEN
                    'LA by event type by month'
                ELSE
                    'England event type by month'
                END
            ELSE
                CASE WHEN area IS NOT NULL THEN
                    'LA by event type any time'
                ELSE
                    'England event type any time'
                END
            END
          ELSE
              'CHECK GROUPING'
          END AS grouping_type
        ,type
        ,period
        ,area
        ,COUNT(*) AS number_projects
        ,SUM(units) AS total_units
        ,AVG(value) as mean_value
        ,MIN(value) AS min_value
        ,MAX(value) AS max_value
        ,CASE WHEN type IS NOT NULL THEN
            CASE WHEN [period] IS NOT NULL THEN
                CASE WHEN area IS NOT NULL THEN
                    MIN(median_type_period_area)
                ELSE
                    MIN(median_type_period)
                END
            ELSE
                CASE WHEN area IS NOT NULL THEN
                    MIN(median_type_area)
                ELSE
                    MIN(median_type)
                END
            END
            ELSE
                MIN(median_all)
         END AS median_value
    FROM (
        SELECT *,
            PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) OVER (PARTITION BY type) AS median_type,
            PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) OVER (PARTITION BY type, [period]) AS median_type_period,
            PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) OVER (PARTITION BY type, area) AS median_type_area,
            PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) OVER (PARTITION BY type, [period], area) AS median_type_period_area,
            PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) OVER () AS median_all,
        FROM tb_source main
    ) main
    GROUP BY 
        GROUPING SETS (
            (type)
            (type, [period]),
            (type, area),
            (type, [period], area),
            ()
        );