I have a query that aggregates a couple of client sales figures by summing them by groupings of status
, region
and class
categories. The query fixes some issues with the client names using CASE WHEN
expressions, preferring the column big_name
when available, med_name
in big_name
is not available, and small_name
when both big_name
and med_name
are not available. Please see below.
SELECT
month_key,
status,
region,
class,
CASE
WHEN big_name IN ('UNKNOWN', '-', '') AND med_name IN ('UNKNOWN', '-', '') THEN small_name
WHEN big_name IN ('UNKNOWN', '-', '') AND med_name NOT IN ('UNKNOWN', '-', '') THEN med_name
WHEN big_name NOT IN ('UNKNOWN', '-', '') THEN big_name
END AS new_name,
SUM(os.net_sls_amnt) AS net_sales,
SUM(os.gross_sls_amnt) AS gross_sales,
SUM(os.dscnt_amnt) AS discount,
FROM SCHEMA.ORIGINAL.SOURCE As os
WHERE month_key >= 202001
GROUP BY month_key, status, region, class, new_name
ORDER BY month_key, status, region, class, new_name
The source data deals with millions of clients. I'm not suppose to aggregate the sales figures for all clients at all. I'm suppose to differentiate the top 300 clients, which we do care for individually, from the clients not in the top 300 rank. Thus, I need to somehow rank the clients by net_sales
while at the same time use the logic that fixes their name (as in the new_name
in CASE
expression). Then, I have to combine the clients not in the top 300 rank as one single 'SMALL CLIENT', while still aggregate the top 300 ranked clients individually by new_name
. The ranking of top clients is not suppose to differentiate their status
, region
and class
, I'm only suppose to care for fixed client name when ranking them, which adds to the difficulty.
Finally, I don't have editing permissions for this table, and given how massive it is, I can't copy it over either.
Is there anyway to do combine all small merchants under a single name? Thank you so much for your time and for your help!
I'm including here an EXAMPLE with mock up data for 10 clients, and the desired output if we were to care only about the top 3 clients and combined the others under the name 'SMALL CLIENT'.
Any thoughts? I super appreciate any help at all!
There are different ways to do this with unions and by grouping sets. It's probably just easiest to aggregate a second time after computing the ranks and collapsing everything past the cutoff into a single bucket.
Presumably if there are ties that overflow past 300 you would still want to see the individual breakout (or note ties in the earlier rankings.) You'll use rank()
as below, rather than row_number()
, to accommodate that requirement.
The sample data left me unclear about the proper order of coalescing the name values but that's a trivial adjustment.
There's also a minor assumption that no legitimate companies will be called NO NAME or SMALL CLIENT. That is easy to tweak if necessary.
WITH mydata AS (
SELECT *,
COALESCE(
CASE WHEN big_name NOT IN ('UNKNOWN', '-', '') THEN big_name END,
CASE WHEN med_name NOT IN ('UNKNOWN', '-', '') THEN med_name END,
CASE WHEN small_name NOT IN ('UNKNOWN', '-', '') THEN small_name END,
'NO NAME'
) AS new_name
FROM SCHEMA.ORIGINAL.SOURCE
), aggregated AS (
SELECT
new_name, month_key, status, region, class,
SUM(net_sls_amnt) AS net_sales,
SUM(gross_sls_amnt) AS gross_sales,
SUM(dscnt_amnt) AS discount,
SUM(SUM(net_sls_amnt)) OVER (PARTITION BY new_name) as combined_sales
FROM mydata
WHERE month_key >= 202001
GROUP BY new_name, month_key, status, region, class
), ranked AS (
SELECT *, RANK() OVER (ORDER BY combined_sales) as rnk
FROM aggregated
)
select
min(rnk) as ranking,
case when rnk < 301 then new_name else 'SMALL CLIENT' end as name,
month_key, status, region, class,
sum(net_sales) as net_sales,
sum(gross_sales) as gross_sales,
sum(discount) as discount
from ranked
group by
case when rnk < 301 then new_name else 'SMALL CLIENT' end,
month_key, status, region, class
order by ranking, name, month_key, status, region, class;