sqloracle-databasegrouping

Sum columns by ID in SQL


I am having trouble wrapping my head around the following problem. A sample dataset (truncated):

ID Test Instance Weight Outcome
1 A 1 0.5 2
1 A 1 0.5 2
1 A 1 0.5 2
1 A 1 0.5 2
1 A 1 0.5 2
1 A 1 0.5 2
1 A 1 0.5 2
1 A 1 0.5 2
1 B 1 0.5 2
1 B 1 0.5 2
2 A 1 1 1
2 A 2 1 2
2 B 1 1 2
3 C 1 1 2
3 C 1 1 2
3 C 1 1 2
3 C 1 1 2
3 C 1 1 2
4 A 1 0.5 3
4 B 1 0.5 4
4 B 1 0.5 4
5 A 1 0.5 2
5 B 1 0.5 2
6 A 1 0.25 2
6 A 1 0.25 2
6 A 2 0.25 2
6 A 3 0.25 1
6 B 1 0.25 2

My code:

SELECT 
    ID,
    TEST,
    INSTANCE,
    WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '1' THEN ID END))) AS Outcome1,
    WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '2' THEN ID END))) AS Outcome2,
    WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '3' THEN ID END))) AS Outcome3,
    WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '4' THEN ID END))) AS Outcome4
FROM 
    DATA
WHERE 
    TEST IN ('A', 'B', 'C')
GROUP BY 
    ID, TEST, WEIGHT
ORDER BY 
    ID;

which results in the following output:

ID Test Instance Outcome1 Outcome2 Outcome3 Outcome4
1 A 1 0 0.5 0 0
1 B 1 0 0.5 0 0
2 A 1 1 0 0 0
2 A 2 0 1 0 0
2 B 1 0 1 0 0
3 C 1 0 1 0 0
4 A 1 0 0 0.5 0
4 B 1 0 0 0 0.5
5 A 1 0 0.5 0 0
5 B 1 0 0.5 0 0
6 A 1 0 0.25 0 0
6 A 2 0 0.25 0 0
6 A 3 0.25 0 0 0
6 B 1 0 0.25 0 0
7 A 1 0.5 0 0 0
7 B 1 0.5 0 0 0
8 B 1 1 0 0 0
9 A 1 0 1 0 0
9 B 1 0.5 0 0 0
9 B 2 0.5 0 0 0
10 A 1 0 0.25 0 0
10 A 2 0 0.25 0 0
11 A 1 0 1 0 0
12 B 1 0 1 0 0
13 A 1 0.5 0 0 0
13 B 1 0 0 0 0.5
14 B 1 0 0.5 0 0
15 A 1 1 0 0 0
15 B 1 1 0 0 0
15 C 1 0 0 1 0

However, I would like to have one row for each ID that collapses everything like the following:

ID Outcome1 Outcome2 Outcome3 Outcome4
1 0 1 0 0
2 1 2 0 0
3 0 1 0 0
4 0 0 0.5 0.5
5 0 1 0 0
6 0.25 0.75 0 0
7 1 0 0 0
8 1 0 0 0
9 1 1 0 0
10 0 0.5 0 0
11 0 1 0 0
12 0 1 0 0
13 0 0 0 0.5
14 0 0.5 0 0
15 2 0 1 0

I have tried to pull this without the "Test" variable while keeping it in the group by, but that reduces each ID to just one outcome so no ID has a number higher than 1 (e.g., ID 2 Outcome 2 = 1 when it should be 2, I assume because it was only looking for each unique outcome. so if they achieved outcome 1 on tests A and B, it still only counts outcome 1 once). I tried removing the "unique" identifier as well, but that muddied things too much as some IDs have 50 records for Test A and I only care about each outcome once. (e.g., if they got Outcome 1 25 times and Outcome 2 25 times on Test A, I still want it to display Outcome 1 = 1 x weight, Outcome 2 = 1 x weight). I tried to do this in Excel using pivot tables but unfortunately my dataset is actually around 1 million rows so the export didn't go well. I would love to find a way to do within SQL.

To summarize: I would like to collapse my table so that each ID has one row with all data summarized while accounting for unique test outcomes.

Thank you!


Solution

  • Any modern RDBMS (>= 2010?) will provide you Common Table Expressions that will help you inject the results of your first query in a second query dedicated to the grouping:

    with t as
    (
    
    -- This is your query, unmodified except for the last ; removed:
    SELECT 
        ID,
        TEST,
        WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '1' THEN ID END))) AS Outcome1,
        WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '2' THEN ID END))) AS Outcome2,
        WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '3' THEN ID END))) AS Outcome3,
        WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '4' THEN ID END))) AS Outcome4
    FROM 
        DATA
    WHERE 
        TEST IN ('A', 'B', 'C')
    GROUP BY 
        ID, TEST, WEIGHT
    ORDER BY 
        ID
    
    )
    select
        ID,
        sum(Outcome1) Outcome1,
        sum(Outcome2) Outcome2,
        sum(Outcome3) Outcome3,
        sum(Outcome4) Outcome4
    from t
    group by ID
    order by ID;
    

    This will help you build your solution incrementally:
    you can comment the last select and place a select * from t to verify the results of the intermediate query, before unleashing the definitive version of the final one,
    thus secure every "good" query into its own alias (even isolating interesting attempts in subqueries that you don't want to use in the final result, but keep for posterity),
    and chain the steps to get your final report.

    In essence, they are like temporary tables, that vanish outside of the WITH.