I am having trouble wrapping my head around the following problem. A sample dataset (truncated):
ID | Test | Instance | Weight | Outcome |
---|---|---|---|---|
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | B | 1 | 0.5 | 2 |
1 | B | 1 | 0.5 | 2 |
2 | A | 1 | 1 | 1 |
2 | A | 2 | 1 | 2 |
2 | B | 1 | 1 | 2 |
3 | C | 1 | 1 | 2 |
3 | C | 1 | 1 | 2 |
3 | C | 1 | 1 | 2 |
3 | C | 1 | 1 | 2 |
3 | C | 1 | 1 | 2 |
4 | A | 1 | 0.5 | 3 |
4 | B | 1 | 0.5 | 4 |
4 | B | 1 | 0.5 | 4 |
5 | A | 1 | 0.5 | 2 |
5 | B | 1 | 0.5 | 2 |
6 | A | 1 | 0.25 | 2 |
6 | A | 1 | 0.25 | 2 |
6 | A | 2 | 0.25 | 2 |
6 | A | 3 | 0.25 | 1 |
6 | B | 1 | 0.25 | 2 |
My code:
SELECT
ID,
TEST,
INSTANCE,
WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '1' THEN ID END))) AS Outcome1,
WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '2' THEN ID END))) AS Outcome2,
WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '3' THEN ID END))) AS Outcome3,
WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '4' THEN ID END))) AS Outcome4
FROM
DATA
WHERE
TEST IN ('A', 'B', 'C')
GROUP BY
ID, TEST, WEIGHT
ORDER BY
ID;
which results in the following output:
ID | Test | Instance | Outcome1 | Outcome2 | Outcome3 | Outcome4 |
---|---|---|---|---|---|---|
1 | A | 1 | 0 | 0.5 | 0 | 0 |
1 | B | 1 | 0 | 0.5 | 0 | 0 |
2 | A | 1 | 1 | 0 | 0 | 0 |
2 | A | 2 | 0 | 1 | 0 | 0 |
2 | B | 1 | 0 | 1 | 0 | 0 |
3 | C | 1 | 0 | 1 | 0 | 0 |
4 | A | 1 | 0 | 0 | 0.5 | 0 |
4 | B | 1 | 0 | 0 | 0 | 0.5 |
5 | A | 1 | 0 | 0.5 | 0 | 0 |
5 | B | 1 | 0 | 0.5 | 0 | 0 |
6 | A | 1 | 0 | 0.25 | 0 | 0 |
6 | A | 2 | 0 | 0.25 | 0 | 0 |
6 | A | 3 | 0.25 | 0 | 0 | 0 |
6 | B | 1 | 0 | 0.25 | 0 | 0 |
7 | A | 1 | 0.5 | 0 | 0 | 0 |
7 | B | 1 | 0.5 | 0 | 0 | 0 |
8 | B | 1 | 1 | 0 | 0 | 0 |
9 | A | 1 | 0 | 1 | 0 | 0 |
9 | B | 1 | 0.5 | 0 | 0 | 0 |
9 | B | 2 | 0.5 | 0 | 0 | 0 |
10 | A | 1 | 0 | 0.25 | 0 | 0 |
10 | A | 2 | 0 | 0.25 | 0 | 0 |
11 | A | 1 | 0 | 1 | 0 | 0 |
12 | B | 1 | 0 | 1 | 0 | 0 |
13 | A | 1 | 0.5 | 0 | 0 | 0 |
13 | B | 1 | 0 | 0 | 0 | 0.5 |
14 | B | 1 | 0 | 0.5 | 0 | 0 |
15 | A | 1 | 1 | 0 | 0 | 0 |
15 | B | 1 | 1 | 0 | 0 | 0 |
15 | C | 1 | 0 | 0 | 1 | 0 |
However, I would like to have one row for each ID that collapses everything like the following:
ID | Outcome1 | Outcome2 | Outcome3 | Outcome4 |
---|---|---|---|---|
1 | 0 | 1 | 0 | 0 |
2 | 1 | 2 | 0 | 0 |
3 | 0 | 1 | 0 | 0 |
4 | 0 | 0 | 0.5 | 0.5 |
5 | 0 | 1 | 0 | 0 |
6 | 0.25 | 0.75 | 0 | 0 |
7 | 1 | 0 | 0 | 0 |
8 | 1 | 0 | 0 | 0 |
9 | 1 | 1 | 0 | 0 |
10 | 0 | 0.5 | 0 | 0 |
11 | 0 | 1 | 0 | 0 |
12 | 0 | 1 | 0 | 0 |
13 | 0 | 0 | 0 | 0.5 |
14 | 0 | 0.5 | 0 | 0 |
15 | 2 | 0 | 1 | 0 |
I have tried to pull this without the "Test" variable while keeping it in the group by, but that reduces each ID to just one outcome so no ID has a number higher than 1 (e.g., ID 2 Outcome 2 = 1 when it should be 2, I assume because it was only looking for each unique outcome. so if they achieved outcome 1 on tests A and B, it still only counts outcome 1 once). I tried removing the "unique" identifier as well, but that muddied things too much as some IDs have 50 records for Test A and I only care about each outcome once. (e.g., if they got Outcome 1 25 times and Outcome 2 25 times on Test A, I still want it to display Outcome 1 = 1 x weight, Outcome 2 = 1 x weight). I tried to do this in Excel using pivot tables but unfortunately my dataset is actually around 1 million rows so the export didn't go well. I would love to find a way to do within SQL.
To summarize: I would like to collapse my table so that each ID has one row with all data summarized while accounting for unique test outcomes.
Thank you!
Any modern RDBMS (>= 2010?) will provide you Common Table Expressions that will help you inject the results of your first query in a second query dedicated to the grouping:
with t as
(
-- This is your query, unmodified except for the last ; removed:
SELECT
ID,
TEST,
WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '1' THEN ID END))) AS Outcome1,
WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '2' THEN ID END))) AS Outcome2,
WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '3' THEN ID END))) AS Outcome3,
WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '4' THEN ID END))) AS Outcome4
FROM
DATA
WHERE
TEST IN ('A', 'B', 'C')
GROUP BY
ID, TEST, WEIGHT
ORDER BY
ID
)
select
ID,
sum(Outcome1) Outcome1,
sum(Outcome2) Outcome2,
sum(Outcome3) Outcome3,
sum(Outcome4) Outcome4
from t
group by ID
order by ID;
This will help you build your solution incrementally:
you can comment the last select
and place a select * from t
to verify the results of the intermediate query, before unleashing the definitive version of the final one,
thus secure every "good" query into its own alias (even isolating interesting attempts in subqueries that you don't want to use in the final result, but keep for posterity),
and chain the steps to get your final report.
In essence, they are like temporary tables, that vanish outside of the WITH
.