I have c. 13m records that look something like;
Cust_ID | Order |
---|---|
123 | a1 |
123 | a1 |
123 | a2 |
456 | a2 |
789 | b2 |
789 | b3 |
I want to take a sample of say 2m but want to avoid breaking up any cusotmer records (i.e if 123 is in the sample all three records need to be included).
I tried sample but this has the issue that customer 123 might only have 1 record when I need all three. I also tried using QUALIFY altering the following found on antoher question;
SELECT * FROM TABLE1 QUALIFY ROW_NUMBER() OVER ( PARTITION BY CUST_ID ORDER BY NULL) <= 50;
However this 50 records within a group.
I'm kind of thinking I need a subquery but I'm drawing a blank. Any help
This is probably clearest to do in stages:
.
SELECT t.* FROM Table1 t JOIN
(SELECT Cust_ID FROM
(SELECT Cust_ID, cnt, RANDOM(-2147483648,2147483647) as rnd FROM
(SELECT Cust_ID, COUNT(*) as cnt FROM TABLE1 GROUP BY 1) RowCountByID
) RandomizedList
/* Cumulative sum not including current row */
QUALIFY COALESCE(SUM(cnt) OVER (ORDER BY rnd ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),0) < 2000000
) SampleIDs
ON SampleIDs.Cust_ID = t.Cust_ID;