sqlgroup-byteradatasample

TERADATA SQL: RANDOM SAMPLE AVOID BREAKING GROUPS


I have c. 13m records that look something like;

Cust_ID Order
123 a1
123 a1
123 a2
456 a2
789 b2
789 b3

I want to take a sample of say 2m but want to avoid breaking up any cusotmer records (i.e if 123 is in the sample all three records need to be included).

I tried sample but this has the issue that customer 123 might only have 1 record when I need all three. I also tried using QUALIFY altering the following found on antoher question;

SELECT * FROM TABLE1 QUALIFY ROW_NUMBER() OVER ( PARTITION BY CUST_ID ORDER BY NULL) <= 50;

However this 50 records within a group.

I'm kind of thinking I need a subquery but I'm drawing a blank. Any help


Solution

  • This is probably clearest to do in stages:

    1. Make a list of Cust_IDs and corresponding row counts.
    2. Randomize the list.
    3. Use a cumulative sum to determine when we meet (or slightly exceed) the desired sample size.

    .

    SELECT t.* FROM Table1 t JOIN
        (SELECT Cust_ID FROM 
            (SELECT Cust_ID, cnt, RANDOM(-2147483648,2147483647) as rnd FROM
                (SELECT Cust_ID, COUNT(*) as cnt FROM TABLE1 GROUP BY 1) RowCountByID
            ) RandomizedList
        /* Cumulative sum not including current row */
        QUALIFY COALESCE(SUM(cnt) OVER (ORDER BY rnd ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),0) < 2000000
        ) SampleIDs
      ON SampleIDs.Cust_ID = t.Cust_ID;