Tags: snowflake-cloud-data-platform, bucket, batching, bucketing

Bucket records into batches of a certain size in Snowflake


What would be the best way to bucket records into batches of a predefined size? I would like to tag each record with a batch/bucket number for further processing.

For example, let's say I have 1110 records and the batch/bucket size is 200; I should end up with either:

  • 5 batches of 200 records plus a final batch of 110 records, or
  • 6 batches of 185 records each.

The background/context: each batch is then processed by an external service, which has a maximum allowed batch size (and that maximum varies per service).


Solution

  • I figured out a way to do it using the NTILE window function:

    WITH records AS (
      SELECT RANDOM() AS value
      FROM TABLE (generator(rowcount => 1110))
    ),
    batches AS (
      SELECT
        value,
        -- CEIL yields enough buckets that none exceeds the maximum
        -- batch size: CEIL(1110 / 200) = 6 buckets of 185 rows each
        NTILE(
          CEIL((SELECT COUNT(*) FROM records) / 200)
        ) OVER (
          ORDER BY NULL
        ) AS batch
      FROM records
    )
    SELECT batch, COUNT(*)
    FROM batches
    GROUP BY ALL;
    

    Result: 6 batches of 185 records each (1110 / 6), all within the 200-record limit:

    BATCH | COUNT(*)
    ------+---------
        1 |      185
        2 |      185
        3 |      185
        4 |      185
        5 |      185
        6 |      185
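
    To tag the rows of a real table instead of generated ones, the same pattern applies. A minimal sketch, assuming a hypothetical table my_records with an id column and a per-service limit of 200 (adjust the names and the limit to your schema and service):

    SELECT
      id,
      NTILE(
        -- enough buckets that none exceeds the service's limit
        CEIL((SELECT COUNT(*) FROM my_records) / 200)
      ) OVER (
        ORDER BY NULL
      ) AS batch
    FROM my_records;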


    My first approach was to use ROW_NUMBER(), which carries a bit more overhead, although the batch sizes come out the same (the batch numbers simply run from 0 instead of 1):

    WITH records AS (
      SELECT RANDOM() AS value
      FROM TABLE (generator(rowcount => 1110))
    ),
    batches AS (
      SELECT
        value,
        ROW_NUMBER() OVER (
          ORDER BY NULL
        ) AS rn,
        -- deal rows round-robin into CEIL(1110 / 200) = 6 batches,
        -- numbered 0 through 5, each ending up with 185 rows
        (rn % CEIL((SELECT COUNT(*) FROM records) / 200)) AS batch
      FROM records
    )
    SELECT batch, COUNT(*)
    FROM batches
    GROUP BY ALL;
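
    Both approaches above produce batches of (nearly) equal size. To get the other acceptable shape from the question instead, contiguous batches filled to exactly the maximum (5 batches of 200 records plus a final batch of 110), one simple sketch along the same lines is to integer-divide the row number by the batch size:

    WITH records AS (
      SELECT RANDOM() AS value
      FROM TABLE (generator(rowcount => 1110))
    ),
    batches AS (
      SELECT
        value,
        -- rows 1-200 land in batch 0, rows 201-400 in batch 1, ...,
        -- and the remaining rows 1001-1110 (the last 110) in batch 5
        FLOOR((ROW_NUMBER() OVER (ORDER BY NULL) - 1) / 200) AS batch
      FROM records
    )
    SELECT batch, COUNT(*)
    FROM batches
    GROUP BY ALL;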