sqlsql-serverpartitioningcumulative-sumsql-server-2022

How to partition into batch of records with max batch size limit


I want to batch the students into maximum batch size of 100, partitioned by School. Each teacher can have 12 students. Students of any teacher should not be split into different batches. No. of students in a batch can be less than 100 but not more.

Fiddle URL

With the query used in the fiddle, for the second batch the number students are more than 100 (RowNum 17-32). How can I ensure the students are not split into different batches and there are no more than 100 students in a batch.

3rd party edit

This sample shows rows with RowNum > 14. Some have a CumulativeCount over 100 and a Batchnumber of 2.

RowNum greater 14

The inner query is nearly the same only order by was moved to the CTE.

WITH CTE AS (
    SELECT School, TeacherName,
         ROW_NUMBER() OVER(PARTITION BY SCHOOL ORDER BY sCHOOL) 
         AS RowNum,
         SUM(COUNT(*)) OVER(PARTITION BY School ORDER BY TeacherName) 
         AS CumulativeCount,
         SUM(COUNT(*)) OVER(PARTITION BY School 
                       ORDER BY TeacherName)/100+1 
         AS BatchNumber
    FROM mytable
GROUP BY School, TeacherName )
SELECT * FROM CTE WHERE  RowNum > 14
ORDER BY School

Solution

  • With the fiddle (and a better explanation) available, it becomes clear that what you have is a (somewhat) simplified version of the Knapsack Problem.

    As such, it cannot be solved without recursion, unfortunately. This means that, on larger datasets, you might run into resource exhaustion issues.

    The following query builds upon your fiddle:

    declare @BatchSize int = 100;
    
    with src as (
        -- Source query
        select t.School, t.TeacherName,
            count(*) as [GroupSize],
            dense_rank() over(partition by t.School order by t.TeacherName) as [RN]
        from #mytable t
        group by t.School, t.TeacherName
    ),
    cte as (
        -- Anchor part - start of the first batch for each school
        select s.School, s.TeacherName, s.GroupSize, s.RN,
            s.GroupSize as [BatchTotal], 1 as [IsBatchStart]
        from src s
        where s.RN = 1
        union all
        -- Recursive part - deciding if current batch can be continued or not
        select s.School, s.TeacherName, s.GroupSize, s.RN,
            case
                when c.BatchTotal + s.GroupSize > @BatchSize then s.GroupSize
                else c.BatchTotal + s.GroupSize
            end as [BatchTotal],
            case
                when c.BatchTotal + s.GroupSize > @BatchSize then 1
                else 0
            end as [IsBatchStart]
        from src s
            inner join cte c on c.School = s.School
                and s.RN = c.RN + 1
    )
    select c.*,
        sum(c.IsBatchStart) over(partition by c.School order by c.RN) as [BatchNumber]
    from cte c
    order by c.School, c.RN
    -- Infinite iteration - might be needed on large datasets
    -- option (maxrecursion 0)
    

    Also note that dense_rank() is a rather expensive function, performance-wise. On a larger scale I'd suggest to cache the src subquery into a temporary table with suitable indices, and then use something more lightweight, like row_number(), to number the rows.