Tags: python, or-tools, operations-research, cp-sat

Could this CP-SAT model be faster?


My team is building a CP-SAT solver that schedules assignments (think homework) over a period of days with variable availability (time available to do assignments). We're trying to speed up our model.

We've tried num_search_workers and other parameter tuning but want to check for other speed increases. The aim is to solve problems with ~100 days and up to 2000 assignments in 5-10 seconds (benchmarked on an M1 Mac). Any ideas?

Problem Description: Place a set of assignments across d days while respecting the requirements below.

Solving slows dramatically with the number of days and assignments. This is expected, but we'd like to know if you can suggest possible speedups.

Here's an example unit test; hopefully it shows the splitting, ordering, and time constraints.

days = [{"secondsAvailable": 1200}, {"secondsAvailable": 1200}, {"secondsAvailable": 1200}, {"secondsAvailable": 1200}]
assignments = [
    {"id": 1, "resourceType": "Type0", "seconds": 2400, "deps": [], "instances": 2},
    {"id": 2, "resourceType": "Type0", "seconds": 1200, "deps": [1], "instances": 1},
    {"id": 3, "resourceType": "Type0", "seconds": 1200, "deps": [1, 2], "instances": 1},
    ]
result = cp_sat.CP_SAT_FAST.schedule(days, assignments, options=solver_options)
# expect a list of lists where each inner list is a day with the included assignments
expected = shared.SolverOutput(feasible=True, solution=[
    [{"id": 1, "resourceType": "Type0", "time": 1200, "instances": 2}],
    [{"id": 1, "resourceType": "Type0", "time": 1200, "instances": 2}],
    [{"id": 2, "resourceType": "Type0", "time": 1200, "instances": 1}],
    [{"id": 3, "resourceType": "Type0", "time": 1200, "instances": 1}],
    ])
self.assertEqual(result, expected)

And here's the solver:

import math
from typing import List, Dict

from ortools.sat.python import cp_model
import numpy as np

import planner.solvers as solvers
from planner.shared import SolverOutput, SolverOptions


class CP_SAT_FAST(solvers.Solver):
    """
    CP_SAT_FAST is a CP_SAT solver with speed optimizations and a time limit (passed in through options).
    """

    @staticmethod
    def schedule(days: List[Dict], assignments: List[Dict], options: SolverOptions) -> SolverOutput:
        """
        Schedules a list of assignments on a studyplan of days

        Arguments:
        days: list of dicts containing available time for that day
        assignments: list of assignments to place on schedule
        """

        model = cp_model.CpModel()

        num_assignments = len(assignments)
        num_days = len(days)

        # x[d, a] indicates whether assignment a is on day d
        x = np.zeros((num_days, num_assignments), cp_model.IntVar) 

        # used for resource diversity optimization
        total_resource_types = 4
        unique_today = []

        # upper and lower bounds used for dependency ordering (if a needs b then b must be before or on the day of a)
        day_ub = {}
        day_lb = {}

        # track assignment splitting
        instances = {}
        assignment_times = {}

        id_to_assignment = {}
        for a, asm in enumerate(assignments):

            # track upper and lower bounds
            day_ub[a] = model.NewIntVar(0, num_days, "day_ub")
            day_lb[a] = model.NewIntVar(0, num_days, "day_lb")
            asm["ub"] = day_ub[a]
            asm["lb"] = day_lb[a]
            id_to_assignment[asm["id"]] = asm

            max_instances = min(num_days, asm.get("instances", num_days))
            
            # each assignment must occur at least once
            instances[a] = model.NewIntVar(1, max_instances, f"instances_{a}")
            model.AddHint(instances[a], max_instances)

            # when split keep a decision variable of assignment time
            assignment_times[a] = model.NewIntVar(asm.get("seconds") // max_instances, asm.get("seconds"), f"assignment_time_{a}")
            model.AddDivisionEquality(assignment_times[a], asm.get("seconds"), instances[a])  

        for d in range(num_days):

            time_available = days[d].get("secondsAvailable", 0)
            if time_available <= 0:
                # no assignments on zero-time days
                model.Add(sum(x[d]) == 0)

            else:
                
                # track resource diversity on this day
                type0_today = model.NewBoolVar(f"type0_on_{d}")
                type1_today = model.NewBoolVar(f"type1_on_{d}")
                type2_today = model.NewBoolVar(f"type2_on_{d}")
                type3_today = model.NewBoolVar(f"type3_on_{d}")
                types_today = model.NewIntVar(0, total_resource_types, f"unique_on_{d}")
                
                task_times = []

                for a, asm in enumerate(assignments):

                    # x[d, a] = True if assignment a is on day d
                    x[d, a] = model.NewBoolVar(f"x[{d},{a}]")
                    
                    # set assignment upper and lower bounds for ordering
                    model.Add(day_ub[a] >= d).OnlyEnforceIf(x[d, a])
                    model.Add(day_lb[a] >= (num_days - d)).OnlyEnforceIf(x[d, a])
                    
                    # track if a resource type is on a day for resource diversity optimization
                    resourceType = asm.get("resourceType")
                    if resourceType == "Type0":
                        model.AddImplication(x[d, a], type0_today)
                    elif resourceType == "Type1":
                        model.AddImplication(x[d, a], type1_today)
                    elif resourceType == "Type2":
                        model.AddImplication(x[d, a], type2_today)
                    elif resourceType == "Type3":
                        model.AddImplication(x[d, a], type3_today)
                    else:
                        raise RuntimeError(f"Unknown resource type {asm.get('resourceType')}")

                    # keep track of task time (considering splitting) for workload requirements
                    task_times.append(model.NewIntVar(0, asm.get("seconds"), f"time_{a}_on_{d}"))
                    model.Add(task_times[a] == assignment_times[a]).OnlyEnforceIf(x[d, a])

                # time assigned to day d cannot exceed the day's available time
                model.Add(time_available >= sum(task_times))

                # sum the unique resource types on this day for later optimization
                model.Add(sum([type0_today, type1_today, type2_today, type3_today]) == types_today)
                unique_today.append(types_today)


        """
        Resource Diversity:

        Keeps track of what instances of a resource type appear on each day
        and the minimum number of unique resource types on any day. (done above ^)
        
        Then the model objective is set to maximize that minimum
        """
        total_diversity = model.NewIntVar(0, num_days * total_resource_types, "total_diversity")
        model.Add(sum(unique_today) == total_diversity)

        avg_diversity = model.NewIntVar(0, total_resource_types, "avg_diversity")
        model.AddDivisionEquality(avg_diversity, total_diversity, num_days)

        # Set objective
        model.Maximize(avg_diversity)


        # Assignment Occurrence/Splitting and Dependencies
        for a, asm in enumerate(assignments):
            
            # track how many times an assignment occurs (since we can split)
            model.Add(instances[a] == sum(x[d, a] for d in range(num_days))) 

            # Dependencies 
            for needed_asm in asm.get("deps", []):
                needed_ub = id_to_assignment[needed_asm]["ub"]
                
                # this asm's lower bound must be greater than or equal to the upper bound of the dependency
                model.Add(num_days - asm["lb"] >= needed_ub)

        # Solve
        solver = cp_model.CpSolver()

        # set time limit
        solver.parameters.max_time_in_seconds = float(options.time_limit)
        solver.parameters.preferred_variable_order = 1
        solver.parameters.initial_polarity = 0
        # solver.parameters.stop_after_first_solution = True
        # solver.parameters.num_search_workers = 8

        intermediate_printer = SolutionPrinter()
        status = solver.Solve(model, intermediate_printer)


        print("\nStats")
        print(f"  - conflicts       : {solver.NumConflicts()}")
        print(f"  - branches        : {solver.NumBranches()}")
        print(f"  - wall time       : {solver.WallTime()}s")
        print()

        if status == cp_model.OPTIMAL or status == cp_model.FEASIBLE:
            sp = []

            for i, d in enumerate(days):
                day_time = 0
                days_tasks = []
                for a, asm in enumerate(assignments):
                    if solver.Value(x[i, a]) >= 1:
                        asm_time = math.ceil(asm.get("seconds") / solver.Value(instances[a]))
                        day_time += asm_time

                        days_tasks.append({"id": asm["id"], "resourceType": asm.get("resourceType"), "time": asm_time, "instances": solver.Value(instances[a])})
                
                sp.append(days_tasks)

            return SolverOutput(feasible=True, solution=sp)
            
        else:
            return SolverOutput(feasible=False, solution=[])


class SolutionPrinter(cp_model.CpSolverSolutionCallback):

    def __init__(self):
        cp_model.CpSolverSolutionCallback.__init__(self)
        self.__solution_count = 0

    def on_solution_callback(self):
        print(f"Solution {self.__solution_count} objective value = {self.ObjectiveValue()}")
        self.__solution_count += 1


Solution

  • Before answering your actual question I want to point out a few things in your model that I suspect are not working as you intended.

    The constraints on the assignment types present on a given day

    model.AddImplication(x[d, a], type0_today)
    

    etc., do enforce that type0_today == 1 if there is an assignment of that type on that day. However, they do not enforce that type0_today == 0 if there are no assignments of that type on that day. The solver is still free to choose type0_today == 1, and it will do so, because that satisfies the constraint and also directly increases the objective function. You will probably find in the optimal solution to the test case you gave that all of the type0_today through type3_today variables are 1 and that avg_diversity == 4, even though the input data contains no assignments of any type but Type0. In the early stages of modelling, it's always a good idea to check the values of all the variables in the model for plausibility.

    Since I don't have a Python installation, I translated your model to C# to be able to do some experiments; sorry, you'll have to translate back into the equivalent Python code. I reformulated the constraint on the type0_today variables to use an array type_today[d, t] (for day d and type t) and the AddMaxEquality constraint, which for Boolean variables is equivalent to the logical OR of all the participating variables:

    // For each day...
    for (int d = 0; d < num_days; d++)
    {
        // ... make a list for each assignment type of all x[d, a] where a has that type.
        List<IntVar>[] assignmentsByType = new List<IntVar>[total_resource_types];
        for (int t = 0; t < total_resource_types; t++)
        {
            assignmentsByType[t] = new List<IntVar>();
        }
        for (int a = 0; a < num_assignments; a++)
        {
            int t = getType(assignments[a].resourceType);
            assignmentsByType[t].Add(x[d, a]);
        }
        // Constrain the types present on the day to be the logical OR of assignments with that type on that day
        for (int t = 0; t < total_resource_types; t++)
        {
            if (assignmentsByType[t].Count > 0)
            {
                model.AddMaxEquality(type_today[d, t], assignmentsByType[t]);
            }
            else
            {
                model.Add(type_today[d, t] == 0);
            }
        }
    }

    You compute the average diversity as

            avg_diversity = model.NewIntVar(0, total_resource_types, "avg_diversity")
            model.AddDivisionEquality(avg_diversity, total_diversity, num_days)
    

    Since the solver only works with integer variables, avg_diversity will be exactly one of the values 0, 1, 2, 3 or 4 with no fractional part. The constraint AddDivisionEquality will also ensure that total_diversity is an exact integer multiple of both avg_diversity and num_days. This is a very strong restriction on the solutions and will lead to infeasibility in many cases that I don't think you intended.

    For example, avg_diversity == 3, num_days == 20 and total_diversity == 60 would be an allowed solution, but total_diversity == 63 would not be allowed, although there are three days in that solution with higher diversity than in the one with total_diversity == 60.

    Instead, I recommend that you eliminate the variable avg_diversity and its constraint and simply use total_diversity as your objective function. Since the number of days is a fixed constant during the solution, maximizing the total diversity will be equivalent without introducing artificial infeasibilities.

    That said, here is my answer.

    Generic constraint satisfaction problems are in general NP problems and should not be expected to scale well. Although many specific problem formulations can actually be solved quickly, small changes in the input data or the formulation can push the problem into a black hole of exponentiality. There is really no other approach than trying out various methods to see what works best with your exact problem.

    Although it sounds paradoxical, it is easier for the solver to find optimal solutions for strongly constrained problems than for lightly constrained ones (assuming they are feasible!). The search space in a strongly constrained problem is smaller than in the lightly constrained one, so the solver has fewer choices about what to experiment with to optimize and therefore completes the job faster.

    First suggestion

    In your problem, you have variables day_ub and day_lb for each assignment. These have a range from 0 to num_days. The constraints on them

                        model.Add(day_ub[a] >= d).OnlyEnforceIf(x[d, a])
                        model.Add(day_lb[a] >= (num_days - d)).OnlyEnforceIf(x[d, a])
    

    allow the solver the freedom to choose any value between 0 and the largest d, or the largest (num_days - d), respectively (inclusive). During the optimization, the solver probably spends time trying out different values for these variables but rarely discovers that this leads to an improvement; that would happen only when the placement of a dependent assignment changes.

    You can eliminate the variables day_ub and day_lb and their constraints and instead formulate the dependencies directly with the x variables.

    In my C# model I reformulated the assignment dependency constraint as follows:

    for (int a = 0; a < num_assignments; a++)
    {
        Assignment assignment = assignments[a];
        foreach (int predecessorIndex in getPredecessorAssignmentIndicesFor(assignment))
        {
            for (int d1 = 0; d1 < num_days; d1++)
            {
                for (int d2 = 0; d2 < d1; d2++)
                {
                    model.AddImplication(x[d1, predecessorIndex], x[d2, a].Not());
                }
            }
        }
    }

    In words: if an assignment B (predecessorIndex) on which assignment A (a) depends is placed on day d1, then all x[d2, a] with d2 < d1 must be false. This relates the dependencies directly through the x variables instead of introducing helper variables with additional freedom that bog down the solver. This change reduces the number of variables in the problem and increases the number of constraints, both of which help the solver.

    In an experiment I did with 25 days and 35 assignments, checking the model stats showed

    Original:

    #Variables: 2020
    #kIntDiv:   35
    #kIntMax:   100
    #kLinear1:  1750
    #kLinear2:  914
    #kLinearN:  86
    Total constraints   2885
    

    New formulation:

    #Variables: 1950
    #kBoolOr:   11700
    #kIntDiv:   35
    #kIntMax:   100
    #kLinear2:  875
    #kLinearN:  86
    Total constraints   12796
    

    So the new formulation has fewer variables but far more constraints.

    The solution times in the experiment improved markedly: the solver took only 2.6 s to reach total_diversity == 68 instead of over 90 s.

    Original formulation

    Time (s)  Objective
    0.21      56
    0.53      59
    0.60      60
    0.73      61
    0.75      62
    0.77      63
    2.90      64
    3.82      65
    3.84      66
    91.01     67
    91.03     68
    91.05     69
    

    New formulation

    Time (s)  Objective
    0.2347    41
    0.3066    42
    0.4252    43
    0.4602    44
    0.5014    49
    0.6437    50
    0.6777    51
    0.6948    52
    0.7108    53
    0.9593    54
    1.0178    55
    1.1535    56
    1.2023    57
    1.2351    58
    1.2595    59
    1.2874    60
    1.3097    61
    1.3325    62
    1.388     63
    1.5698    64
    2.4948    65
    2.5993    66
    2.6198    67
    2.6431    68
    32.5665   69
    

    [Diagram of objective improvement vs. time]

    Of course, the solution times you get will be strongly dependent on the input data.

    Second suggestion

    During my experiments I observed that solutions are found much more quickly when the assignments have a lot of dependencies. This is consistent with more highly constrained models being easier to solve.

    If you often have assignments of the same type and duration (like numbers 2 and 3 in your test data) that both have instances == 1 and either no dependencies or the same ones, then exchanging their positions in the solution will not improve the objective.

    In a pre-processing step you could look for such duplicates and make one of them dependent on the other. This is essentially a symmetry-breaking constraint. This will prevent the solver from wasting time with an attempt to see if exchanging their positions would improve the objective.
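    A possible preprocessing sketch for this symmetry breaking (my own illustration, not code from the original model; it treats two assignments as interchangeable when they match on type, duration, and dependencies and each have a single instance):

```python
# Chain interchangeable assignments so the solver never tries swapping them.
def add_symmetry_breaking_deps(assignments):
    seen = {}  # (type, seconds, deps) -> id of the last matching assignment
    for asm in assignments:
        if asm.get("instances", 1) != 1:
            continue
        key = (asm["resourceType"], asm["seconds"],
               tuple(sorted(asm.get("deps", []))))
        if key in seen:
            # Duplicate found: force it after its twin. Exchanging the two
            # cannot improve the objective, so no distinct solution is lost.
            asm.setdefault("deps", []).append(seen[key])
        seen[key] = asm["id"]
    return assignments
```
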

    Third suggestion

    The solution needs to determine how many instances of each assignment will be present. That requires two variables for each assignment, instances[a] and assignment_times[a], with an associated constraint.

    Instead, you could get rid of the variables instances[a] and assignment_times[a] and split assignments with instances > 1 into multiple assignments in a preprocessing step. For example, in your test data, assignment 1 would be split into two assignments 1_1 and 1_2, each having instances == 1 and seconds == 1200. For this test case, where instances == 2 for assignment 1, this has no effect on the final solution: maybe the solver schedules 1_1 and 1_2 on the same day, maybe not, but the result is equivalent to the original splitting without needing the extra variables.

    In the preprocessing step, when an assignment is split, you should add symmetry breaking constraints to make 1_2 dependent on 1_1, etc., for the reasons mentioned above.
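    The splitting and chaining described above could be sketched as a preprocessing step like this (the piece ids such as 1_1 follow the naming in the text; the helper itself is my invention):

```python
import math

# Expand each splittable assignment into single-instance pieces, chaining
# each piece onto the previous one as a symmetry-breaking dependency.
def split_assignments(assignments):
    result = []
    for asm in assignments:
        n = asm.get("instances", 1)
        if n <= 1:
            result.append(asm)
            continue
        piece_seconds = math.ceil(asm["seconds"] / n)
        prev_id = None
        for i in range(1, n + 1):
            piece = dict(asm, id=f'{asm["id"]}_{i}', instances=1,
                         seconds=piece_seconds, deps=list(asm.get("deps", [])))
            if prev_id is not None:
                piece["deps"].append(prev_id)  # chain 1_2 after 1_1, etc.
            prev_id = piece["id"]
            result.append(piece)
    return result
```

    Note that any other assignment depending on the original id would still need its deps remapped to the last piece; the sketch leaves that out.
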

    When an assignment has instances > 2, splitting it into multiple assignments before the run actually changes the model. For example, if instances == 3 and seconds == 2400, you cannot get a solution in which the assignment is split over two days with 1200 s each; the solver will always schedule 3 assignments of 800 s each.

    So this suggestion is actually a change to the model and you'll have to determine if that is acceptable or not.

    The total diversity will usually be helped by having more instances of an assignment to place, so the change may not have large practical consequences. It would also allow scheduling 2/3 of an assignment on one day and the remaining 1/3 on another day, so it even adds some flexibility.

    But this may or may not be acceptable in terms of your overall requirements.

    In all cases, you'll have to test changes with your exact data to see if they really result in an improvement or not.

    I hope this helps (and that this is a real-world problem and not a homework assignment, as I did spend a few hours investigating...).