This is a real-world problem that has historically been solved via thousands of manual iterations.
Please forgive my inexperience with Stack Overflow and with formatting this question. Dave (correctly, I believe) rewrote my question in the comments, and I quote him below. I would also thank him if I could figure out how.
This is likelier to get a good response if you make it more focused, if you define terms before you use them, and if you clarify what the inputs are. E.g., "A data point is a container which has a positive integer called 'value', an attribute called 'A' which contains an integer, and attributes B, C, D, each of which contains integer-float pairs where the floats for each attribute sum to 1.0. My input is about 20k of these data points, and my goal is to find new values for the points, leaving everything else unchanged, which maximizes (new - old value)." – Dave
I have a collection of roughly 20,000 datapoints (Xi). Each datapoint contains exactly one value greater than zero. Each datapoint also has 4 attributes. For attribute A there are 99 possible categories, and a datapoint may belong to only one. The remaining three attributes may split the value across multiple categories of the attribute. For example: 80% of the value of Xi belongs to category 2 of attribute B while the remaining 20% belongs to category 5 of attribute B.
I also have the previous values of each data point for several years (one value per year).
| Point | Prev Value | New Value | Diff | Attr A | Attr B | Attr C | Attr D |
|---|---|---|---|---|---|---|---|
| X1 | 68 | 72 | 4 | 1: 100% | 2: 80%, 5: 20% | 3: 100% | 7: 90%, 9: 10% |
| X2 | 56,000 | 66,000 | 10,000 | 7: 100% | 1: 50%, 5: 50% | 3: 90%, 6: 10% | 2: 100% |
I need to solve for the New Value column subject to the following constraints:
New Value must be positive
Sum of all (Xi) = Known Value
Sum of all (Xi) * percent allocated to category per attribute = Target Value
Sum of all Target Values per category = Known Value
Attr A
A1: Target Value
A2: Target Value
...
A99: Target Value
A1 + A2 +... A99 = Known Value
Attr B
B1: Target Value
B2: Target Value
...
B27: Target Value
B1 + B2 +... B27 = Known Value
Attr C
C1: Target Value
C2: Target Value
...
C18: Target Value
C1 + C2 +... C18 = Known Value
Attr D
D1: Target Value
D2: Target Value
...
D8: Target Value
D1 + D2 +... D8 = Known Value
Soft Constraints
The value in the Diff column should be positive
Smaller values should have more freedom to change than larger values
Ultimate goal
Seek to minimize the value in the Diff column by row. When looking at the percent change from the previous value, try to distribute the change evenly across rows rather than have it all attributed to a single row.
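To restate the hard constraints and the goal a bit more compactly (notation introduced here only for this restatement): let $x_i$ be the new value of datapoint $X_i$, $K$ the known grand total, $p^{B}_{ij}$ the fraction of $X_i$ allocated to category $j$ of attribute B (and similarly for A, C, and D), and $T^{B}_j$ the target for that category. Then

$$
x_i > 0, \qquad \sum_i x_i = K, \qquad \sum_i p^{B}_{ij}\, x_i = T^{B}_j \;\;\forall j, \qquad \sum_j T^{B}_j = K,
$$

with the analogous equations for attributes A, C, and D, while the soft goal is to keep each difference $x_i - x_i^{\text{old}}$ small, nonnegative, and spread as evenly as possible across the datapoints.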
As stated previously, this problem is solved once per year via manual iteration. We have developed a workflow that first populates a starting value by converting the previous value to a percent of the total and multiplying that by the new known value, which is always larger than the previous grand total. We use a BI tool to view all of the total differences between the starting values and the target values at the same time. From there, manual iterations of the form [subtract 10,000 from X34 and add 10,000 to X452] are added to a running list, and the BI tool is updated to show the new status. That running list can be many thousands of lines long. Also, by "solving" I mean that we can eventually generate a solution that fits all of the constraints, but we are well aware that it is a solution and not the best solution.
We've made several attempts to automate the iteration portion of the process via Python with some degree of success. We've also talked with reps from MATLAB who were confident that they could get close using fmincon, and we may still pursue that as a solution, but I would like to investigate alternatives.
What I am requesting in this post is not necessarily a solution (although I would accept one) but links to resources on similar problems (where a change to one element results in changes to others), or possibly assistance in defining the problem more clearly in mathematical terms. I have looked into several optimization and genetic algorithms, but nothing seems to match what I'm after.
- "Sum of all (Xi) * percent allocated to category per attribute = Target Value" Is there one target value for the whole problem, or category specific target values, or something else?
Each category of each attribute has a target value. For attribute D there are 8 target values. The total sum of those 8 values is equal to the total sum of the 18 target values of attribute C (or of any of the attributes). It is a closed system. There are a total of 99 + 27 + 18 + 8 = 152 target values. In reality some of those target values are more important than others. Attribute B specifically must match all 27 targets, while only 5 of the 18 targets in attribute C are critical.
- What are the parameters you're optimizing? E.g. can you change the percentage allocations between categories? – Nick ODell
In reality the percentage could vary slightly. Rather than 80/20 it could be 75/25. Honestly, we have always held those splits constant to try to simplify an already complicated problem. What you can't do is add a category that doesn't exist, for instance changing 80/20 to 70/20/10. The parameter we would like to optimize (minimize) is the difference between the previous value and the new value. At the same time, the ideal solution would also be smooth. Given two solutions: in the first, the majority of the datapoints change a very small amount except for one single data point, which changes drastically; in the second, all of the datapoints change by a larger amount, but no single point changes drastically. The second solution would be preferred.
Forgive me if I've missed any posted above, but it looks like all the constraints so far are linear. If the objective can be to minimize the sum of the absolute values of the differences, then this can be expressed as a linear programming problem and solved with scipy.optimize.linprog or scipy.optimize.milp, which scale to large problems much better than minimize. If you are interested in this approach, I can post an example that can help you formulate your problem similarly. – Matt Haberland
@Matt Yes, I would be very interested in seeing how you would formulate this for solving in SciPy. Possibly incorporating the suggestions of Cary Swoveland & Nick ODell as well.
Note: I don't have more time to polish this. Even if it's hard to understand on first read, I hope it at least gives you a sense of whether you want to pursue automation of this solution yourself or if you want to hire someone to do it.
The advantage of linear programming, if it is flexible enough for your problem, is that the solution can be computed very efficiently and you are essentially guaranteed (barring unusual numerical difficulties) to get either a globally optimal solution or a certificate of infeasibility/unboundedness.
`scipy.optimize.milp` solves linear and mixed-integer linear programming problems, and has a cleaner interface with a tighter connection to the underlying solver (HiGHS) than `scipy.optimize.linprog`, so I'd suggest using that. All of the code below is written with regular NumPy arrays, but you'd really want to use sparse arrays because the arrays will be large and many elements will be zero.
I have a collection of roughly 20,000 datapoints (Xi).
In the language of optimization, the new values of the datapoints will be "decision variables". Let's say you have `mx` of them.
It sounds like your "target values" Aj, Bj, Cj, and Dj are also variables that need to be chosen by the solver, since you write constraints with them, so these would be "decision variables", too. Let's say you have `ma=99`, `mb=27`, `mc=18`, and `md=8` of each of these, respectively.
So for now, let's say you have `m = mx + ma + mb + mc + md` decision variables, and we denote all of these decision variables together as `x`.
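For concreteness, here is a minimal sketch of those sizes and of the variable layout assumed in the snippets below (the ordering of the blocks within `x` is my own convention):

```python
mx = 20_000                      # number of datapoints (their new values)
ma, mb, mc, md = 99, 27, 18, 8   # number of target values per attribute
m = mx + ma + mb + mc + md       # total number of decision variables (so far)

# Assumed layout of the decision vector x:
#   x[:mx]          -> the datapoint values Xi
#   x[mx:mx + ma]   -> the Aj targets, followed by the Bj, Cj, and Dj blocks
```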
New Value must be positive
By default, all decision variables are constrained to be non-negative. There is no way to specify a strict inequality constraint ("positive"), but you can represent inclusive bounds with a `Bounds` object to ensure that the decision variables satisfy `lb <= x <= ub`.
```python
from scipy.optimize import Bounds

# vector of size `m` of lower bounds, one for each decision variable;
# may be a small nonzero number to ensure that values are strictly positive
lb = ...
# vector of size `m` of upper bounds, one for each decision variable;
# may be `np.inf` to indicate "no upper bound"
ub = ...
bounds = Bounds(lb, ub)
```
(Another strategy is to just hope that all the values are nonzero in the optimal solution, and add bounds later if you find that some values are exactly zero.)
Sum of all (Xi) = Known Value
We'll specify this with a `LinearConstraint`, which represents constraints of the form `lb <= A @ x <= ub`.
```python
from scipy.optimize import LinearConstraint

A1 = np.zeros((1, m))
A1[0, :mx] = 1  # you only want to sum your Xi here, not your Aj, Bj, etc...
b1 = np.asarray([...])  # your known value as a vector with one element
constraint1 = LinearConstraint(A1, lb=b1, ub=b1)
```
Sum of all (Xi) * percent allocated to category per attribute = Target Value
In reality the percentage could vary slightly. Rather than 80/20 it could be 75/25. Honestly, we have always held those splits constant to try to simplify an already complicated problem.
I'm going to assume that these percentages are specified. If not, these percentages would become decision variables, your constraints would involve the product of decision variables, and this would make your problem nonlinear (in the decision variables). Nonlinear problems cannot be solved with `linprog` or `milp`. The exception is that we can assign the datapoints to a single category of A using binary decision variables. I'll address that at the end, but assume for now that you have made all the assignments.
Here is how I think you would specify this constraint in the form `lb <= A @ x <= ub`.
```python
# Matrix with 152 rows and `mx` columns specifying percentages in each category.
# Assume for now that you will manually assign datapoints to categories for
# attribute A; we'll consider that later.
A2a = ...
# To ensure that the sums of these Xi * percentages equal the target values,
# we subtract the target value...
A2b = -np.eye(ma + mb + mc + md)
A2 = np.concatenate([A2a, A2b], axis=1)
# ... and the result must be zero for each row.
b2 = np.zeros(ma + mb + mc + md)
constraint2 = LinearConstraint(A2, lb=b2, ub=b2)
```
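As a rough illustration of how `A2a` could be filled, here is a sketch assuming (my assumption, not something stated in the question) that the allocations are stored as one dictionary per datapoint, mapping a global category row index to the allocated fraction:

```python
import numpy as np

mx_toy, n_cat = 2, 8    # toy sizes; the real problem has mx=20_000 and 152 rows

# Hypothetical data layout: allocations[i] maps category row -> fraction of Xi.
allocations = [
    {0: 1.0, 2: 0.8, 3: 0.2, 5: 1.0, 6: 0.9, 7: 0.1},
    {1: 1.0, 2: 0.5, 3: 0.5, 5: 0.9, 4: 0.1, 6: 1.0},
]

A2a = np.zeros((n_cat, mx_toy))
for i, alloc in enumerate(allocations):
    for row, fraction in alloc.items():
        A2a[row, i] = fraction
```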
Sum of all Target Values per category = Known Value
Hopefully this is beginning to make sense:
```python
A3 = np.zeros((4, m))  # one constraint per attribute
A3[0, mx:mx + ma] = 1  # sum all the Aj in the zeroth row
A3[1, mx + ma:mx + ma + mb] = 1  # sum all the Bj in the next row
...  # and likewise for the Cj and Dj rows
b3 = ...  # the known (grand total) value, one entry per attribute
constraint3 = LinearConstraint(A3, lb=b3, ub=b3)
```
Smaller values should have more freedom to change than larger values
You could reformulate the problem so that, instead of using the Xi themselves as decision variables, you make the decision variables the ratios of the Xi over their old values. Let's assume you do this.
The value in the Diff column should be positive
Let's make this a hard constraint for simplicity: the lower bound associated with each of the Xi ratio decision variables is 1. (Soft constraints can be incorporated into the objective function, but that is up to your creativity.)
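Under those two choices (ratio decision variables with a hard "no decrease" requirement), the `Bounds` object from before would become something like this sketch (sizes as assumed above):

```python
import numpy as np
from scipy.optimize import Bounds

mx = 20_000
m = mx + 99 + 27 + 18 + 8

lb = np.zeros(m)
lb[:mx] = 1.0            # each ratio new/old must be at least 1 (Diff >= 0)
ub = np.full(m, np.inf)  # no upper bounds
bounds = Bounds(lb, ub)
```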
Seek to minimize the value in the Diff column by row.
Again, the decision variables are now Xi ratios, and you set the lower bound of all these ratios to be 1. The goal is now to minimize the maximum ratio. To do this, we add a single new decision variable, which we want to be greater than the maximum of the Xi ratios. We ensure this by constraining the new decision variable to be greater than each of the Xi ratios:
```python
A = np.zeros((mx, m + 1))
A[:, -1] = 1  # column corresponding with the new decision variable
i = np.arange(mx)
A[i, i] = -1  # subtract ratio `i` in row `i`
lb = np.zeros(mx)  # new variable minus each ratio must be at least zero
ub = np.full(mx, np.inf)
constraint4 = LinearConstraint(A, lb=lb, ub=ub)
```
Your objective function would become:
```python
c = np.zeros(m + 1)
c[-1] = 1  # minimize the new decision variable
```
Note: Because you've added this new decision variable, all the constraint matrices above would need an extra column. All constraint matrices need to have the same number of columns, which is the total number of decision variables.
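For example (a small illustration, not tied to any particular matrix above), an existing constraint matrix can be padded with a column of zeros so that the new auxiliary variable simply does not appear in that constraint:

```python
import numpy as np

A_old = np.zeros((3, 10))                      # any earlier (k, m) constraint matrix
A_new = np.hstack([A_old, np.zeros((3, 1))])   # now (k, m + 1); the new variable
                                               # does not enter this constraint
```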
Note 2: I made "The value in the Diff column should be positive" a hard constraint. If it is not a constraint, the way you express the objective function will need to change; it can no longer be "minimize the maximum Xi ratio" because some of the ratios will be less than 1. Presumably you also want to ensure that they are not much less than 1. You might want the objective function to become "minimize the maximum of the absolute values of the differences between the ratios and 1". I won't attempt to describe how to accommodate that here, but see the "Absolute values" and "Minimax objective" tricks in the AIMMS Modeling Guide, for instance.
Now we return to the assignment of datapoints to the categories of attribute A. Here is the basic idea. You add `mx * ma` new decision variables (yes, like 2,000,000 of them), each of which is exactly 1 (if Xi is assigned to category Aj) or exactly 0 (otherwise). You can ensure that these decision variables are "binary" using the `integrality` argument of `milp`. You add linear constraints just like above in the form of a matrix `A` with `mx` rows and `m + 1 + mx * ma` columns. Almost every element of this matrix is 0, but each row `i` contains `ma=99` ones in the columns corresponding with the binary decision variables that assign data point `i` to the categories of attribute A. You set all the elements of `lb` and `ub` to 1: the sum of all the binary decision variables corresponding with a single data point must equal 1. (This ensures that each data point is assigned to exactly one category.)
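Here is a rough sketch of that idea with toy sizes; the grouping of the binary variables by datapoint and the variable counts are my assumptions, consistent with the description above:

```python
import numpy as np
from scipy.optimize import LinearConstraint

mx_toy, ma_toy = 4, 3          # toy sizes; the real problem has mx=20_000, ma=99
m_other = 10                   # number of non-binary decision variables that come first
n = m_other + mx_toy * ma_toy  # total number of decision variables

# Each datapoint must be assigned to exactly one category of attribute A:
# the binary variables for datapoint i must sum to exactly 1.
A = np.zeros((mx_toy, n))
for i in range(mx_toy):
    start = m_other + i * ma_toy              # binaries assumed grouped by datapoint
    A[i, start:start + ma_toy] = 1
assignment = LinearConstraint(A, lb=np.ones(mx_toy), ub=np.ones(mx_toy))

# Mark only the binary block as integer for milp's `integrality` argument;
# together with bounds of [0, 1] on those variables, this makes them binary.
integrality = np.zeros(n)
integrality[m_other:] = 1
```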
There are some linear programming (LP) and mixed integer linear programming (MILP) tutorials in the SciPy User Guide, and you can find others posted on Medium and elsewhere.
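Finally, to give a sense of how the pieces plug into `milp`, here is a tiny self-contained toy with made-up numbers: two ratio variables plus the auxiliary "maximum" variable, minimizing the maximum ratio subject to a fixed sum and ratios of at least 1:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Decision variables: [r1, r2, t], where t will bound the maximum ratio.
c = np.array([0.0, 0.0, 1.0])                    # objective: minimize t

A_sum = np.array([[1.0, 1.0, 0.0]])              # r1 + r2 must equal a known total
A_max = np.array([[-1.0, 0.0, 1.0],              # t - r1 >= 0
                  [0.0, -1.0, 1.0]])             # t - r2 >= 0
constraints = [
    LinearConstraint(A_sum, lb=[2.2], ub=[2.2]),
    LinearConstraint(A_max, lb=[0, 0], ub=[np.inf, np.inf]),
]
bounds = Bounds(lb=[1, 1, 0], ub=[np.inf, np.inf, np.inf])  # ratios >= 1 ("Diff positive")

res = milp(c=c, constraints=constraints, bounds=bounds)
print(res.status, res.x)                         # expect r1 = r2 = t = 1.1
```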