I have a system of linear equations Ax=b, A - matrix (m x n), x - (n x 1), b - (m x 1). The matrix A is almost all 0 and contains only ~ n*m^(1/2) non-zero elements. m and n can be quite large, such as n=28680, m=22500 or n=28680, m=62500. It is obvious that the exact solution of the system with a high probability simply does not exist. Therefore, I need to find such a vector x0 at which the minimum of the sum( (Ax-b)^2 ) function is realized. At the same time, it is highly desirable that the vector x0 consists only of 0 and 1. Thus, I need to solve the problem sum( (Ax-b)^2 ) -> min, x from {0,1}
How did you try to solve the problem? First, I solved a simplified problem. sum( (Ax-b)^2 ) -> min, x from real The problem was quickly solved using tensoflow methods. I rounded up the resulting solution, but this resulted in very poor quality.
Then I tried to solve complete problem. The CVXPY library was suggested to me. By researching the official CVXPY website found out that it would suit me to use solver for the Mixed-integer quadratic class. I wrote a program using the SCIP solver. Since it would be too long to wait for the optimal solution, I limited the solution time to 200 seconds. The program works, but only for small m and n, for example n=4950, m=1600. If you increase m and n further, then SCIP gives a simple zero vector as a solution, although practically any vector with even one 1 is closer to the optimum. Here is the program code:
import cvxpy as cp
import numpy as np
#I tried to simulate my inputs for you.
#Here the problem manifests itself already at m, n = 1600, 4950.
np.random.seed(0)
m, n= 1600, 4950
A = np.round(np.random.rand(m, n)/(1-1/m**(1/2))/2)*255*(m**(1/2)/1800)**1.1
b = 255*(np.random.randn(m))
x = cp.Variable(n, integer=True)
objective = cp.Minimize(cp.sum_squares(A @ x - b))
constraints = []
constraints.extend([x >= 0.0, x <= 1])
prob = cp.Problem(objective,constraints)
prob.solve(solver=cp.SCIP,verbose=True,scip_params={"limits/time": 50})
print("Status: ", prob.status)
print("The optimal value is", prob.value)
print("A solution x is")
print(x.value)
print("Non-zero ",sum([i for i in x.value]))
My questions are the following:
Update: I thought it would be better to avoid functions with a discontinuous derivative, but "sum of absolute errors" seems to work better than sum of squares, at least in my example. As for the matrix A, it is ok. And, unfortunately, there is no exact solution for my problem. However, matrix A is still not noise like in the example and, I hope, can provide a solution close enough to the exact one. At the same time, I am 100% sure that the zero vector is extremely far from the optimal solution, since I checked some vectors consisting of zeros with one "1" and ~ 80% of them were closer to the optimal one. AirSquid CVXPY code works for me even with matrices up to 27000x23000 and without an exact solution! Further I'm just running out of RAM... On my real data, the option with the sum of absolute errors worked - no longer returns a null vector. But there is another problem - too many 1... Now sum(abs(Ax-b)) is less with 0 vector than with the solution that solver gives me! I don't understand why this is happening.
I ran with the assumption that "sum of absolute error" is good enough as a quality measure of the solution instead of SSE. By doing so, the model can be easily linearized and I think much more tractable by MIP solver rather than a quadratic solver. Note that both the examples below assume there is a good (actually exact in this case) solution. If the matrices are basically "noise" and you are looking for a best fit, then the solvers will almost certainly struggle a lot more as the problem is likely more degenerate.
Before looking at the solve, I think there are issues with your construction of A
above. Make a small m x n matrix and see if it is what you expect. I get a lot of duplicate, all positive values, so it is completely possible that "all zeros" for x vector is the optimal solution for a b
vector that should average ~0. I changed things around a bit with pos/neg values and the construction to get the density you want.
If you take my cvxpy
shell and switch it back to SSE, I'd be curious to see what kind of performance you get vs. linear. I couldn't get an appropriate solver working for that as I don't use cvxpy
as much.
cvxpy
is kind of a natural for problems like this that can easily be expressed in matrix format. I'm jumping through a few hoops in pyomo
to get to the same point before the solve, but it works quite well. On my 8 yr. old machine, pyomo
takes about 200 sec to construct the maximally sized problem (as you see in results)
my cvxpy
solution below works fine for smaller instances but barfs "infeasible" for matrices of 20k x 20k or such, so there is maybe something internal there that I don't understand, as it is clearly feasible. Perhaps somebody with a better knowledge of cvxpy
can tinker and see what is up.
import cvxpy as cp
import numpy as np
from time import time
#I tried to simulate my inputs for you.
#Here the problem manifests itself already at m, n = 1600, 4950.
np.random.seed(0)
m, n= 5_000, 5_000 # 20_000, 20_000 <--- barfs
A_base = (np.random.randint(-1000, 1000, (m, n)))
marks = np.random.binomial(n=1, p=(n*m)**0.5/(m*n), size=(m, n))
A = np.where(marks, A_base, 0)
true_x = np.random.binomial(n=1, p=0.6, size=n)
b = A @ true_x
tic = time()
x = cp.Variable(n, integer=True)
objective = cp.Minimize(cp.sum(cp.abs(A @ x - b)))
constraints = []
constraints.extend([x >= 0.0, x <= 1])
prob = cp.Problem(objective,constraints)
res=prob.solve() #solver=cp.SCIP,verbose=True,scip_params={"limits/time": 50})
toc = time()
print("Status: ", prob.status)
print("The optimal value is", prob.value)
print("A solution x is")
print(x.value)
print("Non-zero ",sum([i for i in x.value if i is not None ]))
print(f'solver time: {toc-tic : 0.1f} sec.')
Status: optimal
The optimal value is 0.0
A solution x is
[ 1. 1. 1. ... -0. 1. -0.]
Non-zero 1907.0
solver time: 3.5 sec.
[Finished in 5.5s]
import numpy as np
import pyomo.environ as pyo
from time import time
import sys
np.random.seed(0)
m, n = 28_680, 62_500
A_base = (np.random.randint(-1000, 1000, (m, n)))
marks = np.random.binomial(n=1, p=(n*m)**0.5/(m*n), size=(m, n))
A = np.where(marks, A_base, 0)
true_x = np.random.binomial(n=1, p=0.6, size=n)
b = A @ true_x
# b = np.random.randint(-100, 100, n)
# print(b)
tic = time()
model = pyo.ConcreteModel()
model.N = pyo.Set(initialize=range(n))
model.M = pyo.Set(initialize=range(m))
# initializing x only because some values of x[n] may end up unconstrained and this prevents "None" in result
model.x = pyo.Var(model.N, within=pyo.Binary, initialize=0)
model.row_err = pyo.Var(model.M)
model.row_abs_err = pyo.Var(model.M)
# constrain the rowsum/error appropriately
@model.Constraint(model.M)
def rowsum(model, m):
return model.row_err[m] == sum(A[m,n] * model.x[n] for n in model.N if A[m,n]) - b[m]
# constrain the abs error
@model.Constraint(model.M)
def abs_1(model, m):
return model.row_abs_err[m] >= model.row_err[m]
@model.Constraint(model.M)
def abs_2(model, m):
return model.row_abs_err[m] >= -model.row_err[m]
# minimize sum of absolute row errors
model.obj = pyo.Objective(expr=sum(model.row_abs_err[m] for m in model.M))
toc = time()
print(f'starting solver... at time: {toc-tic: 0.1f} sec.')
solver = pyo.SolverFactory('cbc')
res = solver.solve(model)
toc = time()
print(f'total solve time: {toc-tic: 0.1f} sec')
print(res)
# model.x.display()
x = np.array([pyo.value(model.x[n]) for n in model.N], dtype=int)
print(f'verifying solution: {np.sum(A@x - b)}')
starting solver... at time: 228.2 sec.
solver time: 384.8 sec
Problem:
- Name: unknown
Lower bound: 0.0
Upper bound: 0.0
Number of objectives: 1
Number of constraints: 40941
Number of variables: 50488
Number of binary variables: 30839
Number of integer variables: 30839
Number of nonzeros: 20949
Sense: minimize
Solver:
- Status: ok
User time: -1.0
System time: 123.18
Wallclock time: 152.91
Termination condition: optimal
Termination message: Model was solved to optimality (subject to tolerances), and an optimal solution is available.
Statistics:
Branch and bound:
Number of bounded subproblems: 21
Number of created subproblems: 21
Black box:
Number of iterations: 2668
Error rc: 0
Time: 153.19930219650269
Solution:
- number of solutions: 0
number of solutions displayed: 0
verifying solution: 0