I have the following problem. I have n (typically n = 1000) data points (integers from {1,2,3}, so there are a lot of repeting numbers) and a real number d. I have to choose k<n points (k is given) which minimize the distance between the mean of those k points and point d. This can be expressed as a MILP problem (please see here).
I tried to solve that in R using lpSolve and Rglpk packages but it takes a lot of time to solve it (I tried to solve it for n = 100 points and the code has been running for 40 minutes already). I guess the issue is that there are lots of binary variables (n) and there are also repeating numbers.
library(Rglpk)
set.seed(123)
sampsize <- sample(c(1,2,3),size=100,replace = TRUE)
k <- 50
d <- 86/47
lngth <- length(sampsize)
f.obj <- c(1,rep(0,lngth))
f.con <- matrix(c(0,rep(1,lngth),-1,sampsize/k,1,sampsize/k),nrow=3, byrow = TRUE)
f.dir <- c("==","<=",">=")
f.rhs <- c(k,d,d)
f.types <- c("C",rep("B",lngth))
res <- Rglpk_solve_LP(obj=f.obj,mat=f.con,dir=f.dir,rhs=f.rhs,max=FALSE,types=f.types)
I will be satisfied with a sub-optimal solution. Is there a way to solve it quickly or re-express the problem in a certain way to speed up the algorithm?
I would appreciate any input on this.
CVXR is a much better tool for this:
#
# generate random data
#
set.seed(123)
N <- 100 # sample size
v <- c(1,2,3) # sample from these unique values
M <- length(v) # number of unique values
data <- sample(v, size=N, replace=TRUE)
tab <- table(data) # tabulate
K <- 50 # number of points to choose
target <- 86/47 # target for mean
#
# CVXR model
# see https://cvxr.rbind.io/
#
library(CVXR)
# select a number of values from each bin
select <- Variable(M,integer=T)
# obj: sum of absolute deviations
objective = Minimize(abs(sum(v*select)/K-target))
# include nonnegativity constraints
constraints = list(sum(select)==K, select >= 0, select <= vec(tab))
problem <- Problem(objective, constraints)
sol <- solve(problem,verbose=T)
cat("\n")
cat("Status:",sol$status,"\n")
cat("Objective:",sol$value,"\n")
cat("Solution:",sol$getValue(select),"\n")
Output:
GLPK Simplex Optimizer, v4.65
9 rows, 4 columns, 17 non-zeros
0: obj = 0.000000000e+00 inf = 5.183e+01 (2)
3: obj = 5.702127660e-01 inf = 0.000e+00 (0)
* 4: obj = 1.065814104e-16 inf = 0.000e+00 (0)
OPTIMAL LP SOLUTION FOUND
GLPK Integer Optimizer, v4.65
9 rows, 4 columns, 17 non-zeros
3 integer variables, none of which are binary
Integer optimization begins...
Long-step dual simplex will be used
+ 4: mip = not found yet >= -inf (1; 0)
+ 55: >>>>> 1.021276596e-02 >= 9.787234043e-03 4.2% (52; 0)
+ 56: >>>>> 9.787234043e-03 >= 9.787234043e-03 < 0.1% (16; 36)
+ 56: mip = 9.787234043e-03 >= tree is empty 0.0% (0; 103)
INTEGER OPTIMAL SOLUTION FOUND
Status: optimal
Objective: 0.009787234
Solution: 26 7 17