ralgorithmperformanceknapsack-problemsubset-sum

Subset sum problem with selection constraint in large R dataset


I have a tibble:

sample_tibble <- tibble(
    group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4),
    threshold = c(100, 100, 100, 100, 100, 80, 80, 80, 80, 80, 80, 150, 150, 150, 200, 200, 200, 200, 200, 200),
    id = c("A", "B", "C", "A", "D", "A", "B", "C", "A", "D", "E", "A", "B", "C", "A", "B", "C", "A", "B", "D"),
    value = c(10, 5, 20, 90, 40, 90, 5, 1, 10, 60, 50, 10, 10, 10, 32, 70, 100, 120, 50, 15)
)

sample_tibble
# A tibble: 20 × 4
   group threshold id    value
   <dbl>     <dbl> <chr> <dbl>
 1     1       100 A        10
 2     1       100 B         5
 3     1       100 C        20
 4     1       100 A        90
 5     1       100 D        40
 6     2        80 A        90
 7     2        80 B         5
 8     2        80 C         1
 9     2        80 A        10
10     2        80 D        60
11     2        80 E        50
12     3       150 A        10
13     3       150 B        10
14     3       150 C        10
15     4       200 A        32
16     4       200 B        70
17     4       200 C       100
18     4       200 A       120
19     4       200 B        50
20     4       200 D        15

As shown, each group is given one threshold, and has multiple ids. There can be more than 1 id appearing, but the values differ.

The goal is to select ids until the sum of values is the closest to the threshold; hence, unlike typical Knapsack problem, the sum of values of selected ids can exceed threshold, if the summed value is optimally the closest to the threshold for that group.

The caveat is, for each group, the same id cannot be selected twice.

The solution (in the form of table, ideally) I am looking for would be the following:

solution_tibble
# A tibble: 12 × 4
   group threshold id    value
   <dbl>     <dbl> <chr> <dbl>
 1     1       100 B         5
 2     1       100 A        90
 3     2        80 B         5
 4     2        80 C         1
 5     2        80 A        10
 6     2        80 D        60
 7     3       150 A        10
 8     3       150 B        10
 9     3       150 C        10
10     4       200 A        32
11     4       200 B        70
12     4       200 C       100

Now, this could be done by checking out all powersets and their sums, but the issue is the actual data I'm working with is some hundreds of thousands of groups, and some millions of corresponding ids.

I would like to discuss the optimal way to solve this variation of subset sum problem for a huge dataset.

(I only currently have the greedy Algorithm set up, which obviously does not guarantee the optimal-ness of the sum of selected ids)


Solution

  • I think it could be interpreted as an "Integer Programming" optimization problem, but constrained by the uniqueness of id per group.

    I am not sure how efficient the solution would be in your scaled-up dataset, but you might be inspired a bit if you look into how CVXR helps to solve the problem.


    First, you could define a custom function f, which applies each group

    library(CVXR)
    
    f <- function(df) {
        th <- unique(df$threshold)
        vals <- with(df, split(value, id))
        id <- names(vals)
        d <- expand.grid(vals)
        x <- Variable(ncol(d), boolean = TRUE)
        cost <- Inf
        for (i in seq_len(nrow(d))) {
            C <- as.matrix(d[i, ])
            result <- solve(Problem(Minimize(abs(C %*% x - th))))
            if (result$value < cost) {
                msk <- round(result$getValue(x)) > 0
                out <- data.frame(unique(df$group), th, id[msk], C[msk])
                cost <- result$value
            }
        }
        setNames(out, names(df))
    }
    

    Then, you split sample_tibble and run f for each group

    sample_tibble %>%
        group_split(group) %>%
        map(f) %>%
        bind_rows()
    

    and you will obtain

       group threshold id value
    1      1       100  A    90
    2      1       100  B     5
    3      2        80  A    10
    4      2        80  B     5
    5      2        80  C     1
    6      2        80  D    60
    7      3       150  A    10
    8      3       150  B    10
    9      3       150  C    10
    10     4       200  A    32
    11     4       200  B    70
    12     4       200  C   100