I have a tibble:
library(tibble)

sample_tibble <- tibble(
group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4),
threshold = c(100, 100, 100, 100, 100, 80, 80, 80, 80, 80, 80, 150, 150, 150, 200, 200, 200, 200, 200, 200),
id = c("A", "B", "C", "A", "D", "A", "B", "C", "A", "D", "E", "A", "B", "C", "A", "B", "C", "A", "B", "D"),
value = c(10, 5, 20, 90, 40, 90, 5, 1, 10, 60, 50, 10, 10, 10, 32, 70, 100, 120, 50, 15)
)
sample_tibble
# A tibble: 20 × 4
group threshold id value
<dbl> <dbl> <chr> <dbl>
1 1 100 A 10
2 1 100 B 5
3 1 100 C 20
4 1 100 A 90
5 1 100 D 40
6 2 80 A 90
7 2 80 B 5
8 2 80 C 1
9 2 80 A 10
10 2 80 D 60
11 2 80 E 50
12 3 150 A 10
13 3 150 B 10
14 3 150 C 10
15 4 200 A 32
16 4 200 B 70
17 4 200 C 100
18 4 200 A 120
19 4 200 B 50
20 4 200 D 15
As shown, each group has a single threshold and contains multiple ids. The same id can appear more than once within a group, but with different values.
The goal is, for each group, to select ids so that the sum of their values is as close as possible to the threshold. Hence, unlike the typical knapsack problem, the sum of the selected values may exceed the threshold, if that sum is the closest to the threshold for that group.
The caveat is that, within each group, the same id cannot be selected twice.
The solution (in the form of table, ideally) I am looking for would be the following:
solution_tibble
# A tibble: 12 × 4
group threshold id value
<dbl> <dbl> <chr> <dbl>
1 1 100 B 5
2 1 100 A 90
3 2 80 B 5
4 2 80 C 1
5 2 80 A 10
6 2 80 D 60
7 3 150 A 10
8 3 150 B 10
9 3 150 C 10
10 4 200 A 32
11 4 200 B 70
12 4 200 C 100
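As a quick sanity check (treating the table above as solution_tibble), the per-group sums against the thresholds:

library(dplyr)

solution_tibble %>%
  group_by(group, threshold) %>%
  summarise(total = sum(value), gap = abs(total - threshold), .groups = "drop")
# group 1: total 95  (gap 5);   group 2: total 76  (gap 4)
# group 3: total 30  (gap 120); group 4: total 202 (gap 2)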
Now, this could be done by enumerating all subsets and their sums, but the issue is that the actual data I'm working with has some hundreds of thousands of groups and some millions of corresponding ids.
I would like to discuss the optimal way to solve this variation of the subset sum problem for such a huge dataset.
(I currently only have a greedy algorithm set up, which obviously does not guarantee optimality of the sum of the selected values.)
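For illustration, the kind of greedy I mean (a minimal sketch, not my actual code): walk the values in descending order and keep an id only if adding its value does not move the running sum further from the threshold.

greedy_one_group <- function(df) {
  df <- df[order(-df$value), ]
  th <- df$threshold[1]
  used <- character(0)
  total <- 0
  keep <- logical(nrow(df))
  for (i in seq_len(nrow(df))) {
    # each id can be taken at most once per group
    if (df$id[i] %in% used) next
    # take the value only if it brings the sum at least as close to the threshold
    if (abs(total + df$value[i] - th) <= abs(total - th)) {
      total <- total + df$value[i]
      used <- c(used, df$id[i])
      keep[i] <- TRUE
    }
  }
  df[keep, ]
}

For group 1, this lands on 90 + 20 = 110 (distance 10) instead of the optimal 90 + 5 = 95 (distance 5), which is exactly the kind of sub-optimality I want to avoid.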
I think it could be interpreted as an "Integer Programming" optimization problem, constrained by the uniqueness of id per group.
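For a single group, writing $x_i \in \{0,1\}$ for whether row $i$ (with value $v_i$) is selected, $T$ for the threshold, and $I_j$ for the set of rows carrying id $j$, the problem could be stated as follows, with the absolute value linearized in the usual way:

$$
\begin{aligned}
\min_{x,\,t} \quad & t \\
\text{s.t.} \quad & t \ge \sum_i v_i x_i - T, \qquad t \ge T - \sum_i v_i x_i, \\
& \sum_{i \in I_j} x_i \le 1 \quad \text{for every id } j, \qquad x_i \in \{0,1\}.
\end{aligned}
$$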
I am not sure how efficient the solution would be on your scaled-up dataset, but you might be inspired a bit if you look into how CVXR helps to solve the problem.
First, you could define a custom function f, which is applied to each group:
library(CVXR)

f <- function(df) {
  th <- unique(df$threshold)
  # collect the candidate values of each id, e.g. A -> c(10, 90) for group 1
  vals <- with(df, split(value, id))
  id <- names(vals)
  # one row per combination that picks a single value for every id
  d <- expand.grid(vals)
  # boolean decision variable: is each id selected?
  x <- Variable(ncol(d), boolean = TRUE)
  cost <- Inf
  for (i in seq_len(nrow(d))) {
    C <- as.matrix(d[i, ])
    # minimize the absolute distance between the selected sum and the threshold
    result <- solve(Problem(Minimize(abs(C %*% x - th))))
    # keep this combination if it beats the best distance found so far
    if (result$value < cost) {
      msk <- round(result$getValue(x)) > 0
      out <- data.frame(unique(df$group), th, id[msk], C[msk])
      cost <- result$value
    }
  }
  setNames(out, names(df))
}
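As a side note, the id-uniqueness could probably also be encoded directly as a linear constraint, which would avoid enumerating the expand.grid combinations and solve each group with a single call. A rough sketch (same CVXR calls as above, but a hypothetical variant I have not benchmarked):

f2 <- function(df) {
  th <- unique(df$threshold)
  # incidence matrix: one row per distinct id, marking which rows of df carry it
  A <- t(sapply(unique(df$id), function(j) as.numeric(df$id == j)))
  # boolean decision variable: is each row of df selected?
  x <- Variable(nrow(df), boolean = TRUE)
  prob <- Problem(
    Minimize(abs(t(df$value) %*% x - th)),
    list(A %*% x <= 1)  # each id contributes at most one of its rows
  )
  result <- solve(prob)
  df[round(result$getValue(x)) > 0, ]
}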
Then, you split sample_tibble and run f for each group:
library(dplyr)
library(purrr)

sample_tibble %>%
  group_split(group) %>%
  map(f) %>%
  bind_rows()
and you will obtain
group threshold id value
1 1 100 A 90
2 1 100 B 5
3 2 80 A 10
4 2 80 B 5
5 2 80 C 1
6 2 80 D 60
7 3 150 A 10
8 3 150 B 10
9 3 150 C 10
10 4 200 A 32
11 4 200 B 70
12 4 200 C 100
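Since the groups are independent, the map() step also parallelizes trivially; with your hundreds of thousands of groups, something like furrr might help (a sketch, assuming a multisession backend works on your machine):

library(furrr)
plan(multisession)  # spread the per-group solves over local cores

sample_tibble %>%
  group_split(group) %>%
  future_map(f) %>%
  bind_rows()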