Here is my problem:
There are 5 candies (1,2,3,4,5) - each candy has the same "value" associated with its number (i.e. candy_1 has value=1, candy_2 has value=2, etc.).
In the future, I want to be able to answer questions such as: Is it possible to pick "x" number of candies such that the values of the chosen candies sum to less than "y" and the values of the unchosen candies sum to less than "z"?
The first way to solve this problem that comes to mind is to use the "power set" (https://en.wikipedia.org/wiki/Power_set). That is, if there are "n" candies, the total number of combinations that can be made are 2^n combinations.
Step 1: For example, suppose I have 5 candies - here is a function that will identify the power set:
a <- c(1,2,3,4,5)
power_set <- function(x) {
c(list(NULL), unlist(lapply(seq_along(x), function(i) combn(x, i, simplify = FALSE)), recursive = FALSE), list(x))
}
pset <- power_set(a)
Step 2: Then, for each element in the power set, I need to identify the "unchosen" candies:
find_missing <- function(x, full_set) {
full_set[!full_set %in% x]
}
diffset <- lapply(pset, find_missing, full_set = a)
Step 3: Finally, I can use some data manipulation to bring everything into a data frame and summarize different properties of the selections (e.g. sums and counts):
pset_char <- sapply(pset, function(x) paste(x, collapse = ","))
diffset_char <- sapply(diffset, function(x) paste(x, collapse = ","))
df <- data.frame(pset = pset_char, diff = diffset_char)
df$sum_pset <- sapply(pset, sum)
df$sum_diffset <- sapply(diffset, sum)
df$count_pset <- sapply(pset, length)
df$count_diffset <- sapply(diffset, length)
The final result would look something like this:
pset diff sum_pset sum_diffset count_pset count_diffset
1 1,2,3,4,5 0 15 0 5
2 1 2,3,4,5 1 14 1 4
3 2 1,3,4,5 2 13 1 4
4 3 1,2,4,5 3 12 1 4
5 4 1,2,3,5 4 11 1 4
6 5 1,2,3,4 5 10 1 4
From here, I could then simply select rows that match a certain criteria (e.g. select from df where count_pset <= x & sum_pset <=y & sum_diffset <=z;
)
My Question: The obvious problem with this approach is that the Power Set grows very large as the number of elements increase and this approach will become completely impractical.
Suppose I have 1000 candies (e.g. candy_1 has value = 1,... candy_999 has value = 999, candy_1000 has value = 100) and I am only interested in finding some limited number of combinations that match a certain criteria (i.e. not all of them).
Is there some other approach I can use to avoid generating the entire power set and then searching the results? For example, is there something involving recursion, backtracking, and dynamic programming that can be done to make this problem solvable? Perhaps Evolutionary Algorithms?
From you description, it seems you would like to enumerate all possible subset combinations that satisfies some sum constraints. I think this a variant of the "subset sum" problem, and what you want is to list all the feasible subsets
Below is a function hopefully might solve your problem, where a recursion helper
function is devised to find the constrained subsets
f <- function(a, x, y, z) {
helper <- function(vec, s, k) {
v <- vec[vec <= s]
if (k == 1) {
return(as.list(v))
}
unlist(
lapply(v, \(p) Map(c, p, helper(v[v > p], s - p, k - 1))),
recursive = FALSE
)
}
Filter(
\(v) sum(a) - sum(v) <= z,
unlist(
lapply(
0:x,
\(k) helper(a, y, k)
),
recursive = FALSE
)
)
}
Given input as below
a <- 1:5
x <- 3
y <- 12
z <- 10
you will obtain the all feasible subsets
> f(a, x, y, z)
[[1]]
[1] 5
[[2]]
[1] 1 4
[[3]]
[1] 1 5
[[4]]
[1] 2 3
[[5]]
[1] 2 4
[[6]]
[1] 2 5
[[7]]
[1] 3 4
[[8]]
[1] 3 5
[[9]]
[1] 4 5
[[10]]
[1] 1 2 3
[[11]]
[1] 1 2 4
[[12]]
[1] 1 2 5
[[13]]
[1] 1 3 4
[[14]]
[1] 1 3 5
[[15]]
[1] 1 4 5
[[16]]
[1] 2 3 4
[[17]]
[1] 2 3 5
[[18]]
[1] 2 4 5
[[19]]
[1] 3 4 5
Further, if you want the output presented in a data.frame
, you can run the following steps in addition
lst <- f(a, x, y, z)
dfout <- do.call(
rbind,
lapply(
lst,
\(v)
data.frame(
pset = toString(v),
diffset = toString(setdiff(a, v)),
sum_pset = sum(v),
sum_diffset = sum(a) - sum(v),
cnt_pset = length(v),
cnt_diffset = length(a) - length(v)
)
)
)
which gives
> dfout
pset diffset sum_pset sum_diffset cnt_pset cnt_diffset
1 5 1, 2, 3, 4 5 10 1 4
2 1, 4 2, 3, 5 5 10 2 3
3 1, 5 2, 3, 4 6 9 2 3
4 2, 3 1, 4, 5 5 10 2 3
5 2, 4 1, 3, 5 6 9 2 3
6 2, 5 1, 3, 4 7 8 2 3
7 3, 4 1, 2, 5 7 8 2 3
8 3, 5 1, 2, 4 8 7 2 3
9 4, 5 1, 2, 3 9 6 2 3
10 1, 2, 3 4, 5 6 9 3 2
11 1, 2, 4 3, 5 7 8 3 2
12 1, 2, 5 3, 4 8 7 3 2
13 1, 3, 4 2, 5 8 7 3 2
14 1, 3, 5 2, 4 9 6 3 2
15 1, 4, 5 2, 3 10 5 3 2
16 2, 3, 4 1, 5 9 6 3 2
17 2, 3, 5 1, 4 10 5 3 2
18 2, 4, 5 1, 3 11 4 3 2
19 3, 4, 5 1, 2 12 3 3 2