I have a vector V
of consecutive integers of length l
, e.g., 1, 2, 3, 4, 5, 6, 7
. I want to find all subsets of size k
such that the difference between any two numbers in the subset can be no less than m
, e.g., 2
. Using the example above with l = 7
, k = 3
, and m = 2
, the subsets are
1, 3, 5
1, 3, 6
1, 3, 7
1, 4, 6
1, 4, 7
1, 5, 7
2, 4, 6
2, 4, 7
2, 5, 7
3, 5, 7
One approach is to enumerate all possible subsets of size k
and discard any that fail to meet the m
constraint, but this procedure explodes even when the number of solutions is small.
My current approach is a brute-force algorithm in which I start from the subset with the lowest possible integer and increase the last entry by 1 until the upper limit is reached, increment the previous entry and reset the last entry to the lowest it can be given the increase in the previous entry. That is, I start with 1, 3, 5
, then increase the last digit by one to get 1, 3, 6
and 1, 3, 7
, and then since the upper limit is reached I increment the middle by 1 (to 4
) and set the last digit to the smallest it can be given that digit (to 6
) to get 1, 4, 6
, and carry on as such. This ends up being quite slow in R for large l
, and I'm wondering if there is a clever vectorized solution that can instantly generate all the combinations, possible by capitalizing on the sequential nature of the entries. Implementing this algorithm in Rcpp
speeds this up a bit, but I'm still hoping for a more elegant solution if one is available.
Here are several base R options
We can define recursive function like below
f0 <- function(v, k, m) {
if (k == 1) {
return(matrix(v))
}
d <- Recall(v, k - 1, m)
u <- unique(d[, ncol(d)])
uu <- (u + m)[(u + m) %in% v]
lst <- list()
for (i in u) {
dd <- d[d[, ncol(d)] == i, , drop = FALSE]
p <- uu[uu - i >= m]
if (length(p) > 0) {
lst <- append(
lst,
list(cbind(dd[rep(1:nrow(dd), each = length(p)), , drop = FALSE], p))
)
}
}
unname(do.call(rbind, lst))
}
and we can obtain
> f0(v = 1:7, k = 3, m = 2)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 1 3 6
[3,] 1 3 7
[4,] 1 4 6
[5,] 1 4 7
[6,] 2 4 6
[7,] 2 4 7
[8,] 1 5 7
[9,] 2 5 7
[10,] 3 5 7
> f0(v = 1:10, k = 3, m = 2)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 1 3 6
[3,] 1 3 7
[4,] 1 3 8
[5,] 1 3 9
[6,] 1 3 10
[7,] 1 4 6
[8,] 1 4 7
[9,] 1 4 8
[10,] 1 4 9
[11,] 1 4 10
[12,] 2 4 6
[13,] 2 4 7
[14,] 2 4 8
[15,] 2 4 9
[16,] 2 4 10
[17,] 1 5 7
[18,] 1 5 8
[19,] 1 5 9
[20,] 1 5 10
[21,] 2 5 7
[22,] 2 5 8
[23,] 2 5 9
[24,] 2 5 10
[25,] 3 5 7
[26,] 3 5 8
[27,] 3 5 9
[28,] 3 5 10
[29,] 1 6 8
[30,] 1 6 9
[31,] 1 6 10
[32,] 2 6 8
[33,] 2 6 9
[34,] 2 6 10
[35,] 3 6 8
[36,] 3 6 9
[37,] 3 6 10
[38,] 4 6 8
[39,] 4 6 9
[40,] 4 6 10
[41,] 1 7 9
[42,] 1 7 10
[43,] 2 7 9
[44,] 2 7 10
[45,] 3 7 9
[46,] 3 7 10
[47,] 4 7 9
[48,] 4 7 10
[49,] 5 7 9
[50,] 5 7 10
[51,] 1 8 10
[52,] 2 8 10
[53,] 3 8 10
[54,] 4 8 10
[55,] 5 8 10
[56,] 6 8 10
for
-loops approachYou can simply run with for
loops like below
f1 <- function(v, k, m) {
res <- matrix(v)
p <- v
for (i in 1:(k - 1)) {
q <- (p + m)[(p + m) %in% v]
lst <- list()
for (j in 1:nrow(res)) {
s <- q[q - res[j, i] >= m]
if (length(s) > 0) {
lst <- append(
lst,
list(cbind(res[rep(j, each = length(s)), , drop = FALSE], s))
)
}
}
res <- unname(do.call(rbind, lst))
p <- q
}
res
}
and will obtain
> f1(v = 1:7, k = 3, m = 2)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 1 3 6
[3,] 1 3 7
[4,] 1 4 6
[5,] 1 4 7
[6,] 2 4 6
[7,] 2 4 7
[8,] 1 5 7
[9,] 2 5 7
[10,] 3 5 7
> f1(v = 1:10, k = 3, m = 2)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 1 3 6
[3,] 1 3 7
[4,] 1 3 8
[5,] 1 3 9
[6,] 1 3 10
[7,] 1 4 6
[8,] 1 4 7
[9,] 1 4 8
[10,] 1 4 9
[11,] 1 4 10
[12,] 2 4 6
[13,] 2 4 7
[14,] 2 4 8
[15,] 2 4 9
[16,] 2 4 10
[17,] 1 5 7
[18,] 1 5 8
[19,] 1 5 9
[20,] 1 5 10
[21,] 2 5 7
[22,] 2 5 8
[23,] 2 5 9
[24,] 2 5 10
[25,] 3 5 7
[26,] 3 5 8
[27,] 3 5 9
[28,] 3 5 10
[29,] 1 6 8
[30,] 1 6 9
[31,] 1 6 10
[32,] 2 6 8
[33,] 2 6 9
[34,] 2 6 10
[35,] 3 6 8
[36,] 3 6 9
[37,] 3 6 10
[38,] 4 6 8
[39,] 4 6 9
[40,] 4 6 10
[41,] 1 7 9
[42,] 1 7 10
[43,] 2 7 9
[44,] 2 7 10
[45,] 3 7 9
[46,] 3 7 10
[47,] 4 7 9
[48,] 4 7 10
[49,] 5 7 9
[50,] 5 7 10
[51,] 1 8 10
[52,] 2 8 10
[53,] 3 8 10
[54,] 4 8 10
[55,] 5 8 10
[56,] 6 8 10
Reduce
approachAnother option is using Reduce
, which iteratively increases the dimension of the resulting data.frame
by adding eligible elements from v
as a new column.
f2 <- function(v, k, m) {
helper <- function(df, v) {
u <- unique(df[[length(df)]])
v <- (u + m)[(u + m) %in% v]
grp <- split(df, df[length(df)])
lst <- lapply(
grp,
\(x) {
p <- v[v - x[[length(x)]][1] >= 2]
if (length(p) > 0) {
cbind(x[rep(1:nrow(x), each = length(p)), ], p)
}
}
)
as.data.frame(`row.names<-`(unname(do.call(rbind, lst)), NULL))
}
Reduce(helper, rep(list(v), k - 1), init = as.data.frame(v))
}
and you will obtain
> f2(v = 1:7, k = 3, m = 2)
1 1 3 5
2 1 3 6
3 1 3 7
4 1 4 6
5 1 4 7
6 2 4 6
7 2 4 7
8 1 5 7
9 2 5 7
10 3 5 7
> f2(v = 1:10, k = 3, m = 2)
1 1 3 5
2 1 3 6
3 1 3 7
4 1 3 8
5 1 3 9
6 1 3 10
7 1 4 6
8 1 4 7
9 1 4 8
10 1 4 9
11 1 4 10
12 2 4 6
13 2 4 7
14 2 4 8
15 2 4 9
16 2 4 10
17 1 5 7
18 1 5 8
19 1 5 9
20 1 5 10
21 2 5 7
22 2 5 8
23 2 5 9
24 2 5 10
25 3 5 7
26 3 5 8
27 3 5 9
28 3 5 10
29 1 6 8
30 1 6 9
31 1 6 10
32 2 6 8
33 2 6 9
34 2 6 10
35 3 6 8
36 3 6 9
37 3 6 10
38 4 6 8
39 4 6 9
40 4 6 10
41 1 7 9
42 1 7 10
43 2 7 9
44 2 7 10
45 3 7 9
46 3 7 10
47 4 7 9
48 4 7 10
49 5 7 9
50 5 7 10
51 1 8 10
52 2 8 10
53 3 8 10
54 4 8 10
55 5 8 10
56 6 8 10
The benchmark includes all existing answers to this question
v <- 1:20
k <- 4
m <- 2
microbenchmark(
f_AC = f(v, k, m), # Allan Cameron's solution
f_TIC0 = f0(v, k, m), # ThomasIsCoding's solution 0
f_TIC1 = f1(v, k, m), # ThomasIsCoding's solution 1
f_TIC2 = f2(v, k, m), # ThomasIsCoding's solution 2
times = 20L,
unit = "relative"
)
and we see
Unit: relative
expr min lq mean median uq max neval
f_AC 82.677137 69.78411 43.01843 83.497758 81.922518 6.791794 20
f_TIC0 1.000000 1.00000 1.00000 1.000000 1.000000 1.000000 20
f_TIC1 7.731772 7.74679 4.75455 8.012004 7.592652 1.316215 20
f_TIC2 19.764325 16.68923 11.65485 17.963988 24.911318 2.359821 20
For a pressure test with v <- 1:100
, k <- 5
and m <- 2
, the performance of recursions f
(by @Allan Cameron) and f0
(by @ThomasIsCoding) are shown as below
> system.time(
+ res <- f0(v = 1:100, k = 5, m = 2)
+ )
user system elapsed
3.33 1.99 5.39
> system.time(
+ res <- f(v = 1:100, k = 5, m = 2)
+ )
user system elapsed
146.70 4.17 157.02