I wonder if someone can figure out a faster way to calculate combinations of elements in a vector. My approach works but is slow with about 6 million elements in the vector.
Test vector
test.vector <- c("335261 344015 537633","22404 132858","254654 355860 488288","219943 373817","331839 404477")
My approach
lapply(strsplit(test.vector, " "), function(x) unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = ""))))
Expected output
[[1]]
[1] "335261344015" "335261537633" "344015537633"
[[2]]
[1] "22404132858"
[[3]]
[1] "254654355860" "254654488288" "355860488288"
[[4]]
[1] "219943373817"
[[5]]
[1] "331839404477"
Here is an answer that is much faster than the OP's solution on large test cases. It doesn't rely on paste; instead, we take advantage of properties of numbers and vectorized operations. We also use comboGeneral from the RcppAlgos package (I am the author), which is much faster than combn and combn_prim from the linked answer for generating combinations of a vector. First, we show the efficiency gains of comboGeneral over the other functions:
## library(gRbase)
library(RcppAlgos)
library(microbenchmark)
options(digits = 4)
options(width = 90)
options(scipen = 99)
microbenchmark(
    gRBase = gRbase::combn_prim(300, 2),
    utils = combn(300, 2),
    RcppAlgos = comboGeneral(300, 2),
    unit = "relative"
)
#> Warning in microbenchmark(gRBase = gRbase::combn_prim(300, 2), utils = combn(300, : less
#> accurate nanosecond times to avoid potential integer overflows
#> Unit: relative
#> expr min lq mean median uq max neval
#> gRBase 104.6 93.79 122.1 91.82 88.26 1259.00 100
#> utils 141.8 127.72 126.2 125.70 128.44 88.04 100
#> RcppAlgos 1.0 1.00 1.0 1.00 1.00 1.00 100
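As a quick sanity check (not part of the original benchmark): combn returns combinations as columns, while comboGeneral returns them as rows, so transposing one should match the other.
## same pairs in the same lexicographic order, just a transposed layout
all.equal(t(combn(10, 2)), comboGeneral(10, 2))  # should be TRUE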
Now, we create a function that generates reproducible random data to pass to our test functions:
makeTestSet <- function(vectorSize, elementSize, mySeed = 42, withRep = FALSE) {
    set.seed(mySeed)
    sapply(1:vectorSize, function(x) {
        ## each element is 2 to elementSize space-separated numbers below 10^6
        paste(sample(10^6, sample(2:elementSize, 1), replace = withRep),
              collapse = " ")
    })
}
makeTestSet(5, 3)
#> [1] "280897 566346" "74362 46208 964632" "226905 326254"
#> [4] "626546 926595 586329" "222181 946797"
That looks good. Now, let's see if setting fixed = TRUE gets us any gains (as suggested above by @MichaelChirico):
bigVec <- makeTestSet(10, 100000)
microbenchmark(standard = strsplit(bigVec, " "),
               withFixed = strsplit(bigVec, " ", fixed = TRUE),
               times = 15, unit = "relative")
#> Unit: relative
#> expr min lq mean median uq max neval
#> standard 6.18 6.134 5.197 6.048 5.678 2.921 15
#> withFixed 1.00 1.000 1.000 1.000 1.000 1.000 15
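The gain comes from fixed = TRUE bypassing the regular-expression engine and splitting on the literal string. As an aside, it also changes semantics when the delimiter contains regex metacharacters (not an issue for a plain space):
## without fixed = TRUE the pattern is a regex, and "." matches any character
strsplit("a.b", ".")[[1]]                # "" "" ""
strsplit("a.b", ".", fixed = TRUE)[[1]]  # "a" "b"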
@MichaelChirico was spot on. Putting it all together we get:
combPairFast <- function(testVec) {
    lapply(strsplit(testVec, " ", fixed = TRUE), function(x) {
        combs <- RcppAlgos::comboGeneral(as.numeric(x), 2)
        ## shift the first number left by the digit count of the second and
        ## add: numerically equivalent to concatenating the two digit strings
        unique(combs[, 1] * 10^(as.integer(log10(combs[, 2])) + 1L) + combs[, 2])
    })
}
## test.vector defined above by OP
test.vector <- c(
    "335261 344015 537633", "22404 132858",
    "254654 355860 488288", "219943 373817",
    "331839 404477"
)
combPairFast(test.vector)
#> [[1]]
#> [1] 335261344015 335261537633 344015537633
#>
#> [[2]]
#> [1] 22404132858
#>
#> [[3]]
#> [1] 254654355860 254654488288 355860488288
#>
#> [[4]]
#> [1] 219943373817
#>
#> [[5]]
#> [1] 331839404477
## OP original code
combPairOP <- function(testVec) {
    lapply(strsplit(testVec, " "), function(x) {
        unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = "")))
    })
}
As stated in the comments by the OP, the maximum number is less than a million (600000 to be exact). That means multiplying one number by at most 10^6 and adding another six-digit number (equivalent to simply concatenating the two digit strings) yields at most 12 digits, so we are guaranteed to stay within the range where base R doubles represent integers exactly (i.e. up to 2^53 - 1). This is good because arithmetic on numbers is much more efficient than operations on strings.
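To make the digit-shift concrete, here it is applied by hand to the first pair from test.vector:
x <- 335261; y <- 344015
nDigits <- as.integer(log10(y)) + 1L  # y has 6 digits
x * 10^nDigits + y                    # 335261344015, same as paste0(x, y)
10^12 < 2^53 - 1                      # worst case (12 digits) is exact: TRUE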
All that is left is to benchmark:
test.vector <- makeTestSet(100, 50)
microbenchmark(
    original = combPairOP(test.vector),
    new_impl = combPairFast(test.vector),
    times = 20,
    unit = "relative"
)
#> Unit: relative
#> expr min lq mean median uq max neval
#> original 31.68 31.55 23.85 30.46 29.14 11.2 20
#> new_impl 1.00 1.00 1.00 1.00 1.00 1.0 20
And on larger vectors:
bigTest.vector <- makeTestSet(1000, 100, mySeed = 22, withRep = TRUE)
## Duplicate values exist
any(sapply(strsplit(bigTest.vector, " ", fixed = TRUE), function(x) {
    any(duplicated(x))
}))
#> [1] TRUE
system.time(t1 <- combPairFast(bigTest.vector))
#> user system elapsed
#> 0.081 0.004 0.086
system.time(t2 <- combPairOP(bigTest.vector))
#> user system elapsed
#> 4.519 0.040 4.560
## results are the same
all.equal(t1, lapply(t2, as.numeric))
#> [1] TRUE