Consider vector s
as follows:
s=seq(0.01, 0.99, 0.01)
> s
[1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07
0.08 0.09 .......... 0.89 0.90 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99
Now given s
and a fixed length m
, I would like to have a matrix for all possible permutations of length m
such that each row of matrix sums up to 1
(excluding the brute force approach).
For example, if m=4
(i.e. number of columns), the desired matrix would be something like this:
0.01 0.01 0.01 0.97
0.02 0.01 0.01 0.96
0.03 0.01 0.01 0.95
0.04 0.01 0.01 0.94
0.05 0.01 0.01 0.93
0.06 0.01 0.01 0.92
.
.
.
0.53 0.12 0.30 0.05
.
.
.
0.96 0.02 0.01 0.01
0.97 0.01 0.01 0.01
.
.
.
0.01 0.97 0.01 0.01
.
.
.
Here's how to do this using recursion:
permsum <- function(s,m) if (m==1L) matrix(s) else do.call(rbind,lapply(seq_len(s-m+1L),function(x) unname(cbind(x,permsum(s-x,m-1L)))));
res <- permsum(100L,4L);
head(res);
## [,1] [,2] [,3] [,4]
## [1,] 1 1 1 97
## [2,] 1 1 2 96
## [3,] 1 1 3 95
## [4,] 1 1 4 94
## [5,] 1 1 5 93
## [6,] 1 1 6 92
tail(res);
## [,1] [,2] [,3] [,4]
## [156844,] 95 2 2 1
## [156845,] 95 3 1 1
## [156846,] 96 1 1 2
## [156847,] 96 1 2 1
## [156848,] 96 2 1 1
## [156849,] 97 1 1 1
You can divide by 100 to get fractions, as opposed to integers:
head(res)/100;
## [,1] [,2] [,3] [,4]
## [1,] 0.01 0.01 0.01 0.97
## [2,] 0.01 0.01 0.02 0.96
## [3,] 0.01 0.01 0.03 0.95
## [4,] 0.01 0.01 0.04 0.94
## [5,] 0.01 0.01 0.05 0.93
## [6,] 0.01 0.01 0.06 0.92
First let's define the inputs:
s
This is the target value to which each row in the output matrix should sum.m
This is number of columns to produce in the output matrix.It is more efficient and reliable to compute the result using integer arithmetic, as opposed to floating-point arithmetic, so I designed my solution to work only with integers. Hence s
is a scalar integer representing the target integer sum.
Now let's examine the sequence generated by seq_len()
for the non-base case:
seq_len(s-m+1L)
This generates a sequence from 1 to the highest possible value that could be part of a sum to s
with m
columns remaining. For example, think about the case of s=100,m=4
: the highest number we can use is 97, participating in a sum of 97+1+1+1. Each remaining column reduces the highest possible value by 1, which is why we must subtract m
from s
when computing the sequence length.
Each element of the generated sequence should be viewed as one possible "selection" of an addend in the summation.
do.call(rbind,lapply(seq_len(s-m+1L),function(x) ...))
For each of the selections, we must then recurse. We can use lapply()
to do this.
To jump ahead, the lambda will make a single recursive call to permsum()
and then cbind()
the return value with the current selection. This will produce a matrix, always of the same width for this level of recursion. Hence, the lapply()
call will return a list of matrices, all of the same width. We must then row-bind them together, which is why we must use the do.call(rbind,...)
trick here.
unname(cbind(x,permsum(s-x,m-1L)))
The body of the lambda is fairly simple; we cbind()
the current selection x
with the return value of the recursive call, completing the summation for this submatrix. Unfortunately we must call unname()
, otherwise each column that ends up being set from the x
argument will have column name x
.
The most important detail here is the choice of arguments to the recursive call. First, because the lambda argument x
has just been selected out during the current recursive evaluation, we must subtract it from s
to get a new summation target, which the impending recursive call will be responsible for attaining. Hence the first argument becomes s-x
. Second, because the selection of x
takes up one column, we must subtract 1 from m
, so that the recursive call will have one fewer column to produce in its output matrix.
if (m==1L) matrix(s) else ...
Lastly, let's examine the base case. In every evaluation of the recursive function we must check if m
has reached 1, in which case we can simply return the required sum s
itself.
I looked into the discrepancy between my results and psidom's results. For example:
library(data.table);
bgoldst <- function(s,m) permsum(s,m)/s;
psidom <- function(ss,m) { raw <- do.call(data.table::CJ,rep(list(ss),m)); raw[rowSums(raw)==1,]; };
## helper function to sort a matrix by columns
smp <- function(m) m[do.call(order,as.data.frame(m)),];
s <- 100L; m <- 3L; ss <- seq_len(s-1L)/s;
x <- smp(bgoldst(s,m));
y <- smp(unname(as.matrix(psidom(ss,m))));
nrow(x);
## [1] 4851
nrow(y);
## [1] 4809
So there's a 42 row discrepancy between our two results. I decided to try to find exactly which permutations were omitted with the following line of code. Basically, it compares each element of the two matrices and prints the comparison result as a logical matrix. We can scan down the scrollback to find the first differing row. Below is the excerpted output:
x==do.call(rbind,c(list(y),rep(list(NA),nrow(x)-nrow(y))));
## [,1] [,2] [,3]
## [1,] TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE
##
## ... snip ...
##
## [24,] TRUE TRUE TRUE
## [25,] TRUE TRUE TRUE
## [26,] TRUE TRUE TRUE
## [27,] TRUE TRUE TRUE
## [28,] TRUE TRUE TRUE
## [29,] TRUE FALSE FALSE
## [30,] TRUE FALSE FALSE
## [31,] TRUE FALSE FALSE
## [32,] TRUE FALSE FALSE
## [33,] TRUE FALSE FALSE
##
## ... snip ...
So it's at row 29 where we have the first discrepancy. Here's a window around that row in each permutation matrix:
win <- 27:31;
x[win,]; y[win,];
## [,1] [,2] [,3]
## [1,] 0.01 0.27 0.72
## [2,] 0.01 0.28 0.71
## [3,] 0.01 0.29 0.70 (missing from y)
## [4,] 0.01 0.30 0.69 (missing from y)
## [5,] 0.01 0.31 0.68
## [,1] [,2] [,3]
## [1,] 0.01 0.27 0.72
## [2,] 0.01 0.28 0.71
## [3,] 0.01 0.31 0.68
## [4,] 0.01 0.32 0.67
## [5,] 0.01 0.33 0.66
Interestingly, the missing permutations normally do sum to exactly 1 when you compute the sum manually. At first I thought it was data.table's CJ()
function that was doing something strange with floats, but further testing seems to indicate it's something rowSums()
is doing:
0.01+0.29+0.70==1;
## [1] TRUE
ss[1L]+ss[29L]+ss[70L]==1;
## [1] TRUE
rowSums(CJ(0.01,0.29,0.70))==1; ## looks like CJ()'s fault, but wait...
## [1] FALSE
cj <- CJ(0.01,0.29,0.70);
cj$V1+cj$V2+cj$V3==1; ## not CJ()'s fault
## [1] TRUE
rowSums(matrix(c(0.01,0.29,0.70),1L,byrow=T))==1; ## rowSums()'s fault
## [1] FALSE
We can work around this rowSums()
quirk by applying a manual (and somewhat arbitrary) tolerance in the floating-point comparison. To do this we need to take the absolute difference and then perform a less-than comparison against the tolerance:
abs(rowSums(CJ(0.01,0.29,0.70))-1)<1e-10;
## [1] TRUE
Hence:
psidom2 <- function(ss,m) { raw <- do.call(data.table::CJ,rep(list(ss),m)); raw[abs(rowSums(raw)-1)<1e-10,]; };
y <- smp(unname(as.matrix(psidom2(ss,m))));
nrow(y);
## [1] 4851
identical(x,y);
## [1] TRUE
Thanks to Joseph Wood for pointing out that this is really permutations. I originally named my function combsum()
, but I renamed it to permsum()
to reflect this revelation. And, as Joseph suggested, it is possible to modify the algorithm to produce combinations, which can be done as follows, now living up to the name combsum()
:
combsum <- function(s,m,l=s) if (m==1L) matrix(s) else do.call(rbind,lapply(seq((s+m-1L)%/%m,min(l,s-m+1L)),function(x) unname(cbind(x,combsum(s-x,m-1L,x)))));
res <- combsum(100L,4L);
head(res);
## [,1] [,2] [,3] [,4]
## [1,] 25 25 25 25
## [2,] 26 25 25 24
## [3,] 26 26 24 24
## [4,] 26 26 25 23
## [5,] 26 26 26 22
## [6,] 27 25 24 24
tail(res);
## [,1] [,2] [,3] [,4]
## [7148,] 94 3 2 1
## [7149,] 94 4 1 1
## [7150,] 95 2 2 1
## [7151,] 95 3 1 1
## [7152,] 96 2 1 1
## [7153,] 97 1 1 1
This required 3 changes.
First, I added a new parameter l
, which stands for "limit". Basically, in order to guarantee that each recursion generates a unique combination, I enforce that each selection must be less than or equal to any previous selection in the current combination. This required taking the current upper limit as a parameter l
. On the top-level call l
can just be defaulted to s
, which is actually too high anyway for cases where m>1
, but that's not a problem, since it's just one of two upper limits that will be applied during sequence generation.
The second change was of course to pass the latest selection x
as the argument to l
when making the recursive call in the lapply()
lambda.
The final change is the trickiest. The selection sequence must now be computed as follows:
seq((s+m-1L)%/%m,min(l,s-m+1L))
The lower limit had to be raised from the 1 used in permsum()
to the lowest possible selection that would still allow a descending-magnitude combination. The lowest possible selection of course depends on how many columns have yet to be produced; the more columns, the more "room" we have to leave for future selections. The formula is to take an integer division of s
on m
, but we also must effectively "round up", which is why I add m-1L
prior to taking the division. I also considered doing a floating-point division and then calling as.integer(ceiling(...))
, but I think the all-integer approach is much better.
For example, consider the case of s=10,m=3
. To produce a sum of 10 with 3 columns remaining, we cannot make a selection less than 4, because then we would not have enough quantity to produce 10 without ascending along the combination. In this case, the formula divides 12 by 3 to give 4.
The upper limit can be computed from the same formula used in permsum()
, except that we must also apply the current limit l
using a call to min()
.
I've verified that my new combsum()
behaves identically to Joseph's IntegerPartitionsOfLength()
function for many random test cases with the following code:
## helper function to sort a matrix within each row and then by columns
smc <- function(m) smp(t(apply(m,1L,sort)));
## test loop
for (i in seq_len(1000L)) {
repeat {
s <- sample(1:100,1L);
m <- sample(2:5,1L);
if (s>=m) break;
};
x <- combsum(s,m);
y <- IntegerPartitionsOfLength(s,m);
cat(paste0(s,',',m,'\n'));
if (!identical(smc(x),smc(y))) stop('bad.');
};
Common self-contained test code:
library(microbenchmark);
library(data.table);
library(partitions);
library(gtools);
permsum <- function(s,m) if (m==1L) matrix(s) else do.call(rbind,lapply(seq_len(s-m+1L),function(x) unname(cbind(x,permsum(s-x,m-1L)))));
combsum <- function(s,m,l=s) if (m==1L) matrix(s) else do.call(rbind,lapply(seq((s+m-1L)%/%m,min(l,s-m+1L)),function(x) unname(cbind(x,combsum(s-x,m-1L,x)))));
IntegerPartitionsOfLength <- function(n, Lim, combsOnly = TRUE) { a <- 0L:n; k <- 2L; a[2L] <- n; MyParts <- vector("list", length=P(n)); count <- 0L; while (!(k==1L) && k <= Lim + 1L) { x <- a[k-1L]+1L; y <- a[k]-1L; k <- k-1L; while (x<=y && k <= Lim) {a[k] <- x; y <- y-x; k <- k+1L}; a[k] <- x+y; if (k==Lim) { count <- count+1L; MyParts[[count]] <- a[1L:k]; }; }; MyParts <- MyParts[1:count]; if (combsOnly) {do.call(rbind, MyParts)} else {MyParts}; };
GetDecimalReps <- function(s,m) { myPerms <- permutations(m,m); lim <- nrow(myPerms); intParts <- IntegerPartitionsOfLength(s,m,FALSE); do.call(rbind, lapply(intParts, function(x) { unique(t(sapply(1L:lim, function(y) x[myPerms[y, ]]))); })); };
smp <- function(m) m[do.call(order,as.data.frame(m)),];
smc <- function(m) smp(t(apply(m,1L,sort)));
bgoldst.perm <- function(s,m) permsum(s,m)/s;
psidom2 <- function(ss,m) { raw <- do.call(data.table::CJ,rep(list(ss),m)); raw[abs(rowSums(raw)-1)<1e-10,]; };
joseph.perm <- function(s,m) GetDecimalReps(s,m)/s;
bgoldst.comb <- function(s,m) combsum(s,m)/s;
joseph.comb <- function(s,m) IntegerPartitionsOfLength(s,m)/s;
Permutations
## small scale
s <- 10L; m <- 3L; ss <- seq_len(s-1L)/s;
ex <- smp(bgoldst.perm(s,m));
identical(ex,smp(unname(as.matrix(psidom2(ss,m)))));
## [1] TRUE
identical(ex,smp(joseph.perm(s,m)));
## [1] TRUE
microbenchmark(bgoldst.perm(s,m),psidom2(ss,m),joseph.perm(s,m));
## Unit: microseconds
## expr min lq mean median uq max neval
## bgoldst.perm(s, m) 347.254 389.5920 469.1011 420.383 478.7575 1869.697 100
## psidom2(ss, m) 702.206 830.5015 1007.5111 907.265 1038.3405 2618.089 100
## joseph.perm(s, m) 1225.225 1392.8640 1722.0070 1506.833 1860.0745 4411.234 100
## large scale
s <- 100L; m <- 4L; ss <- seq_len(s-1L)/s;
ex <- smp(bgoldst.perm(s,m));
identical(ex,smp(unname(as.matrix(psidom2(ss,m)))));
## [1] TRUE
identical(ex,smp(joseph.perm(s,m)));
## [1] TRUE
microbenchmark(bgoldst.perm(s,m),psidom2(ss,m),joseph.perm(s,m),times=5L);
## Unit: seconds
## expr min lq mean median uq max neval
## bgoldst.perm(s, m) 1.286856 1.304177 1.426376 1.374411 1.399850 1.766585 5
## psidom2(ss, m) 6.673545 7.046951 7.416161 7.115375 7.629177 8.615757 5
## joseph.perm(s, m) 5.299452 10.499891 13.769363 12.680607 15.107748 25.259117 5
## very large scale
s <- 100L; m <- 5L; ss <- seq_len(s-1L)/s;
ex <- smp(bgoldst.perm(s,m));
identical(ex,smp(unname(as.matrix(psidom2(ss,m)))));
## Error: cannot allocate vector of size 70.9 Gb
identical(ex,smp(joseph.perm(s,m)));
## [1] TRUE
microbenchmark(bgoldst.perm(s,m),joseph.perm(s,m),times=1L);
## Unit: seconds
## expr min lq mean median uq max neval
## bgoldst.perm(s, m) 28.58359 28.58359 28.58359 28.58359 28.58359 28.58359 1
## joseph.perm(s, m) 50.51965 50.51965 50.51965 50.51965 50.51965 50.51965 1
Combinations
## small-scale
s <- 10L; m <- 3L;
ex <- smc(bgoldst.comb(s,m));
identical(ex,smc(joseph.comb(s,m)));
## [1] TRUE
microbenchmark(bgoldst.comb(s,m),joseph.comb(s,m));
## Unit: microseconds
## expr min lq mean median uq max neval
## bgoldst.comb(s, m) 161.225 179.6145 205.0898 187.3120 199.5005 1310.328 100
## joseph.comb(s, m) 172.344 191.8025 204.5681 197.7895 205.2735 437.489 100
## large-scale
s <- 100L; m <- 4L;
ex <- smc(bgoldst.comb(s,m));
identical(ex,smc(joseph.comb(s,m)));
## [1] TRUE
microbenchmark(bgoldst.comb(s,m),joseph.comb(s,m),times=5L);
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst.comb(s, m) 409.0708 485.9739 556.4792 591.4774 627.419 668.4548 5
## joseph.comb(s, m) 2164.2134 3315.0138 3317.9725 3540.6240 3713.732 3856.2793 5
## very large scale
s <- 100L; m <- 6L;
ex <- smc(bgoldst.comb(s,m));
identical(ex,smc(joseph.comb(s,m)));
## [1] TRUE
microbenchmark(bgoldst.comb(s,m),joseph.comb(s,m),times=1L);
## Unit: seconds
## expr min lq mean median uq max neval
## bgoldst.comb(s, m) 2.498588 2.498588 2.498588 2.498588 2.498588 2.498588 1
## joseph.comb(s, m) 12.344261 12.344261 12.344261 12.344261 12.344261 12.344261 1