I have a data frame, df, with two variables, as given below. Using the code below, I want to get the matrix "mat".
This code works perfectly when unique(df$V1) has 3 values, but it takes a lot of time (>10 hours) when unique(df$V1) has thousands of values.
Data frame:
V1 V2
1 60
1 30
1 38
1 46
2 29
2 35
2 13
2 82
3 100
3 72
3 63
3 45
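For anyone who wants to run the code, the table above can be rebuilt as follows. This is only a sketch: the column types (integer V1, numeric V2) are an assumption, since the post shows just the printed values.

```r
# Rebuild the example data from the table above.
# Column types are assumed; the original post only shows the printed values.
df <- data.frame(
  V1 = rep(1:3, each = 4),
  V2 = c(60, 30, 38, 46, 29, 35, 13, 82, 100, 72, 63, 45)
)
df
```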
Code:
#Unique V1 values
vec <- unique(df$V1)
#Count the <= comparisons for every pair of V1 values
val <- combn(vec, 2, function(x)
sum(outer(df$V2[df$V1 == x[1]], df$V2[df$V1 == x[2]], `<=`)))
val
#[1] 5 14 13
#Create an empty matrix
mat <- matrix(0,length(vec), length(vec))
#Fill the lower triangle of the matrix.
mat[lower.tri(mat)] <- val
mat
Basically, for V1=1 we want to compare all values of V2 with all values of V2 for V1=2 and V1=3, and then repeat the same for V1=2 and V1=3. In other words, for a given value of V1, we want to check whether its values in V2 are less than or equal to the values in V2 for each of the remaining values of V1. For instance, we compare the values in V2 for V1=1 with those for V1=2. If a value in V2 for V1=1 is less than or equal to a value in V2 for V1=2, the comparison returns 1, otherwise 0. For example:
For V1=1->
( 60 > 29 : returns 0,
60 > 35 : returns 0,
60 > 13 : returns 0,
60 < 82 : returns 1,
30 > 29 : returns 0,
30 < 35 : returns 1,
30 > 13 : returns 0,
30 < 82 : returns 1,
38 > 29 : returns 0,
38 > 35 : returns 0,
38 > 13 : returns 0,
38 < 82 : returns 1,
46 > 29 : returns 0,
46 > 35 : returns 0,
46 > 13 : returns 0,
46 < 82 : returns 1)=Sum is 5 (i.e. mat[1,2])
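The 16 comparisons listed above can be reproduced directly in R. This is a minimal sketch with the two V2 groups typed in by hand; the names a, b and cmp are only illustrative.

```r
a <- c(60, 30, 38, 46)    # V2 values where V1 == 1
b <- c(29, 35, 13, 82)    # V2 values where V1 == 2

# cmp[i, j] is TRUE when a[i] <= b[j], i.e. one cell per comparison in the list
cmp <- outer(a, b, `<=`)
dimnames(cmp) <- list(a, b)
cmp

# Number of TRUE cells: the count for the pair (V1 = 1, V1 = 2)
sum(cmp)
#[1] 5
```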
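This should be much faster for this problem and should not use excessive memory, because each group is sorted once and findInterval does the counting instead of a full outer() comparison matrix for every pair. It assumes the values of V1 are the integers 1, 2, ..., n.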
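library(data.table)
setDT(df)  # convert df to a data.table by reference
# Number of groups (assumes V1 takes the integer values 1..numvec)
numvec <- max(df[,V1])
# Sorted V2 values for each group
dl <- lapply(1:numvec, function(i) df[V1 == i, sort(V2)])
# For every pair (x, y), count the pairs (a in group x, b in group y) with a <= b
dmat <- CJ(x=1:numvec, y=1:numvec)[, .(z = sum(findInterval(dl[[y]],dl[[x]]))), .(x,y)]
# Reshape to a numvec x numvec matrix (rows = x, columns = y)
mat <- as.matrix(dcast(dmat, x~y, value.var = 'z')[, -'x'])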
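To see why findInterval gives the same counts, here is a base R sketch of the same idea without data.table. For a sorted vector s, findInterval(b, s) returns, for each element of b, how many elements of s are <= it, so summing those values counts the pairs with a <= b. The names groups, sorted and n are only illustrative, and the data frame is rebuilt by hand from the example.

```r
# Example data from the question (column types assumed)
df <- data.frame(
  V1 = rep(1:3, each = 4),
  V2 = c(60, 30, 38, 46, 29, 35, 13, 82, 100, 72, 63, 45)
)

# Sort the V2 values of each group once
groups <- split(df$V2, df$V1)
sorted <- lapply(groups, sort)
n <- length(sorted)

# mat[i, j] = number of pairs (a in group i, b in group j) with a <= b
mat <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    mat[i, j] <- sum(findInterval(sorted[[j]], sorted[[i]]))
  }
}

# Cross-check one cell against the brute-force outer() count
stopifnot(mat[1, 2] == sum(outer(groups[[1]], groups[[2]], `<=`)))
mat
```

Note the orientation: here the count of 5 for groups 1 and 2 ends up in mat[1, 2], and the matrix is full (both directions plus the diagonal). The combn-based code in the question stores the same count in the lower triangle, at mat[2, 1].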