r loops matrix picking

Picking top n% of elements from matrix rows, different number of elements on each row


I have a problem with picking the top n% largest and smallest elements from each row of a data matrix. Specifically, I would like to find the column numbers of those top n% elements. This would not be a problem if each row had the same number of non-NA elements, but in this situation the number of picked elements is different for each row. Here's an example of the situation (the real data matrix is 195x1030, so I won't be using it here), where the top 40% are picked:

data=     
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1   NA   100  98   200  78   80   35   NA    55
[2,]   32   67   15   73   NA   12   91   230  3     99
[3,]   NA   NA   NA   45   53   26   112  64   80    41
[4,]   54   38   60   70   163  69   109  205  5     31
[5,]   107  28   296  254  30   40   NA   18   28    90

The resulting top 40% column-number matrices should look like this (the number of picked elements is calculated by rounding down, as the function as.integer does):

largest=                              smallest=
      [,1] [,2] [,3] [,4]                   [,1] [,2] [,3] [,4]  
[1,]    5   3    4    NA              [1,]    1   8    10   NA
[2,]    8   10   7    NA              [2,]    9   6    3    NA
[3,]    7   9    NA   NA              [3,]    6   10   NA   NA
[4,]    8   5    7    4               [4,]    9   10   2    1
[5,]    3   4    1    10              [5,]    8   9    2    5

So the top numbers are selected looking only at the non-NA elements of each row. For example, the first row of the data matrix contains only 8 non-NA numbers, so 40% * 8 = 3.2, rounded down to 3, elements are selected. This is what creates the NAs in the resulting matrices.
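
The per-row counts described above can be checked with a minimal sketch like the following. It assumes the example matrix is stored in `data`; the names `n_valid` and `n_pick` are only illustrative.

```r
# Example matrix from the question (5 rows, 10 columns)
data <- matrix(c(
    1,  NA, 100,  98, 200,  78,  80,  35,  NA,  55,
   32,  67,  15,  73,  NA,  12,  91, 230,   3,  99,
   NA,  NA,  NA,  45,  53,  26, 112,  64,  80,  41,
   54,  38,  60,  70, 163,  69, 109, 205,   5,  31,
  107,  28, 296, 254,  30,  40,  NA,  18,  28,  90
), nrow = 5, byrow = TRUE)

# Number of non-NA values in each row
n_valid <- rowSums(!is.na(data))

# 40% of that count, rounded down
n_pick <- floor(0.4 * n_valid)
n_pick
# [1] 3 3 2 4 3
```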

Once again, I tried using a for-loop (this code is for finding the largest 40%):

    largest <- matrix(rep(NA, 20), nrow = 5)
    for (i in 1:5) {
        largest[i, ] <- order(data[i, ], decreasing = T)[1:as.integer(0.4 * nrow(data[complete.cases(data[, i]), ]))]
    }

but R returns an error: "number of items to replace is not a multiple of replacement length", which I think means that since not all the elements of the original largest matrix are replaced while looping, this for-loop can't be used. Am I right?

How could this sort of picking be done?


Solution

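  • The following reproduces your expected output, except in row 5: that row has 9 non-NA values, so floor(0.4 * 9) = 3 elements are picked rather than 4, and the tied values of 28 in columns 2 and 9 are returned in column order.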

    # `mat` is the example matrix from the question (called `data` there).
    # Determine the number of columns of the output matrix as the
    # maximum over rows of floor(40% of the non-NA count)
    ncol <- max(floor(apply(mat, 1, function(x) sum(!is.na(x))) * 0.4))
    
    # Top 40% largest
    t(apply(mat, 1, function(x) {
        n <- floor(sum(!is.na(x)) * 0.4)
        # seq_len(n) is empty when n is 0, whereas 1:n would be c(1, 0)
        replace(rep(NA, ncol), seq_len(n), order(x, decreasing = TRUE)[seq_len(n)])
    }))
    #     [,1] [,2] [,3] [,4]
    #[1,]    5    3    4   NA
    #[2,]    8   10    7   NA
    #[3,]    7    9   NA   NA
    #[4,]    8    5    7    4
    #[5,]    3    4    1   NA
    
    
    # Top 40% smallest
    t(apply(mat, 1, function(x) {
        n <- floor(sum(!is.na(x)) * 0.4)
        # seq_len(n) is empty when n is 0, whereas 1:n would be c(1, 0)
        replace(rep(NA, ncol), seq_len(n), order(x, decreasing = FALSE)[seq_len(n)])
    }))
    #     [,1] [,2] [,3] [,4]
    #[1,]    1    8   10   NA
    #[2,]    9    6    3   NA
    #[3,]    6   10   NA   NA
    #[4,]    9   10    2    1
    #[5,]    8    2    9   NA
    

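    Explanation: We first determine the number of columns shared by both output matrices, which is the largest per-row pick count. We then go through mat row by row with apply, compute the row-specific number n as 40% of the non-NA entries in that row (rounded down), and return the column indices of the n largest/smallest entries, padded with NAs to length ncol. Because order() places NAs last by default, the first n indices always refer to non-NA values. apply returns these vectors as columns, so the final transpose gives the expected row-wise layout.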
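
    The steps in the explanation can also be wrapped into one reusable function. This is only a sketch under stated assumptions: `top_frac_cols` and its arguments `m`, `frac` and `largest` are hypothetical names, not part of base R, and it fills a preallocated matrix in a loop instead of using apply, so rows with zero picks and a one-column result need no special handling.

```r
# Hypothetical helper: column indices of the top `frac` share of the
# non-NA values in each row, padded with NA to a common width.
top_frac_cols <- function(m, frac = 0.4, largest = TRUE) {
    # Per-row number of picks: share of the non-NA count, rounded down
    n_pick <- floor(rowSums(!is.na(m)) * frac)

    # Preallocate the result; its width is the largest per-row pick count
    out <- matrix(NA_integer_, nrow = nrow(m), ncol = max(n_pick))

    for (i in seq_len(nrow(m))) {
        # seq_len() yields an empty index when a row has zero picks
        k <- seq_len(n_pick[i])
        # order() puts NA last by default, so the first picks are non-NA
        out[i, k] <- order(m[i, ], decreasing = largest)[k]
    }
    out
}

# Example matrix from the question
mat <- matrix(c(
    1,  NA, 100,  98, 200,  78,  80,  35,  NA,  55,
   32,  67,  15,  73,  NA,  12,  91, 230,   3,  99,
   NA,  NA,  NA,  45,  53,  26, 112,  64,  80,  41,
   54,  38,  60,  70, 163,  69, 109, 205,   5,  31,
  107,  28, 296, 254,  30,  40,  NA,  18,  28,  90
), nrow = 5, byrow = TRUE)

largest  <- top_frac_cols(mat, frac = 0.4, largest = TRUE)
smallest <- top_frac_cols(mat, frac = 0.4, largest = FALSE)
```

    With the example matrix, `largest` and `smallest` should match the two outputs shown above. A row consisting only of NAs simply stays all NA in the result.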