I have a problem with picking the top n% largest and smallest elements from each row of a data matrix. Specifically, I would like to find the column numbers of those top n% elements. This would not be a problem if each row had the same number of non-NA elements, but in this situation the number of picked elements is different for each row. Here's an example of the situation (the real data matrix is 195x1030, so I won't be using it here), where the top 40% are picked:
data=
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1   NA  100   98  200   78   80   35   NA    55
[2,]   32   67   15   73   NA   12   91  230    3    99
[3,]   NA   NA   NA   45   53   26  112   64   80    41
[4,]   54   38   60   70  163   69  109  205    5    31
[5,]  107   28  296  254   30   40   NA   18   28    90
The resulting matrices of top 40% column numbers should look like these (the number of picked elements is calculated by rounding down, as the function as.integer does):
largest=                     smallest=
     [,1] [,2] [,3] [,4]          [,1] [,2] [,3] [,4]
[1,]    5    3    4   NA     [1,]    1    8   10   NA
[2,]    8   10    7   NA     [2,]    9    6    3   NA
[3,]    7    9   NA   NA     [3,]    6   10   NA   NA
[4,]    8    5    7    4     [4,]    9   10    2    1
[5,]    3    4    1   10     [5,]    8    9    2    5
So the top numbers are selected by looking only at the non-NA elements of each row. For example, the first row of the data matrix contains only 8 non-NA numbers, so 0.4 * 8 = 3.2, rounded down to 3, elements are selected. This creates the NAs in the resulting matrices.
Once again, I tried using a for-loop (this code is for finding the largest 40%):
largest <- matrix(rep(NA, 20), nrow = 5)
for(i in 1:5){
largest[i,]<-order(data[i,], decreasing=T)[1:as.integer(0.4*nrow(data[complete.cases(data[,i]),]))]
}
but R returns an error: "number of items to replace is not a multiple of replacement length". I think this means that, since not all the elements of the original largest matrix are replaced while looping, this for-loop can't be used. Am I right?
How could this sort of picking be done?
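The following reproduces your expected output (with your matrix stored as mat), except for row 5: that row has 9 non-NA values, so rounding 0.4 * 9 = 3.6 down gives 3 picks rather than 4. Columns 2 and 9 of that row both hold 28, and order() lists such ties in column order.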
# Determine number of columns for output matrix as
# maximum of 40% of all non-NA values per row
ncol <- max(floor(apply(mat, 1, function(x) sum(!is.na(x))) * 0.4))
# Top 40% largest
t(apply(mat, 1, function(x) {
    # Number of entries to pick in this row
    n <- floor(sum(!is.na(x)) * 0.4)
    # seq_len(n) is safe when n is 0, unlike 1:n
    replace(rep(NA, ncol), seq_len(n), order(x, decreasing = TRUE)[seq_len(n)])
}))
# [,1] [,2] [,3] [,4]
#[1,] 5 3 4 NA
#[2,] 8 10 7 NA
#[3,] 7 9 NA NA
#[4,] 8 5 7 4
#[5,] 3 4 1 NA
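# Top 40% smallest
t(apply(mat, 1, function(x) {
    # Number of entries to pick in this row
    n <- floor(sum(!is.na(x)) * 0.4)
    # order() puts NAs last by default, so they are never picked
    replace(rep(NA, ncol), seq_len(n), order(x, decreasing = FALSE)[seq_len(n)])
}))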
# [,1] [,2] [,3] [,4]
#[1,] 1 8 10 NA
#[2,] 9 6 3 NA
#[3,] 6 10 NA NA
#[4,] 9 10 2 1
#[5,] 8 2 9 NA
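Since the two calls differ only in the sort direction, they could be folded into one helper. This is a minimal sketch, not a tested general solution: top_cols and its arguments m and p are names made up here, and the matrix is rebuilt from your example so the snippet runs on its own.

```r
# Example matrix from the question, stored as `mat` as in the code above
mat <- matrix(c(
    1,  NA, 100,  98, 200, 78,  80,  35, NA, 55,
   32,  67,  15,  73,  NA, 12,  91, 230,  3, 99,
   NA,  NA,  NA,  45,  53, 26, 112,  64, 80, 41,
   54,  38,  60,  70, 163, 69, 109, 205,  5, 31,
  107,  28, 296, 254,  30, 40,  NA,  18, 28, 90
), nrow = 5, byrow = TRUE)

# top_cols() is a hypothetical helper, not part of base R.
# m: numeric matrix, p: fraction of non-NA entries to pick per row.
top_cols <- function(m, p = 0.4, decreasing = TRUE) {
  n_pick <- floor(rowSums(!is.na(m)) * p)  # picks per row, rounded down
  width  <- max(n_pick)                    # columns of the result
  rows <- lapply(seq_len(nrow(m)), function(i) {
    # order() sorts NAs last by default, so they are never among the picks
    idx <- order(m[i, ], decreasing = decreasing)[seq_len(n_pick[i])]
    length(idx) <- width                   # pad with NA up to the common width
    idx
  })
  do.call(rbind, rows)
}

largest  <- top_cols(mat, decreasing = TRUE)
smallest <- top_cols(mat, decreasing = FALSE)
```

Changing p gives other percentages; the sketch assumes at least one row yields a pick, so the result has at least one column.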
Explanation: We first determine the maximum number of columns needed for both output matrices. We then go through mat row by row, compute the row-specific number n of entries that corresponds to 40% of the non-NA values in that row, and return a vector holding the column indices of the n largest (or smallest) entries, padded with NAs. apply collects these vectors as columns, so a final transpose gives the expected output.
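As for the error in your loop: your reading is essentially right. R raises that message when the vector on the right-hand side has a length that does not divide the number of slots being assigned, and largest[i,] always has 4 slots. Note also that data[,i] selects column i, not row i, so the count in your loop is taken along the wrong dimension. A for-loop can still be used if only the first k slots of each row are assigned. The following is a sketch under those assumptions; n_pick and k are names introduced here, and the matrix is your example data.

```r
# Example matrix from the question
data <- matrix(c(
    1,  NA, 100,  98, 200, 78,  80,  35, NA, 55,
   32,  67,  15,  73,  NA, 12,  91, 230,  3, 99,
   NA,  NA,  NA,  45,  53, 26, 112,  64, 80, 41,
   54,  38,  60,  70, 163, 69, 109, 205,  5, 31,
  107,  28, 296, 254,  30, 40,  NA,  18, 28, 90
), nrow = 5, byrow = TRUE)

# Picks per row: count the non-NA entries along each row, then round down
n_pick <- as.integer(0.4 * rowSums(!is.na(data)))

# Pre-fill with NA; rows with fewer picks keep their trailing NAs
largest <- matrix(NA_integer_, nrow = nrow(data), ncol = max(n_pick))

for (i in seq_len(nrow(data))) {
  k <- n_pick[i]
  # Assign into the first k slots only, so both sides have length k
  largest[i, seq_len(k)] <- order(data[i, ], decreasing = TRUE)[seq_len(k)]
}
```

Using decreasing = FALSE in the same loop would give the smallest entries instead.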