I am using the R bigmemory package and Rcpp to handle big matrices (1 to 10 million columns x 1000 rows). Once I have read an integer matrix consisting of 0, 2 and NA into a file-backed bigmemory matrix in R, I would like to modify all the NA values through C++, in order to impute either the column means or an arbitrary value (I show the latter here).
Below is the Rcpp function I have written, which does not work. My hope was that calling BigNA(mybigmatrix@address) from within R would find the elements in the matrix that are NA and modify their values directly in the backing file.
I think the problem might be in the evaluation of std::isnan(mat[j][i]). I checked this by writing an alternative function that counts the NA values with an accumulator, and indeed it did not count any. But once this is solved, I am also not sure whether the expression mat[j][i] = 1 would modify the value in the backing file. Writing those statements feels intuitive to me, coming from an R background, but they might be wrong.
Any help/suggestion would be very much appreciated.
#include <stdio.h>
#include <Rcpp.h>
#include <bigmemory/MatrixAccessor.hpp>
#include <numeric>
// [[Rcpp::depends(BH, bigmemory)]]
// [[Rcpp::depends(Rcpp)]]
// [[Rcpp::export]]
void BigNA(SEXP pBigMat) {
    /*
     * Replace "NA" values with "1" in a big matrix of 0, 2 and NA.
     */
    // Create the external big.matrix pointer and initialize the matrix accessor
    XPtr<BigMatrix> xpMat(pBigMat);
    MatrixAccessor<int> mat = (*xpMat);
    // Iterate over the elements of the matrix and, when an NA is found, substitute "1"
    for(int i=0; i< xpMat->ncol(); i++){
        for(int j=0; j< xpMat->nrow(); j++){
            if(std::isnan(mat[j][i])){
                mat[j][i] = 1;
            }
        }
    }
}
The problem stems from the difference between NA in R and NAN in C++. MatrixAccessor<int> gives you an accessor for values of type int. Any number in R can be NA, but an int in C++ is never NAN. An optimizing compiler could completely ignore std::isnan(x) where x is of type int, as in your case.
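To make the difference concrete, here is a minimal standalone sketch (plain C++, no Rcpp or bigmemory needed). The helper names isnan_on_int and is_r_na_int are made up for illustration. It relies on the fact that R represents an integer NA as the smallest int, which R's C headers expose as NA_INTEGER.

```cpp
#include <climits>
#include <cmath>

// Roughly what the test in the question amounts to: the int is converted
// to a floating-point value first, and a double made from an int is a
// perfectly ordinary number, never NaN.
bool isnan_on_int(int x) {
    return std::isnan(static_cast<double>(x));
}

// R's integer NA is a sentinel value, the smallest int (NA_INTEGER in
// R's C API). INT_MIN is used directly so this compiles without R headers.
bool is_r_na_int(int x) {
    return x == INT_MIN;
}
```

With these definitions, isnan_on_int(INT_MIN) is false even though INT_MIN is exactly what an integer NA looks like from C++, while is_r_na_int(INT_MIN) is true.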
To fix this, you could either:
1. Use MatrixAccessor<float> (or double). This implies actually storing a different data type.
2. Find out which int value represents NA elements. I think you will find it is INT_MIN in C++ (-2147483648). Replace isnan(x) with x == INT_MIN.
Related: Extracting a column with NA's from a bigmemory object in Rcpp
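The INT_MIN comparison can be sketched as follows. This is a standalone illustration, not the bigmemory API: a std::vector<int> in column-major order stands in for the big.matrix data, and impute_constant and kRNaInt are hypothetical names. As far as I know, MatrixAccessor is indexed column first (mat[col][row]), so the pointer called column below plays the role of mat[col]. If that is right, the mat[j][i] in the question, with j running over rows, also needs its indices swapped.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Same value as NA_INTEGER in R's headers; defined here so the sketch
// compiles without R.
const int kRNaInt = INT_MIN;

// Replace every integer NA in a column-major nrow x ncol buffer with
// `value`. Returns the number of cells that were changed.
std::size_t impute_constant(std::vector<int>& data, std::size_t nrow,
                            std::size_t ncol, int value) {
    assert(data.size() == nrow * ncol);
    std::size_t changed = 0;
    for (std::size_t col = 0; col < ncol; ++col) {
        int* column = data.data() + col * nrow;  // start of this column
        for (std::size_t row = 0; row < nrow; ++row) {
            if (column[row] == kRNaInt) {  // integer NA, not std::isnan
                column[row] = value;
                ++changed;
            }
        }
    }
    return changed;
}
```

Inside the Rcpp function the same comparison would read mat[i][j] == NA_INTEGER, with i the column index and j the row index.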