Tags: r, transformation, imputation, hmisc

I'm confused about how to handle missing values and transformations with the Hmisc R package


I'm completely lost attempting to use the Hmisc package for my analysis. I have high-dimensional lipidomics data with many missing values. I usually remove lipids with over 20% missing values and use KNN imputation. However, I'm trying to get into Frank Harrell's methodology by using sparse PCA for data reduction, because I think it's the solution to many of the problems I keep facing.

The problem is that I understand I have to handle the missing values and data transformation before I do the PCA, but I just don't understand how.

I tried to use the transcan() function like this:

library(Hmisc)  # provides transcan() and impute()

set.seed(123)

n_rows <- 30
n_cols <- 100

data_matrix <- matrix(rnorm(n_rows * n_cols, mean = 0, sd = 1), nrow = n_rows, ncol = n_cols)

df <- as.data.frame(data_matrix)
colnames(df) <- paste0("lipid", 1:n_cols)

df[1:5,1] <- NA
df[6,2:5] <- NA
df[c(10,50,100),c(3,6,7)] <- NA


result <- transcan(df, transformed = TRUE, imputed = TRUE)
imputed_data <- impute(result, data = df)
transformed_data <- predict(result, type = "transformed")

But I get an error saying that:

Convergence criterion:Warning in f$xcoef[, 1] * f$xcenter :
  longer object length is not a multiple of shorter object length
Error in cbind(1, xx[j, , drop = FALSE]) %*% xcof : 
  requires numeric/complex matrix/vector arguments

What am I doing wrong?


Solution

  • It looks like there are a few different possible problems.

    The first two are minor details:

    1. Your data has 30 simulated rows, but the line df[c(10,50,100),c(3,6,7)] <- NA expands it to 100 rows. This means rows 31-100 are all-NA. I fixed that by changing that line to df[c(10,15,20),c(3,6,7)] <- NA.

    2. df is a data.frame, but transcan() expects a matrix as input. I'm not sure if this makes any difference or not, but I changed df to a matrix to rule it out as a cause of your problems.
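    Putting fixes #1 and #2 together, the corrected setup would look something like this (a sketch; the mat name is my own):

```r
set.seed(123)

n_rows <- 30
n_cols <- 100

df <- as.data.frame(matrix(rnorm(n_rows * n_cols), nrow = n_rows, ncol = n_cols))
colnames(df) <- paste0("lipid", 1:n_cols)

# introduce missing values without accidentally growing the data frame
df[1:5, 1] <- NA
df[6, 2:5] <- NA
df[c(10, 15, 20), c(3, 6, 7)] <- NA   # row indices now stay within 1:30

mat <- as.matrix(df)   # give transcan() a matrix rather than a data.frame
dim(mat)               # still 30 x 100
```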

    The third problem is more involved, and it's ultimately a statistical issue rather than a coding one. It is what causes the warning Warning in f$xcoef[, 1] * f$xcenter : longer object length is not a multiple of shorter object length, followed by the error Error in cbind(1, xx[j, , drop = FALSE]) %*% xcof : requires numeric/complex matrix/vector arguments. There is a step inside transcan() where it needs to estimate coefficients for the other 99 lipids in order to fill in the missing values of the 100th, and it can't: with fewer than 100 observations there isn't enough data to make all of those estimates identifiable.

    To follow along with what's happening for the third error, you can look at the code for the transcan() function on GitHub, or you can use debug(transcan) in R so that you can step through the code of that function interactively and watch what it's doing the next time you call it.
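    For example (assuming Hmisc is installed; debug() and undebug() are base R):

```r
library(Hmisc)

debug(transcan)    # flag transcan() so the next call drops into the interactive browser
# transcan(as.matrix(df), transformed = TRUE, imputed = TRUE)  # step with n, inspect f, xx, j, quit with Q
undebug(transcan)  # clear the flag when you're done
```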

    The line that causes the warning is line 349 in GitHub:

    xcof <- c(intercept=-sum(f$xcoef[,1] * f$xcenter),
                      f$xcoef[,1])*varconst
    

    This happens because f$xcoef is a 24x24 matrix, but f$xcenter is a length-99 vector, so the elementwise multiplication triggers R's vector-recycling warning.
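    You can reproduce that exact warning with two plain vectors whose lengths don't divide evenly:

```r
# 24 coefficients multiplied elementwise against 99 centering values:
out <- rep(1, 24) * rep(1, 99)
# Warning message:
# In rep(1, 24) * rep(1, 99) :
#   longer object length is not a multiple of shorter object length
length(out)  # 99: R recycled the shorter vector to cover the longer one
```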

    To understand how those objects wound up in that state, you can look at the loop they're in, which starts on line 268 in GitHub:

    for(i in 1:p) {
      lab <- nam[i]
      catg <- lab %in% categorical
      xx <- xt[,-i,drop=FALSE]
    

    Your error is happening on the very first iteration of the loop, so i=1.

    Here p is your number of lipids, and xt is a 30x100 matrix that's a processed version of your input data (30x100 after issue #1 above is fixed). The last line quoted above creates a matrix xx containing all of your lipids except the one the loop is currently processing (i.e. the lipids that will be used to impute missing values for the lipid currently being processed).

    I'll jump ahead a bit, but first I'll mention that on line 295 it creates j as j <- is.na(x[,i]), i.e. j is a TRUE/FALSE vector marking where the lipid currently being processed has missing values in the original data.
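    In miniature (a toy 3x2 matrix standing in for your 30x100 one), those two objects are built like this:

```r
x <- matrix(c(NA, 2, 3,    # column 1: the lipid being processed, with one NA
              4, 5, 6),    # column 2: a complete lipid
            nrow = 3)
i  <- 1
xx <- x[, -i, drop = FALSE]  # every column except the i-th
j  <- is.na(x[, i])          # TRUE where column i is missing
j                            # TRUE FALSE FALSE
```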

    Here is the line where your problems start, line 374 by GitHub's numbering:

    f <- cancor(xx[!j,,drop=FALSE], R[[i]][!j,,drop=FALSE])
    

    This is the line that creates f, the list mentioned in the warning about dimensions of xcoef and xcenter not matching.

    cancor() is a function from the stats package that does canonical correlation for two data matrices. It refers to its two input matrices as x and y and returns a list with results. According to the documentation, xcoef is "estimated coefficients for the x variables" and xcenter is "the values used to adjust the x variables".

    So when you pass cancor() a 25x99 matrix as x (25 rows, because the 5 rows where lipid1 is missing were dropped by !j), the results have 99 values in xcenter, one per column/variable, but xcoef is only 24x24, because 24 is the maximum number of coefficients that are identifiable from 25 rows of data (one degree of freedom is spent on centering).
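    You can reproduce the shape mismatch directly with stats::cancor() and random data the same size as xx[!j, ] on that first iteration (25 complete rows, 99 predictor lipids, one target lipid):

```r
set.seed(1)
x <- matrix(rnorm(25 * 99), nrow = 25)  # like xx[!j, ]: 25 obs, 99 lipids
y <- matrix(rnorm(25), nrow = 25)       # the lipid currently being imputed
f <- cancor(x, y)

length(f$xcenter)  # 99: one centering value per column of x
dim(f$xcoef)       # 24 24: rank-limited by the 25 rows (minus 1 for centering)
# f$xcoef[, 1] * f$xcenter  # would emit the same recycling warning you saw
```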

    So ultimately, problem #3 is "your input data has more variables than observations".

    But I gave you this whole long explanation because I know variables > observations is common with certain types of -omics data, and the fact that you have more variables than observations doesn't mean you have to abandon this approach. Hopefully my explanation (that this all ultimately comes down to under-identification when doing canonical correlation) will help you find your way to a solution.

    It may come down to something as simple as "for each lipid, use only the most relevant other lipids to predict the missing values, and set the coefficients for the remaining lipids to 0", although you would need to choose the "most relevant" lipids yourself; cancor() is just taking the first 24. Or it may require something more involved, to avoid creating patterns in your data related to which lipids were and weren't used to impute other lipids. The for loop that iterates over your columns is itself inside a second for loop, for(iter in 1:iter.max) (line 265), so you might even be able to vary which lipids are used to predict the missing values across iterations of that outer loop. Or you might change your imputation approach entirely and use another method that doesn't have this problem.
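    As a hedged sketch of that first idea (this is not Hmisc's own machinery; the names and the choice of k = 5 are mine), you could impute each incomplete lipid from its k most-correlated complete lipids, which keeps the regression identifiable even with far more variables than observations:

```r
set.seed(123)
n <- 30; p <- 100; k <- 5
mat <- matrix(rnorm(n * p), nrow = n,
              dimnames = list(NULL, paste0("lipid", 1:p)))
mat[1:5, 1] <- NA  # one incomplete lipid, as in the question

for (i in which(colSums(is.na(mat)) > 0)) {
  obs    <- !is.na(mat[, i])
  donors <- which(colSums(is.na(mat)) == 0)             # complete lipids only
  cors   <- abs(cor(mat[obs, i], mat[obs, donors]))     # relevance to lipid i
  top    <- donors[order(cors, decreasing = TRUE)[1:k]] # k most relevant
  fit    <- lm(mat[obs, i] ~ mat[obs, top, drop = FALSE])
  mat[!obs, i] <- cbind(1, mat[!obs, top, drop = FALSE]) %*% coef(fit)
}

anyNA(mat)  # FALSE: all missing values filled in
```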

    Good luck!!