rplotpcaggbiplotbiplot

R: How to use ggbiplot with pcaRes object? plot PCA results of data with missing values


I usually perform Principal Component Analyses with the prcomp function, and plot the results in a fancy way with ggbiplot (or otherwise just with ggplot2 extracting pca.obj$x).

Like this:

#install_github("vqv/ggbiplot")
library(ggbiplot)
data(iris)
pca.obj <- prcomp(iris[,1:4], center=TRUE, scale.=TRUE)
P <- ggbiplot(pca.obj,
         obs.scale = 1, 
         var.scale=1,
         ellipse=T,
         circle=F,
         varname.size=3,
         var.axes=T,
         groups=iris$Species, #no need for coloring, I'm making the points invisible
         alpha=0) #invisible points, I add them below
P$layers <- c(geom_point(aes(color=iris$Species), cex=5), P$layers) #add geom_point in a layer underneath (only way I have to change the size of the points in ggbiplot)
png(filename="test.png", height=600, width=600)
print(#or ggsave()
    P
)
dev.off()

test1

HOWEVER, now I am facing data with some amount of NAs, and I am using the pca wrapper function from the pcaMethods package, applying the nipals method (an iterative method capable of handling small amounts of missing values).

pca returns an object of class pcaRes, and ggbiplot returns the following error:

#introduce NAs
iris$Sepal.Length[sample(1:150, 5)] <- NA
iris$Sepal.Width[sample(1:150, 5)] <- NA
iris$Petal.Length[sample(1:150, 5)] <- NA
iris$Petal.Width[sample(1:150, 5)] <- NA
#pca.obj2 <- prcomp(iris[,1:4], center=TRUE, scale.=TRUE) #cannot use prcomp with NAs
#source("https://bioconductor.org/biocLite.R")
#biocLite("pcaMethods")
library(pcaMethods)
pca.obj2 <- pca(iris[,1:4], method="nipals", nPcs=3, center=TRUE, scale.=TRUE)
class(pca.obj2)
ggbiplot(pca.obj2)

Error in ggbiplot(pca.obj2) : Expected a object of class prcomp, princomp, PCA, or lda

MY QUESTIONS are:

How can I apply ggbiplot to a pcaRes object?

How can I convert this object to a prcomp object?

Can I obtain the same kind of plot with another function instead of ggbiplot that accepts a pcaRes object?

Should I just replace the NA values with the mean of the variable and apply the prcomp function as usual?

Many thanks!


Solution

  • First of all, good job finding a PCA package that will handle NAs.

    Since ggbiplot will not accept pcaRes objects, we can use the data obtained by the pcaRes and sneak it into the original prcomp object.

    Obviously your real data will already contain the NA values, so we'll start with that data set and swap them out for some dummy values to allow us to run the first prcomp pca.

    iris_na<-iris
    
    iris_na$Sepal.Length[sample(1:150, 5)] <- NA
    iris_na$Sepal.Width[sample(1:150, 5)] <- NA
    iris_na$Petal.Length[sample(1:150, 5)] <- NA
    iris_na$Petal.Width[sample(1:150, 5)] <- NA
    
    iris_dummy<-iris_na
    
    iris_dummy[is.na(iris_dummy)]<-7777 #swap out your NAs with a dummy number so prcomp will run
    

    We then run the first pca as you did:

    pca.obj <- prcomp(iris_dummy[,1:4], center=TRUE, scale.=TRUE)
    

    There are 5 components of this object, x (scores), rotation (loadings), sdev (standard deviation), center and scale. Although I suspect that only the scores and the loadings are used by ggbiplot, we'll swap them all out just to be sure.

    Looking at the scores component pca.obj$x shows us that four principal components have been calculated in the prcomp function.

    head(pca.obj$x)

    #           PC1        PC2         PC3         PC4
    #[1,] -2.656740  0.3176722  0.03763067 -0.04122948
    #[2,] -2.688275 -0.1821744  0.19912795  0.07297624
    #[3,] -2.862673 -0.1447518 -0.02134749 -0.02462359
    #[4,] -2.718294 -0.3189371 -0.03318459 -0.11675762
    #[5,] -2.700864  0.3274887 -0.07503096 -0.11347939
    #[6,] -2.252918  0.7436711 -0.14611455 -0.08218007
    

    So when we run the next pca with pcaRes, we make sure to specify that 4 principal components are calculated using the nPcs argument. Here we are using the true data, which contains the NAs.

    pca.obj2 <- pca(iris_na[,1:4], method="nipals", nPcs=4, center=TRUE, scale.=TRUE)

    It is then just a matter of swapping out the pcaRes values for the prcomp values and passing this to ggbiplot

    pca.obj$x<-pca.obj2@scores 
    
    pca.obj$rotation<-pca.obj2@loadings 
    
    pca.obj$sdev<-pca.obj2@sDev
    
    pca.obj$center<-pca.obj2@center
    
    pca.obj$scale<-pca.obj2@scale
    
    P2 <- ggbiplot(pca.obj,
                  obs.scale = 1, 
                  var.scale=1,
                  ellipse=T,
                  circle=F,
                  varname.size=3,
                  var.axes=T,
                  groups=iris$Species, 
                  alpha=0) 
    P2$layers <- c(geom_point(aes(color=iris$Species), cex=5), P2$layers)
    

    enter image description here