I usually perform Principal Component Analyses with the prcomp
function, and plot the results in a fancy way with ggbiplot
(or otherwise just with ggplot2
extracting pca.obj$x
).
Like this:
#install_github("vqv/ggbiplot")
library(ggbiplot)
data(iris)
pca.obj <- prcomp(iris[,1:4], center=TRUE, scale.=TRUE)
P <- ggbiplot(pca.obj,
obs.scale = 1,
var.scale=1,
ellipse=T,
circle=F,
varname.size=3,
var.axes=T,
groups=iris$Species, #no need for coloring, I'm making the points invisible
alpha=0) #invisible points, I add them below
P$layers <- c(geom_point(aes(color=iris$Species), cex=5), P$layers) #add geom_point in a layer underneath (only way I have to change the size of the points in ggbiplot)
png(filename="test.png", height=600, width=600)
print(#or ggsave()
P
)
dev.off()
HOWEVER, now I am facing data with some amount of NAs, and I am using the pca
wrapper function from the pcaMethods package, applying the nipals
method (an iterative method capable of handling small amounts of missing values).
pca
returns an object of class pcaRes
, and ggbiplot
returns the following error:
#introduce NAs
iris$Sepal.Length[sample(1:150, 5)] <- NA
iris$Sepal.Width[sample(1:150, 5)] <- NA
iris$Petal.Length[sample(1:150, 5)] <- NA
iris$Petal.Width[sample(1:150, 5)] <- NA
#pca.obj2 <- prcomp(iris[,1:4], center=TRUE, scale.=TRUE) #cannot use prcomp with NAs
#source("https://bioconductor.org/biocLite.R")
#biocLite("pcaMethods")
library(pcaMethods)
pca.obj2 <- pca(iris[,1:4], method="nipals", nPcs=3, center=TRUE, scale.=TRUE)
class(pca.obj2)
ggbiplot(pca.obj2)
Error in ggbiplot(pca.obj2) : Expected a object of class prcomp, princomp, PCA, or lda
MY QUESTIONS are:
How can I apply ggbiplot
to a pcaRes
object?
How can I convert this object to a prcomp
object?
Can I obtain the same kind of plot with another function instead of ggbiplot
that accepts a pcaRes
object?
Should I just replace the NA values with the mean of the variable and apply the prcomp
function as usual?
Many thanks!
First of all, good job finding a PCA package that will handle NAs.
Since ggbiplot
will not accept pcaRes
objects, we can use the data obtained by the pcaRes
and sneak it into the original prcomp
object.
Obviously your real data will already contain the NA
values, so we'll start with that data set and swap them out for some dummy values to allow us to run the first prcomp
pca
.
iris_na<-iris
iris_na$Sepal.Length[sample(1:150, 5)] <- NA
iris_na$Sepal.Width[sample(1:150, 5)] <- NA
iris_na$Petal.Length[sample(1:150, 5)] <- NA
iris_na$Petal.Width[sample(1:150, 5)] <- NA
iris_dummy<-iris_na
iris_dummy[is.na(iris_dummy)]<-7777 #swap out your NAs with a dummy number so prcomp will run
We then run the first pca
as you did:
pca.obj <- prcomp(iris_dummy[,1:4], center=TRUE, scale.=TRUE)
There are 5 components of this object, x
(scores), rotation
(loadings), sdev
(standard deviation), center
and scale
. Although I suspect that only the scores and the loadings are used by ggbiplot
, we'll swap them all out just to be sure.
Looking at the scores component pca.obj$x
shows us that four principal components have been calculated in the prcomp
function.
head(pca.obj$x)
# PC1 PC2 PC3 PC4
#[1,] -2.656740 0.3176722 0.03763067 -0.04122948
#[2,] -2.688275 -0.1821744 0.19912795 0.07297624
#[3,] -2.862673 -0.1447518 -0.02134749 -0.02462359
#[4,] -2.718294 -0.3189371 -0.03318459 -0.11675762
#[5,] -2.700864 0.3274887 -0.07503096 -0.11347939
#[6,] -2.252918 0.7436711 -0.14611455 -0.08218007
So when we run the next pca with pcaRes
, we make sure to specify that 4 principal components are calculated using the nPcs
argument. Here we are using the true data, which contains the NAs
.
pca.obj2 <- pca(iris_na[,1:4], method="nipals", nPcs=4, center=TRUE, scale.=TRUE)
It is then just a matter of swapping out the pcaRes
values for the prcomp
values and passing this to ggbiplot
pca.obj$x<-pca.obj2@scores
pca.obj$rotation<-pca.obj2@loadings
pca.obj$sdev<-pca.obj2@sDev
pca.obj$center<-pca.obj2@center
pca.obj$scale<-pca.obj2@scale
P2 <- ggbiplot(pca.obj,
obs.scale = 1,
var.scale=1,
ellipse=T,
circle=F,
varname.size=3,
var.axes=T,
groups=iris$Species,
alpha=0)
P2$layers <- c(geom_point(aes(color=iris$Species), cex=5), P2$layers)