I know that there are lots of discussions out there that treat this subject matter... but every time I encounter this, I never find a consistent, satisfying answer.
I'm trying to create a very basic graphical depiction of a principal components analysis model. I always aim to not use packages that automatically generate plots for me because I want the control.
Every time I try to make a PCA plot with loadings, I am stumped by how the canned functions relate their site-specific scores with the model's loading vectors. This is despite the myriad nodes out there treating this matter, most of which just use the canned functions without explaining how the numbers got from a basic PCA model to the biplot (they just use the canned function).
For the example code below, I'll use autoplot. If I make a PCA model and use autoplot, I get a very cool graph. But I want to know how they get these numbers- the scores get rescaled, and I have no idea how the vectors are relativized the way they are on the plot. Can anyone walk me through how I would get these numbers relativized data in dataframes of my own (both scores and vectors) so I can make the aesthetic changes I want without using autoplot??
d <- iris
m1 <- prcomp(d[,1:4], scale=T)
scores <- data.frame(m1$x[,1:2])
library(ggplot2)
#Scores range from about -2.5 to +3
ggplot(scores, aes(x=PC1, y=PC2))+
geom_point()
#Scores range from about -0.15 to 0.22, no clue where the relativized loadings come from
autoplot(m1, loadings = T)
I'll attempt to walk you through and simplify the steps that autoplot
uses to draw a PCA plot, so you can do this yourself quite easily in ggplot.
autoplot
is actually an S3 generic function, so it's more accurate to talk about the method ggfortify:::autoplot.prcomp
uses, since this is the function that is dispatched when you call autoplot
on a prcomp
object.
Let's start with your own example:
library(ggfortify)
library(ggplot2)
d <- iris
m1 <- prcomp(d[, 1:4], scale = TRUE)
scores <- data.frame(m1$x[, 1:2])
The scores are normalized by dividing each column by its own root mean squared error
scores[] <- lapply(scores, function(x) x / sqrt(sum((x - mean(x))^2)))
The loadings are simply obtained from the rotation
member of the prcomp
object:
loadings <- as.data.frame(m1$rotation)[1:2]
There is some internal scaling to ensure that the loadings appear on the same scale as the adjusted PC scores, but as far as I can tell this is simply for visual effect. The scaling amounts to about 0.2 here, and is calculated as follows:
scale <- min(max(abs(scores$PC1))/max(abs(loadings$PC1)),
max(abs(scores$PC2))/max(abs(loadings$PC2))) * 0.8
scale
#> [1] 0.1987812
We now have enough to recreate the autoplot using vanilla ggplot code.
ggplot(scores, aes(x = PC1, y = PC2))+
geom_point() +
geom_segment(data = loadings * scale,
aes(x = 0, y = 0, xend = PC1, yend = PC2),
color = "red", arrow = arrow(angle = 25, length = unit(4, "mm")))
Aside from the axis titles, this is identical to the autoplot:
autoplot(m1, loadings = TRUE)