I am currently studying principal component analysis and playing around with the R prcomp
function. My code is as follows:
library(dplyr)

# Log-transform the four measurements and keep only those columns plus Species
iris1 = mutate( iris,
                Species = factor( Species ),
                logSepalLength = log10( Sepal.Length ),
                logSepalWidth = log10( Sepal.Width ),
                logPetalLength = log10( Petal.Length ),
                logPetalWidth = log10( Petal.Width )
              ) %>%
  dplyr::select( Species, starts_with("log") )

# PCA on the unscaled log-transformed variables
iris1.PCA = prcomp( ~ logSepalLength +
                      logSepalWidth +
                      logPetalLength +
                      logPetalWidth,
                    data = iris1, scale. = FALSE )
summary(iris1.PCA)
The output of summary(iris1.PCA) is as follows:
Importance of components:
                          PC1     PC2     PC3     PC4
Standard deviation     0.4979 0.06009 0.05874 0.02337
Proportion of Variance 0.9702 0.01413 0.01350 0.00214
Cumulative Proportion  0.9702 0.98436 0.99786 1.00000
I want to use ggplot to generate a nice scree plot that shows the cumulative contribution to total variance for each principal component. I can do this calculation manually, starting from the covariance matrix, using something like cumsum(eigenvals)/iris1.cov.trace. However, according to summary(iris1.PCA), the prcomp output already calculates the cumulative proportion for us! So how do we use that part of the summary(iris1.PCA) object with ggplot to generate a nice scree plot? I know we can manually copy the output values, but I'm looking for a more automated solution (since hard-coding values is not good software engineering practice).
I found this example of a scree plot using ggplot (although it does not use the cumulative contribution to total variance):
var_explained_df %>%
  ggplot(aes(x=PC, y=var_explained, group=1)) +
  geom_point(size=4) +
  geom_line() +
  labs(title="Scree plot: PCA on scaled data")
Here's an example using the output from the PCA. The sdev element of the summary holds the standard deviation of each principal component. The proportion of variance explained by a component is its squared standard deviation (i.e., its variance) divided by the sum of all of the squared standard deviations.
library(ggplot2)
library(scales)

s <- summary(iris1.PCA)

# One row per component: its label and its share of the total variance
dat <- data.frame(
  component = factor(seq_along(s$sdev), labels=paste0("PC", seq_along(s$sdev))),
  var_explained = s$sdev^2/sum(s$sdev^2)
)

ggplot(dat, aes(y=var_explained)) +
  geom_line(aes(x=component, group=1)) +
  geom_point(aes(x=component)) +
  labs(x="Component", y="% Variance Explained") +
  scale_y_continuous(labels=percent) +
  theme_bw() +
  ggtitle("Scree plot: PCA on log-transformed data")
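For the cumulative version that the question asks about, one option is to read the row straight out of the summary object instead of recomputing it. A minimal sketch, assuming the summary of a prcomp fit exposes an importance matrix with a row named "Cumulative Proportion" (the row label printed in the output above); the names log_iris, pca, imp and cum_df are only illustrative, and the PCA is rebuilt from the built-in iris data so that the block runs on its own:

```r
library(ggplot2)

# Rebuild the PCA from the question: log10 of the four measurements, unscaled
log_iris <- log10(iris[, 1:4])
pca <- prcomp(log_iris, scale. = FALSE)

# Matrix with one column per component and rows
# "Standard deviation", "Proportion of Variance", "Cumulative Proportion"
imp <- summary(pca)$importance

# Keep the component order PC1, PC2, ... by fixing the factor levels
cum_df <- data.frame(
  component  = factor(colnames(imp), levels = colnames(imp)),
  cumulative = imp["Cumulative Proportion", ]
)

p <- ggplot(cum_df, aes(x = component, y = cumulative, group = 1)) +
  geom_line() +
  geom_point(size = 3) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
  labs(x = "Component", y = "Cumulative variance explained",
       title = "Cumulative scree plot") +
  theme_bw()
print(p)
```

The values stored in the importance matrix appear to be rounded for printing, so if full precision matters, cumsum(pca$sdev^2)/sum(pca$sdev^2) should give the same curve without the rounding.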