rpcaprcomp

Confidence intervals in PCA by group with prcomp


I'm working on an RNA-seq analysis and I'm interested in what genes are driving tissue-specific variation in gene expression. PCAsare common in RNA-seq analyses, but most packages (e.g. DESeq2) only take it as far as the 2D plot. Therefore, I've used prcomp and fviz_pca_ind to produce my 2D plot so I can take the analysis a bit further after. However, due to the input data structure, I can't seem to get my grouping information (i.e. tissues) to be included in my prcomp object and as a result can't color-code my samples or produce confidence intervals around my groups.

Packages:

library(ggplot2)
library(factoextra)
library(Deseq2)

I start with a variance-stabilized transformation of a DESeq2 object (apologies for how long this is...):

Even subsetted, the data set was too big - here's a link (hopefully this is acceptable) https://drive.google.com/file/d/1Gtw5GUCAyBVr3MI6CpgsF4n81KZuq8WI/view?usp=sharing

And my PCA code (this is essentially just the function within DESeq2)

####################################################################################################
# Principle component analysis - PRComp
####################################################################################################

# set number of genes to include in PCA
ntop = 500

# calculate the variance for each gene
rv <- rowVars(assay(vst))

# select the ntop genes by variance
select <- order(rv, decreasing=TRUE)[seq_len(min(ntop, length(rv)))]

# perform a PCA on the data in assay(x) for the selected genes
tissue_pca <- prcomp(t(assay(vst)[select,]))

And my visualization code:

# Visualize samples on PC1 and PC2
fviz_pca_ind(tissue_pca,
            # col.var = ,
             palette = cbPalette,
             ellipse.type = "confidence",
             repel = TRUE,
             mean = FALSE) 

Produces this:

enter image description here

As you'll see in the prcomp object, there is no grouping information. Any idea on how I could either 1) include this information within the prcomp object or 2) incorporate this information in the graphics code?fa


Solution

  • Your dput seemed a bit big, but here is a reproducible example - just add habillage and colors and you are good to go.

    suppressPackageStartupMessages(invisible(
      lapply(c("DESeq2", "factoextra"),
             require, character.only = TRUE)))
    set.seed(123)
    dds <- makeExampleDESeqDataSet(n=5e3, betaSD=.8)
    dds$group <- gl(4,3, labels = paste0("Group_", LETTERS[1:4]))
    counts(dds)[, 1:3] <- as.integer(counts(dds)[, 1:3] *
      stats::runif(1.5e4, 0.8, 1.2))
    vst <- vst(dds)
    ntop = 500
    rv <- rowVars(assay(vst))
    select <- order(rv, decreasing=TRUE)[seq_len(min(ntop, length(rv)))]
    tissue_pca <- prcomp(t(assay(vst)[select,]))
    
    fviz_pca_ind(tissue_pca, 
      habillage = vst$group,
      palette = c("blue", "red", "cyan", "magenta"),
      ellipse.type = "confidence",
      repel = TRUE,
      mean = FALSE)
    

    Created on 2020-08-05 by the reprex package (v0.3.0)