I'm working on an RNA-seq analysis and I'm interested in what genes are driving tissue-specific variation in gene expression. PCAsare common in RNA-seq analyses, but most packages (e.g. DESeq2) only take it as far as the 2D plot. Therefore, I've used prcomp and fviz_pca_ind to produce my 2D plot so I can take the analysis a bit further after. However, due to the input data structure, I can't seem to get my grouping information (i.e. tissues) to be included in my prcomp object and as a result can't color-code my samples or produce confidence intervals around my groups.
Packages:
library(ggplot2)
library(factoextra)
library(Deseq2)
I start with a variance-stabilized transformation of a DESeq2 object (apologies for how long this is...):
Even subsetted, the data set was too big - here's a link (hopefully this is acceptable) https://drive.google.com/file/d/1Gtw5GUCAyBVr3MI6CpgsF4n81KZuq8WI/view?usp=sharing
And my PCA code (this is essentially just the function within DESeq2)
####################################################################################################
# Principle component analysis - PRComp
####################################################################################################
# set number of genes to include in PCA
ntop = 500
# calculate the variance for each gene
rv <- rowVars(assay(vst))
# select the ntop genes by variance
select <- order(rv, decreasing=TRUE)[seq_len(min(ntop, length(rv)))]
# perform a PCA on the data in assay(x) for the selected genes
tissue_pca <- prcomp(t(assay(vst)[select,]))
And my visualization code:
# Visualize samples on PC1 and PC2
fviz_pca_ind(tissue_pca,
# col.var = ,
palette = cbPalette,
ellipse.type = "confidence",
repel = TRUE,
mean = FALSE)
Produces this:
As you'll see in the prcomp object, there is no grouping information. Any idea on how I could either 1) include this information within the prcomp object or 2) incorporate this information in the graphics code?fa
Your dput seemed a bit big, but here is a reproducible example - just add habillage
and colors and you are good to go.
suppressPackageStartupMessages(invisible(
lapply(c("DESeq2", "factoextra"),
require, character.only = TRUE)))
set.seed(123)
dds <- makeExampleDESeqDataSet(n=5e3, betaSD=.8)
dds$group <- gl(4,3, labels = paste0("Group_", LETTERS[1:4]))
counts(dds)[, 1:3] <- as.integer(counts(dds)[, 1:3] *
stats::runif(1.5e4, 0.8, 1.2))
vst <- vst(dds)
ntop = 500
rv <- rowVars(assay(vst))
select <- order(rv, decreasing=TRUE)[seq_len(min(ntop, length(rv)))]
tissue_pca <- prcomp(t(assay(vst)[select,]))
fviz_pca_ind(tissue_pca,
habillage = vst$group,
palette = c("blue", "red", "cyan", "magenta"),
ellipse.type = "confidence",
repel = TRUE,
mean = FALSE)
Created on 2020-08-05 by the reprex package (v0.3.0)