I'm using limma::removeBatchEffect() to correct batch effects in RNA-seq data after variance-stabilizing transformation. However, my PCA still shows strong clustering by batch.
Here’s a minimal example of the code I'm using:
library(DESeq2)
library(limma)
# Example VST matrix from DESeq2
vsd_mat <- assay(vsd) # genes × samples matrix
# Metadata with batch and biological group (e.g. cancer subtype)
design_mat <- model.matrix(~ category, data = metadata)
# Attempt batch correction
corrected <- removeBatchEffect(vsd_mat,
batch = metadata$batch,
design = design_mat)
After correction, I run PCA:
pca <- prcomp(t(corrected))
plot(pca$x[,1:2], col = metadata$batch) # Still clusters by batch
Question:
Why is removeBatchEffect() not eliminating batch structure in PCA, and how can I improve this?
By default, removeBatchEffect only subtracts batch means. If your batches have different variances, you may also want to scale:
corrected2 <- removeBatchEffect(vsd_mat,
batch = metadata$batch,
design = design_mat,
scale = TRUE)