I have multiple character vectors containing gene names found in some analyzed samples (cells). I would like to compare those sets and figure out whether they have some genes in common (and if they do, I would like to know how many and which genes are overlapping between which sets). My goal is to visualize those data in some informative, reader-friendly manner. I decided to use upset plot, since the venn diagram for so many groups would be too messy.
Here is a dummy dataset containing 4 groups (although my real life data are >10):
lista <- list(cell_1 = c("Tmed9", "Lrp1", "Dlx3", "Igfbp2", "Ubl3", "Rab5b",
"Rab1b", "Ppm1g", "Drap1", "Matr3", "Exosc2", "Fstl1", "Rpl38",
"Sarnp", "Ppp4c", "Znhit1", "Fam20a", "Cnbp", "Rpl27", "Grb14",
"Gng11", "Cyfip1"), cell_2 = c("Tmed9", "Clec2d", "Cd74", "Uba5",
"Lars2", "mt-Co1", "mt-Co3", "Lax1", "Cnbp", "Insig1", "Eaf2",
"Pcbp2", "Top1", "Gm2000", "Rexo2", "Hnrnpa0", "Cox17", "Grk2",
"Vmp1", "1810037I17Rik", "Pim1", "Lmnb1", "Gm9844", "Eif4a3"),
cell_3 = c("Rab1b", "Lars2", "Rpl38", "Cnbp", "H2aj", "M6pr",
"B2m", "Il10ra", "H2-D1", "mt-Co3", "Cd74", "Pld4", "Ptp4a2",
"Pomp", "mt-Co1", "Ivns1abp", "Plac8", "Myl12b", "Rpl27",
"Eif3i", "Csk", "Rgs2", "mt-Cytb"), cell_4 = c("Spon1", "Phactr1",
"Lrp1", "Tmed9", "Cit", "Fads2", "Igfbp2", "Camk2b", "Ubqln2",
"Ddn", "Kifc2", "Ripor1", "Ubl3", "Rab1b", "Ppm1g", "Samd4b",
"Lrrc4b", "Cmtm5", "Git1", "Rph3a", "Matr3", "Pgp", "Fstl1",
"Nrgn"))
I want to use this nested list of gene names to create an upset plot in R.
Here is my example code:
library(UpSetR)
UpSetR::upset(fromList(lista), order.by = "freq",nintersects = NA,nsets = 4)
I would like to incorporate there, somehow, the actual gene labels, for instance "Tmed9" which contribute to the respective intersections, so I would not only know how many items are common between which sets, but also know which elements exactly are shared between the sets.
I did not find similar question on SO, did not find clues in the vignette nor on Github, so I am posting my question here.
That's an excellent idea. I do not know if this is possible using UpSetR, but if you are willing to try ComplexUpset, you can achieve the following result:
First, we need to define fromList
equivalent, lets call it from_list
:
from_list <- function(list_data) {
members = unique(unlist(list_data))
data.frame(
lapply(list_data, function(set) members %in% set),
row.names=members,
check.names=FALSE
)
}
Then we need to add the gene names; here I will re-use the row names shortening the longest gene name so it fits nicely:
matrix = from_list(lista)
matrix$gene_name = rownames(matrix)
matrix[matrix$gene_name == '1810037I17Rik', 'gene_name'] = 'RIKEN'
head(matrix)
A data.frame: 6 × 5 cell_1 cell_2 cell_3 cell_4 gene_name <lgl> <lgl> <lgl> <lgl> <chr> Tmed9 TRUE TRUE FALSE TRUE Tmed9 Lrp1 TRUE FALSE FALSE TRUE Lrp1 Dlx3 TRUE FALSE FALSE FALSE Dlx3 Igfbp2 TRUE FALSE FALSE TRUE Igfbp2 Ubl3 TRUE FALSE FALSE TRUE Ubl3 Rab5b TRUE FALSE FALSE FALSE Rab5b
And finally, it's time to plot! We will swap the default base_annotations
with a custom layer which re-uses intersection_size
, adding ggplot2' geom_text
layer:
library(ComplexUpset)
upset(
matrix,
intersect=c('cell_1', 'cell_2', 'cell_3', 'cell_4'),
base_annotations=list(
'Intersection size'=(
intersection_size(
bar_number_threshold=1,
color='grey9',
fill='grey80'
)
+ geom_text(
mapping=aes(label=gene_name),
position=position_stack(),
na.rm=TRUE,
vjust=1.3
)
)
),
width_ratio=0.15,
height_ratio=1/4
)
For more complex-upset customization see examples here.
Or without the bars in background:
library(ComplexUpset)
upset(
matrix,
intersect=c('cell_1', 'cell_2', 'cell_3', 'cell_4'),
base_annotations=list(
'Intersection size'=(
ggplot()
+ aes(
x=intersection,
y=!!ComplexUpset:::get_mode_presence('distinct'),
label=gene_name
)
+ geom_text(
position=position_stack(),
na.rm=TRUE,
vjust=1.3
)
)
),
width_ratio=0.15,
height_ratio=1/4
)
Second improvement: if one needs to fit an occasionally longer label, they can replace geom_text
with ggfittext::geom_bar_text
as follows:
+ ggfittext::geom_bar_text(
mapping=aes(label=gene_name),
min.size=0,
position='stack',
contrast=FALSE,
vjust=1.1,
)
Disclaimer: I am the author of complex-upset.