rupsetrupsetplot

Add labels to upset plot, so the values of intersection would be visible along the number of intersection


I have multiple character vectors containing gene names found in some analyzed samples (cells). I would like to compare those sets and figure out whether they have some genes in common (and if they do, I would like to know how many and which genes are overlapping between which sets). My goal is to visualize those data in some informative, reader-friendly manner. I decided to use upset plot, since the venn diagram for so many groups would be too messy.

Here is a dummy dataset containing 4 groups (although my real life data are >10):

lista <- list(cell_1 = c("Tmed9", "Lrp1", "Dlx3", "Igfbp2", "Ubl3", "Rab5b", 
"Rab1b", "Ppm1g", "Drap1", "Matr3", "Exosc2", "Fstl1", "Rpl38", 
"Sarnp", "Ppp4c", "Znhit1", "Fam20a", "Cnbp", "Rpl27", "Grb14", 
"Gng11", "Cyfip1"), cell_2 = c("Tmed9", "Clec2d", "Cd74", "Uba5", 
"Lars2", "mt-Co1", "mt-Co3", "Lax1", "Cnbp", "Insig1", "Eaf2", 
"Pcbp2", "Top1", "Gm2000", "Rexo2", "Hnrnpa0", "Cox17", "Grk2", 
"Vmp1", "1810037I17Rik", "Pim1", "Lmnb1", "Gm9844", "Eif4a3"), 
    cell_3 = c("Rab1b", "Lars2", "Rpl38", "Cnbp", "H2aj", "M6pr", 
    "B2m", "Il10ra", "H2-D1", "mt-Co3", "Cd74", "Pld4", "Ptp4a2", 
    "Pomp", "mt-Co1", "Ivns1abp", "Plac8", "Myl12b", "Rpl27", 
    "Eif3i", "Csk", "Rgs2", "mt-Cytb"), cell_4 = c("Spon1", "Phactr1", 
    "Lrp1", "Tmed9", "Cit", "Fads2", "Igfbp2", "Camk2b", "Ubqln2", 
    "Ddn", "Kifc2", "Ripor1", "Ubl3", "Rab1b", "Ppm1g", "Samd4b", 
    "Lrrc4b", "Cmtm5", "Git1", "Rph3a", "Matr3", "Pgp", "Fstl1", 
    "Nrgn"))

I want to use this nested list of gene names to create an upset plot in R.

Here is my example code:

library(UpSetR)
UpSetR::upset(fromList(lista), order.by = "freq",nintersects = NA,nsets = 4)

Which produces this figure: enter image description here

I would like to incorporate there, somehow, the actual gene labels, for instance "Tmed9" which contribute to the respective intersections, so I would not only know how many items are common between which sets, but also know which elements exactly are shared between the sets.

I did not find similar question on SO, did not find clues in the vignette nor on Github, so I am posting my question here.


Solution

  • That's an excellent idea. I do not know if this is possible using UpSetR, but if you are willing to try ComplexUpset, you can achieve the following result:

    enter image description here

    First, we need to define fromList equivalent, lets call it from_list:

    from_list <- function(list_data) {
        members = unique(unlist(list_data))
        data.frame(
            lapply(list_data, function(set) members %in% set),
            row.names=members,
            check.names=FALSE
        )
    }
    

    Then we need to add the gene names; here I will re-use the row names shortening the longest gene name so it fits nicely:

    matrix = from_list(lista)
    matrix$gene_name = rownames(matrix)
    matrix[matrix$gene_name == '1810037I17Rik', 'gene_name'] = 'RIKEN'
    head(matrix)
    
    A data.frame: 6 × 5
           cell_1 cell_2  cell_3  cell_4  gene_name
           <lgl>  <lgl>   <lgl>   <lgl>   <chr>
    Tmed9  TRUE   TRUE    FALSE   TRUE    Tmed9
    Lrp1   TRUE   FALSE   FALSE   TRUE    Lrp1
    Dlx3   TRUE   FALSE   FALSE   FALSE   Dlx3
    Igfbp2 TRUE   FALSE   FALSE   TRUE    Igfbp2
    Ubl3   TRUE   FALSE   FALSE   TRUE    Ubl3
    Rab5b  TRUE   FALSE   FALSE   FALSE   Rab5b
    

    And finally, it's time to plot! We will swap the default base_annotations with a custom layer which re-uses intersection_size, adding ggplot2' geom_text layer:

    library(ComplexUpset)
    upset(
        matrix,
        intersect=c('cell_1', 'cell_2', 'cell_3', 'cell_4'),
        base_annotations=list(
            'Intersection size'=(
                intersection_size(
                    bar_number_threshold=1,
                    color='grey9',
                    fill='grey80'
                )
                + geom_text(
                    mapping=aes(label=gene_name),
                    position=position_stack(),
                    na.rm=TRUE,
                    vjust=1.3
                )
            )
        ),
        width_ratio=0.15,
        height_ratio=1/4
    )
    

    For more complex-upset customization see examples here.

    Or without the bars in background:

    library(ComplexUpset)
    upset(
        matrix,
        intersect=c('cell_1', 'cell_2', 'cell_3', 'cell_4'),
        base_annotations=list(
            'Intersection size'=(
                ggplot()
                + aes(
                    x=intersection,
                    y=!!ComplexUpset:::get_mode_presence('distinct'),
                    label=gene_name
                )
                + geom_text(
                    position=position_stack(),
                    na.rm=TRUE,
                    vjust=1.3
                )
            )
        ),
        width_ratio=0.15,
        height_ratio=1/4
    )
    

    Second improvement: if one needs to fit an occasionally longer label, they can replace geom_text with ggfittext::geom_bar_text as follows:

                + ggfittext::geom_bar_text(
                    mapping=aes(label=gene_name),
                    min.size=0,
                    position='stack',
                    contrast=FALSE,
                    vjust=1.1,
                )
    

    enter image description here

    Disclaimer: I am the author of complex-upset.