r list lapply named

How do I cross-reference listA against a named listB and extract the names of the listB elements that contain each member of listA?


I am writing a function in R that plots Differential Gene Expression data while grouping genes according to the user's interests. For example, GO terms contain genes that are functionally related; the problem is that many genes are shared between groups. I want to warn the user about this degeneracy in their gene grouping.

Consider some genes of the Integrated Stress Response (ISR_Genes), the Perk response (Perk_Genes), which is a subset of the ISR, and genes that are transcription factors (Transcription_Genes).

FocusedGenes is a named list used to highlight the data belonging to each group:

FocusedGenes <- list(
  ISR_Genes = c("Ddit3", "Ptpn2", "Atf4", "Nfe2l2", "Eif2ak4", "Gcn1", "Eif2ak3", "Qrich1", "Bok"),
  Perk_Genes = c("Ptpn2", "Atf4", "Nfe2l2", "Eif2ak3", "Qrich1", "Bok"),
  Transcription_Genes = c("Ddit3", "Ptpn2", "Atf4", "Nfe2l2", "Hsf1", "Snw1", "Ighmbp2", "Mef2c")
)

All of the Perk_Genes are also ISR_Genes, and some of those genes are involved in Transcription.
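That overlap can be checked directly. A quick sketch using base R set operations on the list defined above (repeated here so the snippet runs on its own; the names perk_in_isr and perk_tf are just illustrative):

```r
FocusedGenes <- list(
  ISR_Genes = c("Ddit3", "Ptpn2", "Atf4", "Nfe2l2", "Eif2ak4", "Gcn1", "Eif2ak3", "Qrich1", "Bok"),
  Perk_Genes = c("Ptpn2", "Atf4", "Nfe2l2", "Eif2ak3", "Qrich1", "Bok"),
  Transcription_Genes = c("Ddit3", "Ptpn2", "Atf4", "Nfe2l2", "Hsf1", "Snw1", "Ighmbp2", "Mef2c")
)

# TRUE when every Perk gene is also listed as an ISR gene
perk_in_isr <- all(FocusedGenes$Perk_Genes %in% FocusedGenes$ISR_Genes)
print(perk_in_isr)

# Perk genes that also appear in the transcription group
perk_tf <- intersect(FocusedGenes$Perk_Genes, FocusedGenes$Transcription_Genes)
print(perk_tf)
```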

library(dplyr)  # provides %>% (and mutate(), used further down)
DuplicateFocus <- unlist(FocusedGenes, use.names = FALSE)[duplicated(unlist(FocusedGenes, use.names = FALSE))] %>% unique()
print(DuplicateFocus)

Having created this vector of duplicated focus genes, I'd like to return, for each element, the names of the groups it belongs to.

To preserve the group names when checking membership, I did this:

names(FocusedGenes)[lapply(X = lapply(FocusedGenes,unlist),FUN = function(x) {DuplicateFocus[1] %in% x}) == TRUE]

This feels ridiculous, and it seems like it could be done much more simply.

My next thought was to add another layer of lapply, but I expected to run into scoping issues when passing variables into functions nested inside other functions.

lapply(DuplicateFocus, function(y) {
  names(FocusedGenes)[lapply(X = lapply(FocusedGenes,unlist),FUN = function(x) {y %in% x}) == TRUE]
})

Because the formal argument is written as lapply(X = var, FUN = function(x)), I was under the impression that the variable inside the function had to be called x, but using y in the outer function worked and avoided reusing the x parameter.
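As far as I understand, this works because of R's lexical scoping: a function can read variables from the environment in which it was created, so the inner function sees the outer y, and the argument names themselves are arbitrary. A minimal sketch with toy data (groups, targets, and hits are made-up names for illustration, not part of the real data):

```r
groups  <- list(A = c("a", "b"), B = c("b", "c"))
targets <- c("a", "b")

hits <- lapply(targets, function(y) {
  # The inner function is created inside the outer one, so it can read y.
  in_group <- vapply(groups, function(x) y %in% x, logical(1))
  names(groups)[in_group]
})
names(hits) <- targets
print(hits)
```

Using vapply() here returns a plain logical vector, which also avoids comparing a list against TRUE.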

To wrap it all up

library(dplyr)  # provides %>% and mutate()
DuplicateFocus <- unlist(FocusedGenes, use.names = FALSE)[duplicated(unlist(FocusedGenes, use.names = FALSE))] %>% unique()
DuplicateFocus <- data.frame(Duplicated_Gene = DuplicateFocus)
DuplicateFocus <- DuplicateFocus %>% mutate(Groups = paste(lapply(Duplicated_Gene, function(y) {
  names(FocusedGenes)[lapply(X = lapply(FocusedGenes,unlist),FUN = function(x) {y %in% x}) == TRUE]
})))
print(DuplicateFocus)

In the end this works, but it feels very sloppy and indirect. Is there a more elegant way to do this using purrr or dplyr functions that I have overlooked?

Even if this is the best way to do things, I figured I'd post it since I couldn't find anything online, and I hope it helps someone.

Output:


Solution

    stack(FocusedGenes) |>
      aggregate(ind~values, data=_, \(x) if(length(x)>1L) toString(x) else NA) |>
      na.omit()

        values                                        ind
    1     Atf4 ISR_Genes, Perk_Genes, Transcription_Genes
    2      Bok                      ISR_Genes, Perk_Genes
    3    Ddit3             ISR_Genes, Transcription_Genes
    4  Eif2ak3                      ISR_Genes, Perk_Genes
    10  Nfe2l2 ISR_Genes, Perk_Genes, Transcription_Genes
    11   Ptpn2 ISR_Genes, Perk_Genes, Transcription_Genes
    12  Qrich1                      ISR_Genes, Perk_Genes


    The column names can be changed if needed; values and ind are the defaults from stack().
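
For readers who want to see the intermediate steps, here is a step-by-step sketch of the same stack()/aggregate() idea, written without the pipe and with the columns renamed at the end. The names long and shared, and the column names Duplicated_Gene and Groups (borrowed from the question), are only suggestions:

```r
FocusedGenes <- list(
  ISR_Genes = c("Ddit3", "Ptpn2", "Atf4", "Nfe2l2", "Eif2ak4", "Gcn1", "Eif2ak3", "Qrich1", "Bok"),
  Perk_Genes = c("Ptpn2", "Atf4", "Nfe2l2", "Eif2ak3", "Qrich1", "Bok"),
  Transcription_Genes = c("Ddit3", "Ptpn2", "Atf4", "Nfe2l2", "Hsf1", "Snw1", "Ighmbp2", "Mef2c")
)

# stack() turns the named list into a long data frame: one row per
# gene/group pair, with columns "values" (gene) and "ind" (group name).
long <- stack(FocusedGenes)

# Collapse the group names for each gene; genes found in only one group get NA.
shared <- aggregate(
  ind ~ values, data = long,
  FUN = function(x) if (length(x) > 1L) toString(as.character(x)) else NA_character_
)
shared <- na.omit(shared)

# Replace the default column names from stack()
names(shared) <- c("Duplicated_Gene", "Groups")
print(shared)
```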