I am writing an R function that plots differential gene expression data while grouping genes according to the user's interest. For example, GO terms contain functionally related genes, but many genes are shared between groups. I want to warn the user about this overlap (degeneracy) in their gene grouping.
Consider some genes of the Integrated Stress Response (ISR_Genes), the Perk response (Perk_Genes), which is a subset of the ISR, and genes that are transcription factors (Transcription_Genes).
FocusedGenes is a named list; each element holds the genes to highlight for that group:
FocusedGenes <- list(
ISR_Genes = c("Ddit3", "Ptpn2", "Atf4", "Nfe2l2", "Eif2ak4", "Gcn1", "Eif2ak3", "Qrich1", "Bok"),
Perk_Genes = c("Ptpn2", "Atf4", "Nfe2l2", "Eif2ak3", "Qrich1", "Bok"),
Transcription_Genes = c("Ddit3", "Ptpn2", "Atf4", "Nfe2l2", "Hsf1", "Snw1", "Ighmbp2", "Mef2c")
)
All of the Perk_Genes are also ISR_Genes, and some of those genes are involved in Transcription.
library(dplyr)  # provides %>% and mutate(), both used below
DuplicateFocus <- character()
DuplicateFocus <- unlist(FocusedGenes, use.names = FALSE)[duplicated(unlist(FocusedGenes, use.names = FALSE))] %>% unique()
print(DuplicateFocus)
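As a small aside on why the unique() call is needed: duplicated() flags every repeat after the first occurrence, so a gene that sits in three groups is flagged twice. The sketch below uses a made-up toy vector (not the real gene lists) to show this.

```r
# Toy vector (hypothetical values, chosen only for illustration)
x <- c("Atf4", "Bok", "Atf4", "Atf4", "Bok", "Hsf1")

# duplicated() is FALSE for the first occurrence and TRUE for every repeat
is_repeat <- duplicated(x)

# "Atf4" appears three times and is therefore flagged twice,
# so unique() is needed to report each repeated value once
repeated_values <- unique(x[is_repeat])
print(repeated_values)
```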
Having created this vector of duplicated genes, I'd like to return, for each gene, the names of the groups it belongs to.
To preserve the group names when checking membership, I did this for the first gene:
names(FocusedGenes)[lapply(X = lapply(FocusedGenes,unlist),FUN = function(x) {DuplicateFocus[1] %in% x}) == TRUE]
This feels ridiculous, and it seems like it could be done much more simply.
My next thought was to add another layer of lapply, but I expected to run into scope issues when passing variables into functions defined inside other functions.
lapply(DuplicateFocus, function(y) {
names(FocusedGenes)[lapply(X = lapply(FocusedGenes,unlist),FUN = function(x) {y %in% x}) == TRUE]
})
I was under the impression that, because the call is written as lapply(X = var, FUN = function(x)), the variable inside the function had to be named x, but using y worked and avoided reusing the x parameter name.
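For what it's worth, the parameter name of the anonymous function is arbitrary: lapply passes each element to the first argument of FUN, whatever it is called, and an inner function can see the outer function's variables through R's lexical scoping. A minimal sketch with toy data (the names groups, genes, gene and members are made up for this illustration):

```r
# Toy data, hypothetical names used only for this illustration
groups <- list(A = c("g1", "g2"), B = c("g2", "g3"))
genes  <- c("g1", "g2")

# The parameter names `gene` and `members` are arbitrary; lapply passes
# each element positionally to the first argument of FUN.
membership <- lapply(genes, function(gene) {
  # The inner function can read `gene` because it is defined inside the
  # outer function's body (lexical scoping).
  hits <- vapply(groups, function(members) gene %in% members, logical(1))
  names(groups)[hits]
})
names(membership) <- genes
print(membership)
```

Using vapply with logical(1) here returns a plain logical vector, which avoids comparing a list to TRUE.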
To wrap it all up:
DuplicateFocus <- character()
DuplicateFocus <- unlist(FocusedGenes, use.names = FALSE)[duplicated(unlist(FocusedGenes, use.names = FALSE))] %>% unique()
DuplicateFocus <- data.frame(Duplicated_Gene = DuplicateFocus)
DuplicateFocus <- DuplicateFocus %>% mutate(Groups = paste(lapply(Duplicated_Gene, function(y) {
names(FocusedGenes)[lapply(X = lapply(FocusedGenes,unlist),FUN = function(x) {y %in% x}) == TRUE]
})))
print(DuplicateFocus)
In the end this works, but it feels very sloppy and indirect. Is there a more elegant way to do this using purrr or dplyr functions that I haven't understood?
Even if this turns out to be the best way to do it, I figured I'd post it since I couldn't find anything online, and I hope it helps someone.
stack(FocusedGenes) |>
aggregate(ind~values, data=_, \(x) if(length(x)>1L) toString(x) else NA) |>
na.omit()
    values                                        ind
1     Atf4 ISR_Genes, Perk_Genes, Transcription_Genes
2      Bok                      ISR_Genes, Perk_Genes
3    Ddit3             ISR_Genes, Transcription_Genes
4  Eif2ak3                      ISR_Genes, Perk_Genes
10  Nfe2l2 ISR_Genes, Perk_Genes, Transcription_Genes
11   Ptpn2 ISR_Genes, Perk_Genes, Transcription_Genes
12  Qrich1                      ISR_Genes, Perk_Genes
The column names can be changed if needed; values and ind are the defaults from stack().
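As a rough sketch of that renaming step, the same stack()/aggregate() approach can be written with intermediate variables and the columns renamed at the end. The names long and shared, and the column names Duplicated_Gene and Groups (borrowed from the question), are just illustrative choices.

```r
FocusedGenes <- list(
  ISR_Genes = c("Ddit3", "Ptpn2", "Atf4", "Nfe2l2", "Eif2ak4", "Gcn1", "Eif2ak3", "Qrich1", "Bok"),
  Perk_Genes = c("Ptpn2", "Atf4", "Nfe2l2", "Eif2ak3", "Qrich1", "Bok"),
  Transcription_Genes = c("Ddit3", "Ptpn2", "Atf4", "Nfe2l2", "Hsf1", "Snw1", "Ighmbp2", "Mef2c")
)

# Long format: one row per (gene, group) pair; columns are `values` and `ind`
long <- stack(FocusedGenes)

# Collapse the group names per gene; genes found in only one group become NA
shared <- aggregate(ind ~ values, data = long,
                    FUN = function(x) if (length(x) > 1L) toString(x) else NA)

# Drop the single-group genes and give the columns friendlier names
shared <- na.omit(shared)
names(shared) <- c("Duplicated_Gene", "Groups")
print(shared)
```

This form avoids the `_` pipe placeholder, so it should also run on R versions older than 4.2.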