I'm trying to write a code myself to run the hypergeometric test in R using phyper
.
I have a character vector of upregulated genes: (or these are "red" balls I pulled out from my urn)
gene.up <- c("A", "B", "C", "D")
I also have a character vector of all genes found in my experiment: (or these are all balls - "white" and "red" - that I pulled out from my urn)
gene.background <- c("A", "B", "C", "D", "E", "F")
I also have a list of characters containing pathway information: (or each of the "pathway" is a subset of balls I pulled out from my urn, and in this case, my urn has 5 white and 4 red balls)
gene.pathway.list <- list("pathwayA" = c("A", "F", "G"),
"pathwayB" = c("A", "B", "E", "H"),
"pathwayC" = c("D", "G", "I"))
Now I need to run a hypergeometric test for each pathway in my gene.pathway.list
. So I created an empty data frame to store the pathway names and p values from the hypergeometric test, and created a loop of tests like below.
df <- data.frame(pathway=character(length(gene.pathway.list)), pvalue=numeric(length(gene.pathway.list)))
for (i in c(1:length(gene.pathway.list))) {
df[i,1] <- names(gene.pathway.list[i])
df[i,2] <- phyper(sum(gene.pathway.list[[i]] == gene.up), length(gene.pathway.list[[i]]),
length(unique(unlist(gene.pathway.list))) - length(gene.pathway.list[[i]]),
length(gene.background))
}
However, the output values don't make any sense - for example, my p value for pathway C
is zero, but how can the possibility of pulling out "C"
and "D"
be zero? I'm trying to figure out what went wrong, what did I set up incorrectly?
We could use %in%
instead of ==
for (i in c(1:length(gene.pathway.list))) {
df[i,1] <- names(gene.pathway.list[i])
df[i,2] <- phyper(sum(gene.pathway.list[[i]] %in% gene.up), length(gene.pathway.list[[i]]),
length(unique(unlist(gene.pathway.list))) - length(gene.pathway.list[[i]]),
length(gene.background))
}
-output
> df
pathway pvalue
1 pathwayA 0.1071429
2 pathwayB 0.2142857
3 pathwayC 0.1071429
==
is elementwise comparison operator. The length
s of the lhs and rhs
elements were not the same, thus the shorter length recycles and creates the anomaly. Instead, use %in%