I am dealing with extensive proteomic data from two independent groups (i.e., E1 and E2) with replicates. My data is structured as follows:
> shark_long_AN_E1_E2
# A tibble: 11,154 × 9
  MatchGroup_AUTO   genename    `Unique peptides` name         value Extraction Tissue replicate imputation
  <fct>             <chr>       <fct>             <chr>        <dbl> <fct>      <fct>  <fct>     <chr>
1 XM_038815272.1.p1 UNANNOTATED 8                 E1_AN_1_Imp  1.10  E1         AN     1         E1_AN_1_Imp
2 XM_038815272.1.p1 UNANNOTATED 8                 E1_AN_2_Imp  1.96  E1         AN     2         E1_AN_2_Imp
3 XM_038815272.1.p1 UNANNOTATED 8                 E1_AN_3_Imp  0.994 E1         AN     3         E1_AN_3_Imp
4 XM_038815272.1.p1 UNANNOTATED 8                 E2_AN_1_Imp1 7.05  E2         AN     1         Imp1
5 XM_038815272.1.p1 UNANNOTATED 8                 E2_AN_2_Imp1 7.45  E2         AN     2         Imp1
6 XM_038815272.1.p1 UNANNOTATED 8                 E2_AN_3_Imp1 8.07  E2         AN     3         Imp1
I need to do a t-test between E1 and E2 (Extraction) for each MatchGroup_AUTO separately (multiple hypotheses). Because MatchGroup_AUTO has thousands of levels, I would like to control the FDR by using a permutation approach. However, most of the information I have found does not consider multiple hypotheses, and I am struggling to find a solution.
What I have done so far is the following:
I have calculated the difference of the means between E1 and E2 for each MatchGroup_AUTO from the non-permuted data:
mean_diff <- shark_filtered %>%
  group_by(MatchGroup_AUTO) %>%
  summarise(mean_diff = mean(value[Extraction == 'E1']) - mean(value[Extraction == 'E2']))
Then I permute the values:
set.seed(1979)
n <- nrow(shark_filtered)
P <- 100
variable <- shark_filtered$value
# each column of PermSamples holds one permutation of the whole value vector
PermSamples <- matrix(0, nrow = n, ncol = P)
for (i in 1:P) {
  PermSamples[, i] <- sample(variable, size = n, replace = FALSE)
}
After this step I get lost, because each permuted value needs to stay associated with a level of MatchGroup_AUTO so that I can compute the mean difference for each level of MatchGroup_AUTO.
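To illustrate the kind of thing I am after (just a sketch of my idea, I am not sure it is the right way to set up the permutation; perm_diff and Extraction_perm are placeholder names), I imagine shuffling the Extraction labels within each MatchGroup_AUTO, roughly like this:
library(dplyr)
# Sketch of the idea only: shuffle the Extraction labels within each
# MatchGroup_AUTO and recompute the mean difference (one permutation shown).
perm_diff <- shark_filtered %>%
  group_by(MatchGroup_AUTO) %>%
  mutate(Extraction_perm = sample(Extraction)) %>%
  summarise(mean_diff_perm = mean(value[Extraction_perm == 'E1']) -
              mean(value[Extraction_perm == 'E2']))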
I would appreciate any advice on this.
Thanks!
First, calculate the t tests over all match groups; you can use vapply().
> match_groups <- unique(d$MatchGroup_AUTO)
>
> t_obs <- t(vapply(match_groups |> setNames(match_groups), \(x)
+ unlist(t.test(value ~ Extraction, d[d$MatchGroup_AUTO == x, ])[
+ c('estimate', 'statistic', 'p.value')
+ ]),
+ FUN.VALUE=numeric(4))) |>
+ as.data.frame() |>
+ transform(dif=`estimate.mean in group E1` - `estimate.mean in group E2`) |>
+ subset(select=c(1, 2, 5, 3, 4))
Next, you do essentially the same, with the difference that you permute the group label by doing sample(Extraction) (the default replace=FALSE is used) in the formula, and you replicate() this P times. Note that your permutation universe is limited, which we can check with choose() beforehand.
> P <- 299
> k <- length(unique(d$Extraction))
> stopifnot(choose(median(table(d$MatchGroup_AUTO)), ## for safety
+ median(table(d$MatchGroup_AUTO))/k) > P)
> t_star <- vapply(match_groups |> setNames(match_groups), \(x)
+ replicate(P, unname(t.test(value ~ sample(Extraction),
+ d[d$MatchGroup_AUTO == x, ])$statistic)),
+ FUN.VALUE=numeric(P))
Finally, to get the so-called Monte Carlo p value, according to
p_mc = (Σ_{i=1}^{P} I(|t*_i| ≥ |t_obs|) + 1) / (P + 1)
we do:
> p_mc <- (colSums(t(t(abs(t_star)) >= abs(t_obs[, 'statistic.t']))) + 1)/(P + 1)
Multiple t tests with naïve p values, MC p values, and, as a bonus, BH-corrected ones:
> cbind(t_obs, p_mc, p_bh=p.adjust(t_obs[, 'p.value'], 'BH'))
estimate.mean in group E1 estimate.mean in group E2 dif statistic.t p.value p_mc p_bh
1 4.597076 4.597901 -0.0008253037 -0.0009231967 0.99926953 1.00000000 0.9992695
2 4.643047 6.321108 -1.6780613879 -1.8938434438 0.06823153 0.04666667 0.4747697
3 4.540582 5.031221 -0.4906389452 -0.5561249747 0.58232250 0.56000000 0.8873486
4 5.917108 4.755134 1.1619739948 1.2391911517 0.22495015 0.23666667 0.5141718
5 4.713262 5.386373 -0.6731112674 -0.7835043642 0.43986433 0.42000000 0.8279799
6 5.379462 4.391234 0.9882280469 1.0270542115 0.31320792 0.35666667 0.6681769
7 5.932996 6.479547 -0.5465517359 -0.6270228226 0.53545149 0.55000000 0.8816326
8 4.811961 4.727288 0.0846727284 0.0817098129 0.93542079 0.93000000 0.9992695
9 6.662828 4.894009 1.7688190528 1.8998771278 0.06765757 0.08666667 0.4747697
10 6.321617 4.503693 1.8179238454 1.8503059797 0.07418277 0.10000000 0.4747697
11 6.630730 5.102590 1.5281404531 1.4996137429 0.14428645 0.18000000 0.5102062
12 6.051498 6.492977 -0.4414792845 -0.4464648353 0.65846774 0.67666667 0.9030372
13 5.657552 5.671949 -0.0143969188 -0.0162466510 0.98715067 0.99666667 0.9992695
14 5.059838 5.427452 -0.3676143223 -0.4192296000 0.67805173 0.65333333 0.9030372
15 4.659772 5.417235 -0.7574631747 -0.8456147296 0.40448328 0.37666667 0.8089666
16 5.140637 5.445461 -0.3048237813 -0.3145401554 0.75535823 0.75333333 0.9296717
17 5.296246 6.581564 -1.2853180949 -1.3553979275 0.18540987 0.17000000 0.5102062
18 4.614938 5.165485 -0.5505467871 -0.6031703703 0.55102037 0.57000000 0.8816326
19 5.881448 4.002544 1.8789041787 2.1722002594 0.03810542 0.05666667 0.4747697
20 5.289458 4.232691 1.0567672827 1.3097270311 0.20036480 0.22666667 0.5102062
21 5.924970 5.339297 0.5856728367 0.6646850732 0.51142915 0.56666667 0.8816326
22 4.425766 4.419620 0.0061465230 0.0066061789 0.99477367 0.99666667 0.9992695
23 5.665854 4.161895 1.5039585807 1.6702859383 0.10526746 0.14333333 0.5102062
24 4.517780 5.883866 -1.3660867324 -1.3156712574 0.19834572 0.20000000 0.5102062
25 4.698634 5.051978 -0.3533435981 -0.3815466113 0.70549778 0.70333333 0.9030372
26 5.080060 6.424642 -1.3445819647 -1.2889702875 0.20727129 0.24000000 0.5102062
27 6.135660 5.016718 1.1189425180 1.3489422833 0.18751982 0.18666667 0.5102062
28 5.642470 5.226990 0.4154799856 0.3985038805 0.69311626 0.61666667 0.9030372
29 5.797826 5.887192 -0.0893660131 -0.1007856340 0.92039513 0.89666667 0.9992695
30 6.521022 4.888472 1.6325504372 1.6386794699 0.11224926 0.12000000 0.5102062
31 5.132761 5.020190 0.1125713450 0.1157839337 0.90861092 0.91666667 0.9992695
32 4.954973 7.008125 -2.0531525050 -2.3425604209 0.02676272 0.04666667 0.4747697
Notes:
For genomic data sets even vapply() might run forever. Consider using multi-threading instead: parallel::mclapply() on unix-alikes, parallel::parLapply() on Windows.
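For example, a minimal sketch of how the per-group t tests could be run in parallel with parallel::mclapply() on a unix-alike (the names t_list, t_obs_par and the core count n_cores are illustrative choices, not part of the code above):
library(parallel)

## Sketch only (unix-alikes): run the per-group t tests on several cores.
## n_cores is an assumed setting; adjust it to your machine.
n_cores <- 4
t_list <- mclapply(match_groups |> setNames(match_groups), \(x)
  unlist(t.test(value ~ Extraction, d[d$MatchGroup_AUTO == x, ])[
    c('estimate', 'statistic', 'p.value')
  ]),
  mc.cores = n_cores)
## Bind the per-group results back into one table, as vapply() did above.
t_obs_par <- do.call(rbind, t_list) |> as.data.frame()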
I am not sure, though, whether you can really address multiple hypothesis testing with this sort of permutation, but it might be the technique you are looking for. You might want to ask statistics questions on Cross Validated.
Data:
set.seed(42)
mg <- 32; ex <- 2; r <- 16
d <- expand.grid(MatchGroup_AUTO=1:mg, Extraction=paste0('E', 1:ex), rep=1:r) |>
  transform(value=runif(mg*ex*r, .9, 10))