rpermutationt-test

Permutation-based t test for multiple comparison problem


I am dealing with extensive proteomic data between two independent groups (i.e., E1 and E2) with replicates. My data is structured in the following:

> shark_long_AN_E1_E2
# A tibble: 11,154 × 9
   MatchGroup_AUTO   genename    `Unique peptides` name          value Extraction Tissue replicate imputation 
   <fct>             <chr>       <fct>             <chr>         <dbl> <fct>      <fct>  <fct>     <chr>      
 1 XM_038815272.1.p1 UNANNOTATED 8                 E1_AN_1_Imp   1.10  E1         AN     1         E1_AN_1_Imp
 2 XM_038815272.1.p1 UNANNOTATED 8                 E1_AN_2_Imp   1.96  E1         AN     2         E1_AN_2_Imp
 3 XM_038815272.1.p1 UNANNOTATED 8                 E1_AN_3_Imp   0.994 E1         AN     3         E1_AN_3_Imp
 4 XM_038815272.1.p1 UNANNOTATED 8                 E2_AN_1_Imp1  7.05  E2         AN     1         Imp1       
 5 XM_038815272.1.p1 UNANNOTATED 8                 E2_AN_2_Imp1  7.45  E2         AN     2         Imp1       
 6 XM_038815272.1.p1 UNANNOTATED 8                 E2_AN_3_Imp1  8.07  E2         AN     3         Imp1       

I need to do a t-test between E1 and E2 (Extraction) for each MatchGroup_AUTO separately (multiple hypotheses). Because I have thousands of levels in MatchGroup_AUTO I would like to correct for the FDR by performing a permutation. However, most of the information I have found is not considering multiple hypotheses, and I am struggling to find a solution.

What I have done so far is the following:

I have calculated the difference of the mean between E1 and E2 for each MatchGroup_AUTO from the non-permuted data

mean_diff <- shark_filtered %>%
  group_by(MatchGroup_AUTO) %>%
  summarise(mean_diff = mean(value[Extraction == 'E1']) - mean(value[Extraction == 'E2']))

Then I permutate the values:

set.seed(1979)  

n <- nrow(shark_filtered)  

P <- 100 

variable <- shark_filtered$value  

PermSamples <- matrix(0, nrow = n, ncol = P)

for (i in 1:P){
  PermSamples[, i] <- sample(variable, size = n, replace = FALSE)
}


After this step I get lost, because each value permutated I need to be related to a level in MatchGroup_AUTO, in order to identify the mean differences per each level on MatchGroup_AUTO.

I would appreciate any advice on this.

Thanks!


Solution

  • First, calculate t.tests over all match groups; you can use vapply.

    > match_groups <- unique(d$MatchGroup_AUTO)
    > 
    > t_obs <- t(vapply(match_groups |> setNames(match_groups), \(x) 
    +                   unlist(t.test(value ~ Extraction, d[d$MatchGroup_AUTO == x, ])[
    +                     c('estimate', 'statistic', 'p.value')
    +                   ]),
    +                   FUN.VALUE=numeric(4))) |> 
    +   as.data.frame() |> 
    +   transform(dif=`estimate.mean in group E1` - `estimate.mean in group E2`) |> 
    +   subset(select=c(1, 2, 5, 3, 4))
    

    Next, you do essentially the same, with the difference, that you permute the group label by doing sample(Extraction) (the default replace=FALSE is used) in the formula and replicate P times. Note, that your permutation universe is limited which we can check with choose() beforehand.

    > P <- 299
    > k <- length(unique(d$Extraction))
    > stopifnot(choose(median(table(d$MatchGroup_AUTO)),  ## for safety
    +                  median(table(d$MatchGroup_AUTO))/k) > P)
    
    > t_star <- vapply(match_groups |> setNames(match_groups), \(x) 
    +               replicate(P, unname(t.test(value ~ sample(Extraction), 
    +                                          d[d$MatchGroup_AUTO == x, ])$statistic)),
    +               FUN.VALUE=numeric(P))
    

    Finally, to get the so-called Monte Carlo P value, according to

    p_mc = (Σ (I(|t*_i| ≥ |t_obs|)) + 1) / (P + 1)

    we do:

    > p_mc <- (colSums(t(t(abs(t_star)) >= abs(t_obs[, 'statistic.t']))) + 1)/(P + 1)
    

    Gives

    Multiple t tests with naïve p values, MC p values, and as bonus BH-corrected ones:

    > cbind(t_obs, p_mc, p_bh=p.adjust(t_obs[, 'p.value'], 'BH'))
       estimate.mean in group E1 estimate.mean in group E2           dif   statistic.t    p.value       p_mc      p_bh
    1                   4.597076                  4.597901 -0.0008253037 -0.0009231967 0.99926953 1.00000000 0.9992695
    2                   4.643047                  6.321108 -1.6780613879 -1.8938434438 0.06823153 0.04666667 0.4747697
    3                   4.540582                  5.031221 -0.4906389452 -0.5561249747 0.58232250 0.56000000 0.8873486
    4                   5.917108                  4.755134  1.1619739948  1.2391911517 0.22495015 0.23666667 0.5141718
    5                   4.713262                  5.386373 -0.6731112674 -0.7835043642 0.43986433 0.42000000 0.8279799
    6                   5.379462                  4.391234  0.9882280469  1.0270542115 0.31320792 0.35666667 0.6681769
    7                   5.932996                  6.479547 -0.5465517359 -0.6270228226 0.53545149 0.55000000 0.8816326
    8                   4.811961                  4.727288  0.0846727284  0.0817098129 0.93542079 0.93000000 0.9992695
    9                   6.662828                  4.894009  1.7688190528  1.8998771278 0.06765757 0.08666667 0.4747697
    10                  6.321617                  4.503693  1.8179238454  1.8503059797 0.07418277 0.10000000 0.4747697
    11                  6.630730                  5.102590  1.5281404531  1.4996137429 0.14428645 0.18000000 0.5102062
    12                  6.051498                  6.492977 -0.4414792845 -0.4464648353 0.65846774 0.67666667 0.9030372
    13                  5.657552                  5.671949 -0.0143969188 -0.0162466510 0.98715067 0.99666667 0.9992695
    14                  5.059838                  5.427452 -0.3676143223 -0.4192296000 0.67805173 0.65333333 0.9030372
    15                  4.659772                  5.417235 -0.7574631747 -0.8456147296 0.40448328 0.37666667 0.8089666
    16                  5.140637                  5.445461 -0.3048237813 -0.3145401554 0.75535823 0.75333333 0.9296717
    17                  5.296246                  6.581564 -1.2853180949 -1.3553979275 0.18540987 0.17000000 0.5102062
    18                  4.614938                  5.165485 -0.5505467871 -0.6031703703 0.55102037 0.57000000 0.8816326
    19                  5.881448                  4.002544  1.8789041787  2.1722002594 0.03810542 0.05666667 0.4747697
    20                  5.289458                  4.232691  1.0567672827  1.3097270311 0.20036480 0.22666667 0.5102062
    21                  5.924970                  5.339297  0.5856728367  0.6646850732 0.51142915 0.56666667 0.8816326
    22                  4.425766                  4.419620  0.0061465230  0.0066061789 0.99477367 0.99666667 0.9992695
    23                  5.665854                  4.161895  1.5039585807  1.6702859383 0.10526746 0.14333333 0.5102062
    24                  4.517780                  5.883866 -1.3660867324 -1.3156712574 0.19834572 0.20000000 0.5102062
    25                  4.698634                  5.051978 -0.3533435981 -0.3815466113 0.70549778 0.70333333 0.9030372
    26                  5.080060                  6.424642 -1.3445819647 -1.2889702875 0.20727129 0.24000000 0.5102062
    27                  6.135660                  5.016718  1.1189425180  1.3489422833 0.18751982 0.18666667 0.5102062
    28                  5.642470                  5.226990  0.4154799856  0.3985038805 0.69311626 0.61666667 0.9030372
    29                  5.797826                  5.887192 -0.0893660131 -0.1007856340 0.92039513 0.89666667 0.9992695
    30                  6.521022                  4.888472  1.6325504372  1.6386794699 0.11224926 0.12000000 0.5102062
    31                  5.132761                  5.020190  0.1125713450  0.1157839337 0.90861092 0.91666667 0.9992695
    32                  4.954973                  7.008125 -2.0531525050 -2.3425604209 0.02676272 0.04666667 0.4747697
    

    Notes:


    Data:

    set.seed(42)
    mg <- 32; ex <- 2; r <- 16
    d <- expand.grid(MatchGroup_AUTO=1:mg, Extraction=paste0('E', 1:ex), rep=1:r) |> 
      transform(value=runif(mg*ex*r, .9, 10))