rloopsfor-loopstatisticskruskal-wallis

R: Kruskal-Wallis test in loop over specified columns in data frame


I would like to run a KW-test over certain numerical variables from a data.frame, using one grouping variable. I'd prefer to do this in a loop, instead of typing out all the tests, as they are many variables (more than in the example below).

Simulated data:

library(dplyr)
set.seed(123)
Data <- tbl_df(
data.frame(
muttype = as.factor(rep(c("missense", "frameshift", "nonsense"), each = 80)),
ados.tsc   = runif(240, 0, 10),
ados.sa    = runif(240, 0, 10),
ados.rrb   = runif(240, 0, 10))
) %>%
group_by(muttype)
ados.sim <- as.data.frame(Data)

The following code works just fine outside of the loop.

kruskal.test(formula (paste((colnames(ados.sim)[2]), "~ muttype")), data = 
ados.sim)

But it doesn't inside the loop:

for(i in names(ados.sim[,2:4])){  
ados.mtp <- kruskal.test(formula (paste((colnames(ados.sim)[i]), "~ muttype")), 
data = ados.sim)
}

I get the error:

Error in terms.formula(formula, data = data) : invalid term in model formula

Anybody who knows how to solve this?


Solution

  • Try:

    results <- list()
    for(i in names(ados.sim[,2:4])){  
      results[[i]] <- kruskal.test(formula(paste(i, "~ muttype")), data = ados.sim)
    }
    

    This also saves your results in a list and avoids overwriting your results as ados.mtp in every iteration, which I think is not what you intended to do.

    Note the following:

    for(i in names(ados.sim[,2:4])){  
       print(i)
    }
    [1] "ados.tsc"
    [1] "ados.sa"
    [1] "ados.rrb"
    

    That is, i already gives you the name of the column. The problem in your code was that you tried to use it like an integer for subsetting, which turned the outcome into NA.

    for(i in names(ados.sim[,2:4])){  
       print(paste((colnames(ados.sim)[i]), "~ muttype"))
    }
    [1] "NA ~ muttype"
    [1] "NA ~ muttype"
    [1] "NA ~ muttype"
    

    And just for reference, all of this could also be done in the following two ways that I often prefer since it makes subsequent analysis slightly easier:

    First, store all test objects in a dataframe:

    library(tidyr)
    df <- ados.sim %>% gather(key, value, -muttype) %>% 
          group_by(key) %>% 
          do(test = kruskal.test(x= .$value, g = .$muttype))
    

    You can then subset the dataframe to get the test outcomes:

    df[df$key == "ados.rrb",]$test
    [[1]]
    
        Kruskal-Wallis rank sum test
    
    data:  .$value and .$muttype
    Kruskal-Wallis chi-squared = 2.2205, df = 2, p-value = 0.3295
    

    Alternatively, get all results directly in a dataframe, without storing the test objects:

    library(broom)
    df2 <- ados.sim %>% gather(key, value, -muttype) %>% 
           group_by(key) %>% 
           do(tidy(kruskal.test(x= .$value, g = .$muttype)))
    df2
    # A tibble: 3 x 5
    # Groups:   key [3]
           key statistic   p.value parameter                       method
         <chr>     <dbl>     <dbl>     <int>                       <fctr>
    1 ados.rrb 2.2205031 0.3294761         2 Kruskal-Wallis rank sum test
    2  ados.sa 0.1319554 0.9361517         2 Kruskal-Wallis rank sum test
    3 ados.tsc 0.3618102 0.8345146         2 Kruskal-Wallis rank sum test