r dataframe for-loop lapply kruskal-wallis

Kruskal-Wallis test in R gives an error: Error in model.frame.default: variable lengths differ


I am trying to run Kruskal-Wallis tests on multiple columns of my example dataframe (df) in R, but I am stuck with the following error:

 Error in model.frame.default(formula = as.numeric(x) ~ as.factor(Groups),  : 
  variable lengths differ (found for 'as.factor(Groups)') 

Here is my example dataframe (df):

Groups  Gene1   Gene2   Gene3   Gene4   Gene5   Gene6   Gene7   Gene8   Gene9   Gene10
Group1  120.67  69.33   1.24    2.31    0.39    6.57    2.49    383.84  415.23  NA
Group1  157 110.67  0.4 0.84    0.28    2.62    2.11    245.42  325.23  NA
Group1  113.5   66.75   1.07    4.53    0.33    2.37    2.35    421.25  352.03  73.51
Group1  131 79.67   1.13    5.03    0.72    3.36    2.24    305.32  432.81  71.11
Group1  120 79.67   0.91    3.84    0.74    3.77    1.92    298.91  382.43  66.49
Group2  125.67  83.67   2.07    1.73    0.38    3.89    2.09    233.81  377.21  72.1
Group2  103.33  68.67   1.01    4.89    0.3 4.5 1.75    231.5   381.73  53
Group2  121.33  74.67   0.54    2.39    3.95    3.7 2.46    310.66  355.97  143.61
Group2  136 83.67   1.6 1.75    0.32    5.17    2.36    410.21  389.62  170.34
Group2  143.67  71.33   0.56    1.22    0.26    4.48    2.62    294.01  491.57  96.72
Group2  134.67  69.67   0.85    1.77    0.45    3.58    2.44    236.61  441.32  69.06
Group2  158.33  98.33   0.87    3.69    0.51    2.53    2.6 257.66  396.96  41.94
Group2  147.33  88.33   NA  NA  NA  NA  NA  NA  NA  NA
Group2  95.67   59  1.39    0.56    0.31    2.49    2.09    395.38  420.28  64.83
Group3  135 82  13.31   24.05   1.21    3.83    2.83    313.71  327.84  66.8
Group3  124.67  78  1.12    2   0.71    3.77    2.42    334.36  358.9   131.35
Group3  152 98.33   1.11    1.54    0.35    2.11    2.21    297.68  433.48  117.18
Group3  135.33  73.67   0.13    2.99    0.3 2.4 1.86    296.82  415.13  112.97
Group3  135.33  87  0.91    3.73    0.65    2.92    1.85    335.31  412.16  103.18
Group4  124.67  77.67   0.28    0.81    0.49    2.62    1.96    251.49  468.19  80.27
Group4  125.67  72.33   1.01    1.82    0.35    3.65    1.62    335.18  264.74  145.15
Group4  169 105 0.6 3.12    0.29    3.9 2.22    311.01  459.85  82.89
Group4  123.67  76.33   0.65    1.78    0.47    2.77    1.57    253.56  283.38  59.07
Group5  132.67  76.33   2.94    17.01   0.27    3.99    2.55    354.78  493.02  145.36
Group5  NA  NA  1.34    1.42    0.4 4.21    2.02    243.26  345.2   43.91
Group5  144.33  75  NA  NA  0.55    3.26    2.85    312.16  419.86  55.71
Group5  136.25  78.25   NA  1.32    0.65    3.63    1.52    267.13  256.18  53.49
Group5  123.67  69.33   1.81    1.52    0.67    3.89    2   303.89  346.57  112.16
Group5  116.67  66.33   0.7 1.68    0.27    3.55    2.16    284.96  407.04  102.97
Group5  136.67  76  2.68    4.3 0.33    7.36    2.26    237.28  423.29  88.65
Group6  122 63.33   0.87    4.2 0.17    3.92    2.11    159.04  300.24  60.13
Group6  130.67  82.67   0.8 1.85    1   5.26    2.46    388.61  558.51  66.76
Group6  136.33  70.33   0.54    2.26    0.35    NA  NA  388.81  551.69  113.39
Group6  127.33  73  1.32    2.19    0.99    4.42    2.59    378.57  501.12  85.56
Group7  186.67  89.67   0.79    1.77    0.53    5.22    2.73    269.87  490.25  77.74
Group7  203 93  5.63    22.08   0.82    6.97    2.92    341.87  611.33  92.7
Group7  127 72.67   0.55    1.07    0.38    3.2 1.69    310.9   410.19  65.62
Group7  142 79.67   1.61    1.35    3.24    3.73    2.08    304.52  495.79  60.15

Here is my code:

   kw.tests <- lapply(
         data[, -1],
         function(x) { kruskal.test(as.numeric(x) ~ as.factor(Groups), data = data_test, na.action=na.omit) }
   )

     Error in model.frame.default(formula = as.numeric(x) ~ as.factor(Groups),  : 
      variable lengths differ (found for 'as.factor(Groups)') 

The test runs perfectly when I run it for each gene individually, for example for Gene1:

kruskal.test(Gene1 ~ as.factor(Groups), data = data_test, na.action=na.omit)

    Kruskal-Wallis rank sum test

data:  Gene1 by as.factor(Groups)
Kruskal-Wallis chi-squared = 5.6607, df = 6, p-value = 0.4622

However, it gives me this error when I use lapply or even a for loop. I have already searched for this error several times, but none of the explanations I found seems to apply:

  1. I read that it could be caused by the NAs in the data. However, I cannot avoid NAs because my real dataframe is much larger than this one. Besides, the test runs perfectly for each gene separately, without lapply or loops, even though NAs are present.
  2. The 'Groups' variable has the same length as all the other variables, so this is not the issue either.

Here is a snippet of my data:

> dput(data_test)
structure(list(Groups = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L), .Label = c("Group1", 
"Group2", "Group3", "Group4", "Group5", "Group6", "Group7"), class = "factor"), 
    Gene1 = c(120.67, 157, 113.5, 131, 120, 125.67, 103.33, 121.33, 
    136, 143.67, 134.67, 158.33, 147.33, 95.67, 135, 124.67, 
    152, 135.33, 135.33, 124.67, 125.67, 169, 123.67, 132.67, 
    NA, 144.33, 136.25, 123.67, 116.67, 136.67, 122, 130.67, 
    136.33, 127.33, 186.67, 203, 127, 142), Gene2 = c(69.33, 
    110.67, 66.75, 79.67, 79.67, 83.67, 68.67, 74.67, 83.67, 
    71.33, 69.67, 98.33, 88.33, 59, 82, 78, 98.33, 73.67, 87, 
    77.67, 72.33, 105, 76.33, 76.33, NA, 75, 78.25, 69.33, 66.33, 
    76, 63.33, 82.67, 70.33, 73, 89.67, 93, 72.67, 79.67), Gene3 = c(1.24, 
    0.4, 1.07, 1.13, 0.91, 2.07, 1.01, 0.54, 1.6, 0.56, 0.85, 
    0.87, NA, 1.39, 13.31, 1.12, 1.11, 0.13, 0.91, 0.28, 1.01, 
    0.6, 0.65, 2.94, 1.34, NA, NA, 1.81, 0.7, 2.68, 0.87, 0.8, 
    0.54, 1.32, 0.79, 5.63, 0.55, 1.61), Gene4 = c(2.31, 0.84, 
    4.53, 5.03, 3.84, 1.73, 4.89, 2.39, 1.75, 1.22, 1.77, 3.69, 
    NA, 0.56, 24.05, 2, 1.54, 2.99, 3.73, 0.81, 1.82, 3.12, 1.78, 
    17.01, 1.42, NA, 1.32, 1.52, 1.68, 4.3, 4.2, 1.85, 2.26, 
    2.19, 1.77, 22.08, 1.07, 1.35), Gene5 = c(0.39, 0.28, 0.33, 
    0.72, 0.74, 0.38, 0.3, 3.95, 0.32, 0.26, 0.45, 0.51, NA, 
    0.31, 1.21, 0.71, 0.35, 0.3, 0.65, 0.49, 0.35, 0.29, 0.47, 
    0.27, 0.4, 0.55, 0.65, 0.67, 0.27, 0.33, 0.17, 1, 0.35, 0.99, 
    0.53, 0.82, 0.38, 3.24), Gene6 = c(6.57, 2.62, 2.37, 3.36, 
    3.77, 3.89, 4.5, 3.7, 5.17, 4.48, 3.58, 2.53, NA, 2.49, 3.83, 
    3.77, 2.11, 2.4, 2.92, 2.62, 3.65, 3.9, 2.77, 3.99, 4.21, 
    3.26, 3.63, 3.89, 3.55, 7.36, 3.92, 5.26, NA, 4.42, 5.22, 
    6.97, 3.2, 3.73), Gene7 = c(2.49, 2.11, 2.35, 2.24, 1.92, 
    2.09, 1.75, 2.46, 2.36, 2.62, 2.44, 2.6, NA, 2.09, 2.83, 
    2.42, 2.21, 1.86, 1.85, 1.96, 1.62, 2.22, 1.57, 2.55, 2.02, 
    2.85, 1.52, 2, 2.16, 2.26, 2.11, 2.46, NA, 2.59, 2.73, 2.92, 
    1.69, 2.08), Gene8 = c(383.84, 245.42, 421.25, 305.32, 298.91, 
    233.81, 231.5, 310.66, 410.21, 294.01, 236.61, 257.66, NA, 
    395.38, 313.71, 334.36, 297.68, 296.82, 335.31, 251.49, 335.18, 
    311.01, 253.56, 354.78, 243.26, 312.16, 267.13, 303.89, 284.96, 
    237.28, 159.04, 388.61, 388.81, 378.57, 269.87, 341.87, 310.9, 
    304.52), Gene9 = c(415.23, 325.23, 352.03, 432.81, 382.43, 
    377.21, 381.73, 355.97, 389.62, 491.57, 441.32, 396.96, NA, 
    420.28, 327.84, 358.9, 433.48, 415.13, 412.16, 468.19, 264.74, 
    459.85, 283.38, 493.02, 345.2, 419.86, 256.18, 346.57, 407.04, 
    423.29, 300.24, 558.51, 551.69, 501.12, 490.25, 611.33, 410.19, 
    495.79), Gene10 = c(NA, NA, 73.51, 71.11, 66.49, 72.1, 53, 
    143.61, 170.34, 96.72, 69.06, 41.94, NA, 64.83, 66.8, 131.35, 
    117.18, 112.97, 103.18, 80.27, 145.15, 82.89, 59.07, 145.36, 
    43.91, 55.71, 53.49, 112.16, 102.97, 88.65, 60.13, 66.76, 
    113.39, 85.56, 77.74, 92.7, 65.62, 60.15)), class = "data.frame", row.names = c(NA, 
-38L))

Any further help would be appreciated. Thank you.


Solution

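  • You used the wrong dataset name in your lapply call: it iterates over the columns of data, while the formula looks up Groups in data_test. The two objects evidently have different numbers of rows, so x and Groups differ in length. Take both from the same data frame:

    apply(data_test[, -1], 2, function(x) { kruskal.test(as.numeric(x) ~ as.factor(data_test$Groups)) })

    This works for me.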
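
If you prefer to keep lapply, the same fix can be sketched as below. This is only a minimal illustration: the small data_test built here is a made-up stand-in for your real data, and p.values is a name I introduced for the collected results. The point is that the columns being iterated over and Groups both come from the same data frame. Missing values are dropped separately for each gene, because the formula interface uses na.omit by default.

```r
# Toy stand-in for data_test (values are made up for illustration only)
set.seed(1)
data_test <- data.frame(
  Groups = factor(rep(c("Group1", "Group2", "Group3"), each = 5)),
  Gene1  = rnorm(15, mean = 130, sd = 15),
  Gene2  = rnorm(15, mean = 80, sd = 10)
)
data_test$Gene2[3] <- NA  # an NA is removed only for the test on that gene

# Iterate over the gene columns of the SAME data frame that supplies Groups
kw.tests <- lapply(
  data_test[, -1],
  function(x) kruskal.test(x ~ data_test$Groups)
)

# Collect the p-values into a named numeric vector (one entry per gene)
p.values <- sapply(kw.tests, function(res) res$p.value)
print(p.values)
```

The result is a named list with one test object per gene, so kw.tests$Gene1 gives the full output for a single gene. An alternative would be to loop over the column names and build each formula with reformulate(), which keeps the gene name in the printed "data:" line.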