
Kruskal-Wallis p-value matrix for data subsets with R


Consider a dataset Data which has several factor variables and several continuous numerical variables. Some of the factors, say slice_by_1 (with classes "Male", "Female") and slice_by_2 (with classes "Sad", "Neutral", "Happy"), are used to 'slice' the data into subsets. For every subset, a Kruskal-Wallis test should be run on each of the variables length, preasure, and pulse, grouped by another factor variable called compare_by. Is there a quick way in R to accomplish this and collect the calculated p-values in a matrix?

I used the dplyr package to prepare the data.

Sample dataset:

library(dplyr)
set.seed(123)
# as_tibble() replaces tbl_df(), which is deprecated in current dplyr
Data <- as_tibble(
   data.frame(
       slice_by_1 = as.factor(rep(c("Male", "Female"), times = 120)),
       slice_by_2 = as.factor(rep(c("Happy", "Neutral", "Sad"), each = 80)),
       compare_by = as.factor(rep(c("blue", "green", "brown"), times = 80)),
       length   = c(sample(1:10, 120, replace = TRUE), sample(5:12, 120, replace = TRUE)),
       pulse    = runif(240, 60, 120),
       preasure = c(rnorm(80, 1, 2), rnorm(80, 1, 2.1), rnorm(80, 1, 3))
   )
) %>%
   group_by(slice_by_1, slice_by_2)

Let's look at the data:

Source: local data frame [240 x 6]
Groups: slice_by_1, slice_by_2

   slice_by_1 slice_by_2 compare_by length     pulse     preasure
1        Male      Happy       blue     10  69.23376  0.508694601
2      Female      Happy      green      1  68.57866 -1.155632020
3        Male      Happy      brown      8 112.72132  0.007031799
4      Female      Happy       blue      3 116.61283  0.383769524
5        Male      Happy      green      7 110.06851 -0.717791526
6      Female      Happy      brown      8 117.62481  2.938658488
7        Male      Happy       blue      9 105.59749  0.735831389
8      Female      Happy      green      2  83.44101  3.881268679
9        Male      Happy      brown      5 101.48334  0.025572561
10     Female      Happy       blue     10  62.87331 -0.715108893
..        ...        ...        ...    ...       ...          ...

An example of the desired output:

    Data_subsets    length  preasure     pulse
1     Male_Happy <p-value> <p-value> <p-value>
2   Female_Happy <p-value> <p-value> <p-value>
3   Male_Neutral <p-value> <p-value> <p-value>
4 Female_Neutral <p-value> <p-value> <p-value>
5       Male_Sad <p-value> <p-value> <p-value>
6     Female_Sad <p-value> <p-value> <p-value>

Solution

  • We can use Map within do to run kruskal.test on each of the selected columns, and then use unite from library(tidyr) to join the 'slice_by_1' and 'slice_by_2' columns into a single column, 'Data_subsets'.

    library(dplyr)
    library(tidyr)
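    # names of the numeric columns to test: length, pulse, preasure
    nm1 <- names(Data)[4:6]
    # p-value of a Kruskal-Wallis test of x grouped by y
    f1 <- function(x,y) kruskal.test(x~y)$p.value
    
    # Data is already grouped by slice_by_1 and slice_by_2, so do() runs once per subset
    Data %>% 
         do({data.frame(Map(f1, .[nm1], list(.$compare_by)))}) %>% 
         unite(Data_subsets, slice_by_1, slice_by_2, sep="_")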
    #     Data_subsets    length     pulse  preasure
    #1   Female_Happy 0.4369918 0.8767561 0.1937327
    #2 Female_Neutral 0.3750688 0.2858796 0.8588069
    #3     Female_Sad 0.7958502 0.5801208 0.6274940
    #4     Male_Happy 0.3099704 0.3796494 0.6929493
    #5   Male_Neutral 0.4953853 0.2418708 0.2986860
    #6       Male_Sad 0.7159970 0.5686672 0.8528201
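
With dplyr 1.0 or later, the same per-subset computation can probably be written with summarise and across instead of do. The following is only a sketch under that version assumption, not part of the original answer; it rebuilds the sample data so that it runs on its own, and the name res is introduced here for illustration. No output is shown because the simulated values depend on the R version's random number generator.

```r
library(dplyr)
library(tidyr)

# Rebuild the sample data from the question so the sketch is self-contained
set.seed(123)
Data <- data.frame(
  slice_by_1 = factor(rep(c("Male", "Female"), times = 120)),
  slice_by_2 = factor(rep(c("Happy", "Neutral", "Sad"), each = 80)),
  compare_by = factor(rep(c("blue", "green", "brown"), times = 80)),
  length     = c(sample(1:10, 120, replace = TRUE), sample(5:12, 120, replace = TRUE)),
  pulse      = runif(240, 60, 120),
  preasure   = c(rnorm(80, 1, 2), rnorm(80, 1, 2.1), rnorm(80, 1, 3))
)

nm1 <- c("length", "pulse", "preasure")

# res is a hypothetical name: one row per subset, one p-value column per variable
res <- Data %>%
  group_by(slice_by_1, slice_by_2) %>%
  # for each subset, test every column in nm1 against compare_by
  summarise(across(all_of(nm1), ~ kruskal.test(.x, compare_by)$p.value),
            .groups = "drop") %>%
  unite(Data_subsets, slice_by_1, slice_by_2, sep = "_")

print(res)
```

Here kruskal.test is called with its (x, g) interface rather than a formula, which is equivalent for this purpose.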
    

    Or we can do this using data.table. We convert the 'data.frame' to a 'data.table' with setDT(Data), create the grouping variable 'Data_subsets' by pasting the 'slice_by_1' and 'slice_by_2' columns together, then subset the columns of the dataset, pass them as input to Map, run kruskal.test on each, and extract the p.value. The f1 and nm1 objects are the ones defined above.

    library(data.table)    
    setDT(Data)[, Map(f1, .SD[, nm1, with=FALSE], list(compare_by)) ,
                 by = .(Data_subsets= paste(slice_by_1, slice_by_2, sep='_'))]
    #     Data_subsets    length     pulse  preasure
    #1:     Male_Happy 0.3099704 0.3796494 0.6929493
    #2:   Female_Happy 0.4369918 0.8767561 0.1937327
    #3:   Male_Neutral 0.4953853 0.2418708 0.2986860
    #4: Female_Neutral 0.3750688 0.2858796 0.8588069
    #5:       Male_Sad 0.7159970 0.5686672 0.8528201
    #6:     Female_Sad 0.7958502 0.5801208 0.6274940
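
Both approaches above return a data frame or data.table rather than a matrix. Since the question asks for a matrix, here is a minimal base R sketch, assuming no extra packages are wanted. It is an alternative, not the method of the answer above; the names vars, subsets and p_matrix are introduced here for illustration, and the sample data is rebuilt so the block runs on its own.

```r
# Rebuild the sample data from the question so the sketch is self-contained
set.seed(123)
Data <- data.frame(
  slice_by_1 = factor(rep(c("Male", "Female"), times = 120)),
  slice_by_2 = factor(rep(c("Happy", "Neutral", "Sad"), each = 80)),
  compare_by = factor(rep(c("blue", "green", "brown"), times = 80)),
  length     = c(sample(1:10, 120, replace = TRUE), sample(5:12, 120, replace = TRUE)),
  pulse      = runif(240, 60, 120),
  preasure   = c(rnorm(80, 1, 2), rnorm(80, 1, 2.1), rnorm(80, 1, 3))
)

vars <- c("length", "pulse", "preasure")

# One data frame per subset, named like "Male_Happy"
subsets <- split(
  Data,
  interaction(Data$slice_by_1, Data$slice_by_2, sep = "_", drop = TRUE)
)

# Inner sapply: one p-value per variable; outer sapply: one column per subset.
# t() turns the result into a subsets-by-variables matrix.
p_matrix <- t(sapply(subsets, function(d) {
  sapply(vars, function(v) kruskal.test(d[[v]], d$compare_by)$p.value)
}))

print(p_matrix)
```

The result is a numeric matrix with the subset labels as row names and the variable names as column names, so a single cell can be read with, for example, p_matrix["Male_Happy", "pulse"].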