rdplyrstat

statistical test (wilcox or t-test, or any, really) in a "group_by" pipe


I have a dataset looking at water quality for various conditions. Here is a subset of the data (called tempdf1):

    Material Unit  Condition Rep Measurement
1    Bromine  ppm Unfiltered   1         4.0
2  Carbonate  ppm Unfiltered   1        80.0
3    Bromine  ppm Unfiltered   2         6.0
4  Carbonate  ppm Unfiltered   2       120.0
5    Bromine  ppm Unfiltered   3         6.0
6  Carbonate  ppm Unfiltered   3       100.0
7    Bromine  ppm   Filtered   1         0.0
8  Carbonate  ppm   Filtered   1       120.0
9    Bromine  ppm   Filtered   2         0.0
10 Carbonate  ppm   Filtered   2       100.0
11   Bromine  ppm   Filtered   3         0.5
12 Carbonate  ppm   Filtered   3       100.0

I would like to run a statistical test (I was leaning toward a t-test, but since my data are not normally distributed, I was thinking of running a Wilcoxon test). However, no matter what I do, I can't run this successfully.

I'd like to group my tests by Material (Bromine, Carbonate) and then compare the measurements for "Unfiltered" against those that are "Filtered". However, I keep getting errors when I try to run this. I've restructured my data so that the measurements for "Unfiltered" & "Filtered" are in separate columns. Here is an example of how I've restructured the data, and the analysis I've tried to run:

tempdf2 <- tempdf1 %>%
  tidyr::pivot_wider(id_cols=c(Material,Rep),names_from=Condition,values_from=Measurement)
tempdf2 %>%
  dplyr::group_by(Material) %>%
  dplyr::summarize(w=wilcox.test(Filtered~Unfiltered,paired=FALSE)$p.value)

This is the error I receive:

Error in `dplyr::summarize()`:
! Problem while computing `w = wilcox.test(Filtered ~ Unfiltered, paired = FALSE)$p.value`.
ℹ The error occurred in group 2: Material = "Carbonate".
Caused by error in `wilcox.test.formula()`:
! grouping factor must have exactly 2 levels
Run `rlang::last_error()` to see where the error occurred.

I've read through a few articles about running various statistical tests using the "group_by" method first, but haven't been able to follow them. Would someone help me better understand how to run a statistical test (e.g., Wilcoxon) on a table that has been grouped by a specific variable?

Thanks!


Solution

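  • With the formula method, the grouping variable goes on the right-hand side, so use 'Measurement ~ Condition' on the original long-format data (tempdf1) instead of 'Filtered ~ Unfiltered' on the widened data. In the original call the numeric 'Unfiltered' column was treated as the grouping factor, and for Carbonate it has three distinct values (80, 100, 120), hence "grouping factor must have exactly 2 levels". The expected form is specified in ?wilcox.test: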

    formula - a formula of the form lhs ~ rhs where lhs is a numeric variable giving the data values and rhs either 1 for a one-sample or paired test or a factor with two levels giving the corresponding groups. If lhs is of class "Pair" and rhs is 1, a paired test is done
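To see why the original call failed only for Carbonate, it can help to count the distinct values in the column that ended up on the right-hand side of the formula. The following is a minimal sketch, not part of the original answer: it rebuilds tempdf1 without the Unit column (my simplification), repeats the pivot from the question, and uses the made-up names `wide` and `levels_per_group`.

```r
library(dplyr)
library(tidyr)

# Same values as tempdf1 in the question; Unit is omitted (assumption: not needed here)
tempdf1 <- data.frame(
  Material    = rep(c("Bromine", "Carbonate"), times = 6),
  Condition   = rep(c("Unfiltered", "Filtered"), each = 6),
  Rep         = rep(rep(1:3, each = 2), times = 2),
  Measurement = c(4, 80, 6, 120, 6, 100, 0, 120, 0, 100, 0.5, 100)
)

# The reshaping step from the question: one column per Condition
wide <- tempdf1 %>%
  pivot_wider(id_cols = c(Material, Rep),
              names_from = Condition, values_from = Measurement)

# With Filtered ~ Unfiltered, wilcox.test() turns Unfiltered into a factor,
# so each distinct numeric value becomes its own "group"
levels_per_group <- wide %>%
  group_by(Material) %>%
  summarise(n_levels = n_distinct(Unfiltered), .groups = "drop")

print(levels_per_group)
```

Bromine has exactly two distinct Unfiltered values (4 and 6), so wilcox.test() ran for that group, but it was comparing Filtered values grouped by Unfiltered value, which is not the intended comparison. Carbonate has three, which triggers the error.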

    library(dplyr) # reframe() and .by require dplyr >= 1.1.0
    tempdf1 %>%
       reframe(pvalue = wilcox.test(Measurement ~ Condition,
         paired = FALSE)$p.value, .by = Material)
    #    Material    pvalue
    # 1   Bromine 0.0721982
    # 2 Carbonate 0.8136637
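Since the question asks about a group_by() pipe specifically, the same test can also be written with group_by() and summarise(), which should work on dplyr 1.0.0 or later. This is a sketch rather than part of the original answer: the data frame is rebuilt without the Unit column, `res` is a name I chose, and `exact = FALSE` is my addition to request the normal approximation explicitly (the data contain ties, and the p-values shown above are consistent with that approximation).

```r
library(dplyr)

# Same values as tempdf1 in the question; Unit is omitted (assumption: not needed here)
tempdf1 <- data.frame(
  Material    = rep(c("Bromine", "Carbonate"), times = 6),
  Condition   = rep(c("Unfiltered", "Filtered"), each = 6),
  Rep         = rep(rep(1:3, each = 2), times = 2),
  Measurement = c(4, 80, 6, 120, 6, 100, 0, 120, 0, 100, 0.5, 100)
)

# One unpaired Wilcoxon rank-sum test per Material, comparing the two Conditions.
# exact = FALSE is an assumption on my part: it asks for the normal approximation
# directly instead of relying on the fallback when ties are present.
res <- tempdf1 %>%
  group_by(Material) %>%
  summarise(
    pvalue = wilcox.test(Measurement ~ Condition,
                         paired = FALSE, exact = FALSE)$p.value,
    .groups = "drop"
  )

print(res)
```

The key point is the same as above: the test is run on the long-format data, with the two-level Condition column as the grouping variable, and group_by() decides which rows each call sees.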
    

    data

    tempdf1 <- structure(list(Material = c("Bromine", "Carbonate", "Bromine", 
    "Carbonate", "Bromine", "Carbonate", "Bromine", "Carbonate", 
    "Bromine", "Carbonate", "Bromine", "Carbonate"), Unit = c("ppm", 
    "ppm", "ppm", "ppm", "ppm", "ppm", "ppm", "ppm", "ppm", "ppm", 
    "ppm", "ppm"), Condition = c("Unfiltered", "Unfiltered", "Unfiltered", 
    "Unfiltered", "Unfiltered", "Unfiltered", "Filtered", "Filtered", 
    "Filtered", "Filtered", "Filtered", "Filtered"), Rep = c(1L, 
    1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L), Measurement = c(4, 
    80, 6, 120, 6, 100, 0, 120, 0, 100, 0.5, 100)), 
    class = "data.frame", row.names = c("1", 
    "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))