I have a dataset looking at water quality for various conditions. Here is a subset of the data (called tempdf1):
    Material Unit  Condition Rep Measurement
1    Bromine  ppm Unfiltered   1         4.0
2  Carbonate  ppm Unfiltered   1        80.0
3    Bromine  ppm Unfiltered   2         6.0
4  Carbonate  ppm Unfiltered   2       120.0
5    Bromine  ppm Unfiltered   3         6.0
6  Carbonate  ppm Unfiltered   3       100.0
7    Bromine  ppm   Filtered   1         0.0
8  Carbonate  ppm   Filtered   1       120.0
9    Bromine  ppm   Filtered   2         0.0
10 Carbonate  ppm   Filtered   2       100.0
11   Bromine  ppm   Filtered   3         0.5
12 Carbonate  ppm   Filtered   3       100.0
I would like to run a statistical test. I was leaning toward a t-test, but since my data are not normally distributed, I was thinking of running a Wilcoxon test. However, no matter what I do, I can't run this successfully.
I'd like to group my tests by Material (Bromine, Carbonate) and then compare the measurements for "Unfiltered" against those that are "Filtered". However, I keep getting errors when I try to run this. I've restructured my data so that the measurements for "Unfiltered" & "Filtered" are in separate columns. Here is an example of how I've restructured the data, and the analysis I've tried to run:
tempdf2 <- tempdf1 %>%
  tidyr::pivot_wider(id_cols = c(Material, Rep), names_from = Condition, values_from = Measurement)
tempdf2 %>%
  dplyr::group_by(Material) %>%
  dplyr::summarize(w = wilcox.test(Filtered ~ Unfiltered, paired = FALSE)$p.value)
This is the error I receive:
Error in `dplyr::summarize()`:
! Problem while computing `w = wilcox.test(Filtered ~ Unfiltered, paired = FALSE)$p.value`.
ℹ The error occurred in group 2: Material = "Carbonate".
Caused by error in `wilcox.test.formula()`:
! grouping factor must have exactly 2 levels
Run `rlang::last_error()` to see where the error occurred.
I've read through a few articles about running various statistical tests using the "group_by" method first, but haven't been able to follow it. Would someone help me better understand how to run a statistical test (Ex: Wilcoxon) on a table that has been grouped by a specific variable?
Thanks!
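Because this uses the formula method, the right-hand side has to be the grouping variable. So work from the original long-format tempdf1 and use 'Measurement' as the response and 'Condition' as the independent variable. In Filtered ~ Unfiltered, the numeric Unfiltered column ends up in the grouping-factor position, which is what triggers the error. This is specified in ?wilcox.test: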
formula - a formula of the form lhs ~ rhs where lhs is a numeric variable giving the data values and rhs either 1 for a one-sample or paired test or a factor with two levels giving the corresponding groups. If lhs is of class "Pair" and rhs is 1, a paired test is done
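To make the "factor with two levels" requirement concrete, here is a minimal base R sketch that counts how many distinct values the Unfiltered measurements take within each Material, since those values are what the formula in the question passes as the grouping factor. The compact data.frame() construction is my own shorthand for the sample data, and the names unfiltered and n_levels are only illustrative.

```r
# Rebuild the sample data (compact equivalent of tempdf1; my own shorthand).
tempdf1 <- data.frame(
  Material    = rep(c("Bromine", "Carbonate"), times = 6),
  Unit        = "ppm",
  Condition   = rep(c("Unfiltered", "Filtered"), each = 6),
  Rep         = rep(rep(1:3, each = 2), times = 2),
  Measurement = c(4, 80, 6, 120, 6, 100, 0, 120, 0, 100, 0.5, 100)
)

# In Filtered ~ Unfiltered, the Unfiltered values play the role of the
# grouping factor, so count their distinct values within each Material.
unfiltered <- subset(tempdf1, Condition == "Unfiltered")
n_levels <- sapply(
  split(unfiltered$Measurement, unfiltered$Material),
  function(x) length(unique(x))
)
print(n_levels)
# Bromine: 2 distinct values (4, 6); Carbonate: 3 (80, 100, 120)
```

Bromine happens to have exactly two distinct Unfiltered values, so group 1 runs (without testing what was intended), while Carbonate has three, which matches the error being reported for group 2.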
library(dplyr) # reframe() and .by need dplyr >= 1.1.0
tempdf1 %>%
  reframe(pvalue = wilcox.test(Measurement ~ Condition,
                               paired = FALSE)$p.value, .by = Material)
Output:
   Material    pvalue
1   Bromine 0.0721982
2 Carbonate 0.8136637
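If dplyr >= 1.1.0 is not available, the same per-group test can be sketched in base R with split() and sapply(). The block below also shows one way to test a wide layout like the one in the question, by passing the two columns as x and y instead of using a formula. Several things here are my assumptions rather than part of the original code: the compact data.frame() rebuild of the sample data, the use of reshape() as a stand-in for tidyr::pivot_wider(), the names pvals_long and pvals_wide, and exact = FALSE. The data contain ties, so R falls back to the normal approximation either way; asking for it explicitly should only avoid the "cannot compute exact p-value with ties" warning.

```r
# Rebuild the sample data (compact equivalent of tempdf1; my own shorthand).
tempdf1 <- data.frame(
  Material    = rep(c("Bromine", "Carbonate"), times = 6),
  Unit        = "ppm",
  Condition   = rep(c("Unfiltered", "Filtered"), each = 6),
  Rep         = rep(rep(1:3, each = 2), times = 2),
  Measurement = c(4, 80, 6, 120, 6, 100, 0, 120, 0, 100, 0.5, 100)
)

# Long layout: one test per Material, Condition is the two-level factor.
pvals_long <- sapply(
  split(tempdf1, tempdf1$Material),
  function(d) wilcox.test(Measurement ~ Condition, data = d,
                          paired = FALSE, exact = FALSE)$p.value
)
print(pvals_long)

# Wide layout: one column per Condition (base R stand-in for pivot_wider).
long <- tempdf1[, c("Material", "Rep", "Condition", "Measurement")]
tempdf2 <- reshape(long, idvar = c("Material", "Rep"),
                   timevar = "Condition", direction = "wide")
names(tempdf2) <- sub("^Measurement\\.", "", names(tempdf2))

# With two separate columns, pass them as x and y rather than as a formula.
pvals_wide <- sapply(
  split(tempdf2, tempdf2$Material),
  function(d) wilcox.test(d$Filtered, d$Unfiltered,
                          paired = FALSE, exact = FALSE)$p.value
)
print(pvals_wide)
```

Both versions should reproduce the p-values shown above, since they run the same unpaired two-sample test on the same values.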
Data:
tempdf1 <- structure(list(
  Material = c("Bromine", "Carbonate", "Bromine", "Carbonate",
               "Bromine", "Carbonate", "Bromine", "Carbonate",
               "Bromine", "Carbonate", "Bromine", "Carbonate"),
  Unit = c("ppm", "ppm", "ppm", "ppm", "ppm", "ppm",
           "ppm", "ppm", "ppm", "ppm", "ppm", "ppm"),
  Condition = c("Unfiltered", "Unfiltered", "Unfiltered",
                "Unfiltered", "Unfiltered", "Unfiltered",
                "Filtered", "Filtered", "Filtered",
                "Filtered", "Filtered", "Filtered"),
  Rep = c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L),
  Measurement = c(4, 80, 6, 120, 6, 100, 0, 120, 0, 100, 0.5, 100)),
  class = "data.frame",
  row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))