I'm running a 2-way ANOVA to compare how two variables affect the price of a service. However, when I run post-hoc pairwise comparisons I would like to only see rows that are comparable (i.e. when one of the independent variables matches). Here is an example to show the issue:
Variable 1 - Species: tuna, halibut, salmon, flounder
Variable 2 - Sex: Male, female
Dependent variable: mass
Here is what I'm using for 2-way ANOVA and pairwise comparisons:
> dat <- read.csv(<file location>)
> aov2 <- aov(mass ~ species * sex, data = dat)
> TukeyHSD(aov2, which = "species:sex")
When I run the pairwise comparisons of mass using TukeyHSD, I would only like to see ones such as Tuna:Male - Tuna:Female or Halibut:Male - Flounder:Male, but I would not like to see Flounder:Female - Halibut:Male. Basically, I just want to see comparisons where one of the variables has the same value.
Honestly am not even sure what to try, or if this is even possible. I had some NA values for some of the comparisons and I tried using:
TukeyHSD(aov2, which = "species:sex") %>% filter(is.na(diff))
but I get the following:
Error in UseMethod("filter") :
no applicable method for 'filter' applied to an object of class "c('TukeyHSD', 'multicomp')"
If this is not possible, is there a way to filter data from a dataframe before it goes into 1-way ANOVA, for example by a certain value in one of the columns? Then I can try to run multiple 1-way ANOVAs to accomplish a similar goal. I would rather not have to create separate data files because the actual dataset I'm working with is very large and would be time-consuming to manually separate.
This is not possible with the current TukeyHSD
function. However, with some modifications of this function, you can get the desired output. Below is the modified function, which adds a new argument called "match", the default being FALSE:
TukeyHSD2 <- function (x, which = seq_along(tabs), ordered = FALSE, conf.level = 0.95,
match=FALSE, ...)
{
mm <- model.tables(x, "means")
if (is.null(mm$n))
stop("no factors in the fitted model")
tabs <- mm$tables
if (names(tabs)[1L] == "Grand mean")
tabs <- tabs[-1L]
tabs <- tabs[which]
nn <- mm$n[names(tabs)]
nn_na <- is.na(nn)
if (all(nn_na))
stop("'which' specified no factors")
if (any(nn_na)) {
warning("'which' specified some non-factors which will be dropped")
tabs <- tabs[!nn_na]
nn <- nn[!nn_na]
}
out <- setNames(vector("list", length(tabs)), names(tabs))
MSE <- sum(x$residuals^2)/x$df.residual
for (nm in names(tabs)) {
tab <- tabs[[nm]]
means <- as.vector(tab)
nms <- if (length(dim(tab)) > 1L) {
dn <- dimnames(tab)
apply(do.call("expand.grid", dn), 1L, paste, collapse = ":")
}
else names(tab)
n <- nn[[nm]]
if (length(n) < length(means))
n <- rep.int(n, length(means))
if (as.logical(ordered)) {
ord <- order(means)
means <- means[ord]
n <- n[ord]
if (!is.null(nms))
nms <- nms[ord]
}
center <- outer(means, means, `-`)
keep <- lower.tri(center)
dnames <- list(outer(nms, nms, paste, sep = "-"), c("diff", "lwr", "upr", "p adj"))
dk1 <- dnames[[1L]][keep]
if(match) {
keep2 <- sapply(1:length(dk), function(x) {
pattern <- '(.+):(.+)-(.+):(.+)'
d1 <- sub(pattern, '\\1', dk1[x])
d2 <- sub(pattern, '\\2', dk1[x])
d3 <- sub(pattern, '\\3', dk1[x])
d4 <- sub(pattern, '\\4', dk1[x])
d1==d3 | d2==d4 } )
dnames[[1L]] <- dk1[keep2]
} else dnames[[1L]] <- dk1
center <- center[keep]
width <- qtukey(conf.level, length(means), x$df.residual) *
sqrt((MSE/2) * outer(1/n, 1/n, `+`))[keep]
est <- center/(sqrt((MSE/2) * outer(1/n, 1/n, `+`))[keep])
if(match) {
center <- center[keep2]
width <- width[keep2]
est <- est[keep2]
}
pvals <- ptukey(abs(est), length(means), x$df.residual,
lower.tail = FALSE)
out[[nm]] <- array(c(center, center - width, center +
width, pvals), c(length(width), 4L), dnames)
}
class(out) <- c("TukeyHSD", "multicomp")
attr(out, "orig.call") <- x$call
attr(out, "conf.level") <- conf.level
attr(out, "ordered") <- ordered
out
}
Send the above function to your R console and then test it on your data. Here is the output using the "warpbreaks" example dataset.
fm1 <- aov(breaks ~ wool * tension, data = warpbreaks)
> TukeyHSD(fm1, which="wool:tension")
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = breaks ~ wool * tension, data = warpbreaks)
$`wool:tension`
diff lwr upr p adj
B:L-A:L -16.3333333 -31.63966 -1.027012 0.0302143
A:M-A:L -20.5555556 -35.86188 -5.249234 0.0029580
B:M-A:L -15.7777778 -31.08410 -0.471456 0.0398172
A:H-A:L -20.0000000 -35.30632 -4.693678 0.0040955
B:H-A:L -25.7777778 -41.08410 -10.471456 0.0001136
A:M-B:L -4.2222222 -19.52854 11.084100 0.9626541
B:M-B:L 0.5555556 -14.75077 15.861877 0.9999978
A:H-B:L -3.6666667 -18.97299 11.639655 0.9797123
B:H-B:L -9.4444444 -24.75077 5.861877 0.4560950
B:M-A:M 4.7777778 -10.52854 20.084100 0.9377205
A:H-A:M 0.5555556 -14.75077 15.861877 0.9999978
B:H-A:M -5.2222222 -20.52854 10.084100 0.9114780
A:H-B:M -4.2222222 -19.52854 11.084100 0.9626541
B:H-B:M -10.0000000 -25.30632 5.306322 0.3918767
B:H-A:H -5.7777778 -21.08410 9.528544 0.8705572
And here is the output from the modified version with "match" set to TRUE.
> TukeyHSD2(fm1, which="wool:tension", match=TRUE)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = breaks ~ wool * tension, data = warpbreaks)
$`wool:tension`
diff lwr upr p adj
B:L-A:L -16.3333333 -31.63966 -1.027012 0.0302143
A:M-A:L -20.5555556 -35.86188 -5.249234 0.0029580
A:H-A:L -20.0000000 -35.30632 -4.693678 0.0040955
B:M-B:L 0.5555556 -14.75077 15.861877 0.9999978
B:H-B:L -9.4444444 -24.75077 5.861877 0.4560950
B:M-A:M 4.7777778 -10.52854 20.084100 0.9377205
A:H-A:M 0.5555556 -14.75077 15.861877 0.9999978
B:H-B:M -10.0000000 -25.30632 5.306322 0.3918767
B:H-A:H -5.7777778 -21.08410 9.528544 0.8705572