factorC <- paste(factorA, factorB)
Is factorA*factorB
equivalent to factorA+factorB+factorC
in the formula of a linear mixed effects model?
I use the function ancombc2()
from the package ANCOMBC
. ANCOMBC
is a package used for differential abundance testing with species frequency data. For details see here: https://www.nature.com/articles/s41592-023-02092-7 However, I hope we do not have to deal with the specificities of this package for this question, since ANCOMBC
is using lmer()
from lmerTest
/ lme4
.
My question is about the manual inclusion of an interaction term. ANCOMBC
does not support interaction terms. However, the authors write the following:
Q: Can the ancombc or ancombc2 function handle interaction terms in the analysis?
A: Unfortunately, the inclusion of interaction terms in the fix_formula argument of ancombc or ancombc2 can lead to complexities and potential confusion in the multi-group comparisons. To address this, it is recommended to manually create the interaction term of interest outside of the formula and perform the analysis accordingly.
By manually creating the interaction term, you can ensure that the analysis accurately captures the interaction effect between variables. Once the interaction term is created, you can include it in the fix_formula argument or any other relevant part of the analysis, depending on your specific research question and design.
lmer()
I use lmer()
from lmeTest
for testing. I will show how I interpreted the authors explanation using the following example data:
In an experiment observations were made for samples from two treatment groups and from two age groups. Additionally, the experiment was repeated in two years. Thus, the experiment includes the following factors:
Here is a dummy dataset:
# create example data
set.seed(123)
dat_year1 <- data.frame(year = factor(1),
treatment = c(rep("control", 20),
rep("treated", 20)),
age = c(rep("old", 10),
rep("new", 10),
rep("old", 10),
rep("new", 10)),
value = c(rnorm(n = 10, mean = 10, sd = 2),
rnorm(n = 10, mean = 10, sd = 2),
rnorm(n = 10, mean = 15, sd = 2),
rnorm(n = 10, mean = 16, sd = 2)))
set.seed(321)
dat_year2 <- data.frame(year = factor(2),
treatment = c(rep("control", 20),
rep("treated", 20)),
age = c(rep("old", 10),
rep("new", 10),
rep("old", 10),
rep("new", 10)),
value = c(rnorm(n = 10, mean = 10, sd = 2),
rnorm(n = 10, mean = 10, sd = 2),
rnorm(n = 10, mean = 15, sd = 2),
rnorm(n = 10, mean = 18, sd = 2)))
dat <- rbind(dat_year1, dat_year2)
And a visualization of it:
I am interested in testing the interaction between "treatment" and "age". With lmer()
I would do that as follows:
# run lmer()
test <- lmerTest::lmer(value~treatment*age+(1|year), data = dat)
summary(test)
Returning:
Linear mixed model fit by REML. t-tests use
Satterthwaite's method [lmerModLmerTest]
Formula: value ~ treatment * age + (1 | year)
Data: dat
REML criterion at convergence: 322.2
Scaled residuals:
Min 1Q Median 3Q Max
-2.41222 -0.69075 0.07493 0.44019 2.22631
Random effects:
Groups Name Variance Std.Dev.
year (Intercept) 0.05503 0.2346
Residual 3.44757 1.8568
Number of obs: 80, groups: year, 2
Fixed effects:
Estimate Std. Error df t value
(Intercept) 10.6492 0.4471 7.6713 23.819
treatmenttreated 6.6717 0.5872 75.0000 11.363
ageold -0.4259 0.5872 75.0000 -0.725
treatmenttreated:ageold -2.6641 0.8304 75.0000 -3.208
Pr(>|t|)
(Intercept) 1.81e-08 ***
treatmenttreated < 2e-16 ***
ageold 0.47044
treatmenttreated:ageold 0.00196 **
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) trtmnt ageold
tretmnttrtd -0.657
ageold -0.657 0.500
trtmnttrtd: 0.464 -0.707 -0.707
I interpreted the recommendation to create the interaction term as meaning that a new factor should be manually created that combines the levels of the factors for which an interaction is to be tested:
dat$treatment_age <- paste(dat$treatment, "_", dat$age, sep = "")
Now I included this in the model formula as another fixed factor:
test_manual <- lmerTest::lmer(value~treatment+age+treatment_age+(1|year),
data = dat)
summary(test_manual)
Returning:
Linear mixed model fit by REML. t-tests use
Satterthwaite's method [lmerModLmerTest]
Formula:
value ~ treatment + age + treatment_age + (1 | year)
Data: dat
REML criterion at convergence: 322.2
Scaled residuals:
Min 1Q Median 3Q Max
-2.41222 -0.69075 0.07493 0.44019 2.22631
Random effects:
Groups Name Variance Std.Dev.
year (Intercept) 0.05503 0.2346
Residual 3.44757 1.8568
Number of obs: 80, groups: year, 2
Fixed effects:
Estimate Std. Error df
(Intercept) 10.6492 0.4471 7.6713
treatmenttreated 6.6717 0.5872 75.0000
ageold -3.0900 0.5872 75.0000
treatment_agecontrol_old 2.6641 0.8304 75.0000
t value Pr(>|t|)
(Intercept) 23.819 1.81e-08 ***
treatmenttreated 11.363 < 2e-16 ***
ageold -5.263 1.30e-06 ***
treatment_agecontrol_old 3.208 0.00196 **
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) trtmnt ageold
tretmnttrtd -0.657
ageold 0.000 -0.500
trtmnt_gcn_ -0.464 0.707 -0.707
fit warnings:
fixed-effect model matrix is rank deficient so dropping 2 columns / coefficients
For "treatment" all statistics are identical, for "age" they are much different, and for "treatment:age" the estimate and t-value are now negative. Additionally, there is a fit warning "fixed-effect model matrix is rank deficient so dropping 2 columns / coefficients".
How can I improve this to correctly include the interaction term manually?
ancombc2
When I try to use ancombc2()
with the manually included interaction term it does not execute but returns an error:
library(ANCOMBC)
library(phyloseq)
ancombc2(data = phyloseq_object,
tax_level = NULL,
fix_formula = "treatment+age+treatment_age",
verbose = TRUE)
Returning:
Obtaining initial estimates ...
Error: Estimation failed for the following covariates:
treated_old, treated_new
Please ensure that these covariates do not have missing values and check for multicollinearity before re-estimating the model
In your short version, factorC
on it's own is equivalent to factorA*factorB
- you don't need to include factorA+factorB
. That's why you get the rank deficiency warning in your longer version, two of the levels in treatment_age
are already covered by the treatment
and age
main effects.