Apologies if I'm missing anything obvious, but I have a very simple dataset (just to illustrate the problem I'm encountering): 15 participants, 3 conditions; each participant experienced all 3 conditions and produced a binary response in each. I want to know whether there is a significant difference in the responses across the 3 conditions.
Here is what the example data look like:
ID,Condition,Outcome
1,A,FALSE
1,B,TRUE
1,C,FALSE
2,A,FALSE
2,B,TRUE
2,C,FALSE
3,A,FALSE
3,B,TRUE
3,C,TRUE
4,A,FALSE
4,B,TRUE
4,C,TRUE
5,A,FALSE
5,B,TRUE
5,C,FALSE
6,A,FALSE
6,B,TRUE
6,C,TRUE
7,A,FALSE
7,B,TRUE
7,C,TRUE
8,A,FALSE
8,B,TRUE
8,C,TRUE
9,A,FALSE
9,B,TRUE
9,C,FALSE
10,A,FALSE
10,B,TRUE
10,C,FALSE
11,A,FALSE
11,B,TRUE
11,C,TRUE
12,A,FALSE
12,B,TRUE
12,C,TRUE
13,A,FALSE
13,B,TRUE
13,C,FALSE
14,A,FALSE
14,B,TRUE
14,C,TRUE
15,A,FALSE
15,B,TRUE
15,C,TRUE
You can clearly see that condition A is always associated with a FALSE outcome and condition B is always associated with a TRUE outcome, so we know there should be a significant difference even before we do any formal testing.
For formal testing we need something like a repeated-measures ANOVA for binomial data, and from what I have found the best option is a mixed-effects logistic regression with condition as a fixed effect and participant ID as a random effect. So I did this in R:
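The "clearly see" claim can be checked with a quick cross-tabulation. This is a self-contained sketch that rebuilds the dataset inline (the `data` name and the inline construction are assumptions for illustration, not part of the original code):

```r
# Rebuild the example data inline so this snippet is self-contained
c_out <- c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE,
           FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)  # condition C outcomes, IDs 1-15
data <- data.frame(
  ID        = rep(1:15, each = 3),
  Condition = rep(c("A", "B", "C"), times = 15),
  # interleave A (all FALSE), B (all TRUE), C (mixed) per participant
  Outcome   = as.vector(rbind(rep(FALSE, 15), rep(TRUE, 15), c_out))
)

# Outcome is perfectly determined by Condition for A and B
table(data$Condition, data$Outcome)
#     FALSE TRUE
#   A    15    0
#   B     0   15
#   C     6    9
```

The zero cells in rows A and B are exactly the pattern that causes trouble for maximum-likelihood logistic regression, as seen below.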
model <- glmer(Outcome ~ Condition + (1 | ID), data = data, family = binomial)
summary(model)
and I got this result:
> model <- glmer(Outcome ~ Condition + (1 | ID), data = data, family = binomial)
boundary (singular) fit: see help('isSingular')
> summary(model)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: Outcome ~ Condition + (1 | ID)
Data: data
AIC BIC logLik deviance df.resid
28.2 35.4 -10.1 20.2 41
Scaled residuals:
Min 1Q Median 3Q Max
-1.2247 0.0000 0.0000 0.0000 0.8165
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 0 0
Number of obs: 45, groups: ID, 15
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.095e+01 2.122e+07 0 1
ConditionB 1.153e+02 2.740e+07 0 1
ConditionC 4.135e+01 2.122e+07 0 1
Correlation of Fixed Effects:
(Intr) CndtnB
ConditionB -0.775
ConditionC -1.000 0.775
optimizer (Nelder_Mead) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')
Warning messages:
1: In vcov.merMod(object, use.hessian = use.hessian) :
variance-covariance matrix computed from finite-difference Hessian is
not positive definite or contains NA values: falling back to var-cov estimated from RX
2: In vcov.merMod(object, correlation = correlation, sigm = sig) :
variance-covariance matrix computed from finite-difference Hessian is
not positive definite or contains NA values: falling back to var-cov estimated from RX
I'm not exactly sure why I'm getting this or how to interpret it, but it seems the Hessian is singular: the model has converged, yet it does not seem useful at all (enormous standard errors, p-values of 1, and no significance shown). With such an extreme difference between conditions I was expecting a huge, easily detected effect.
I would really appreciate any suggestions on how to fix this, whether there are more suitable tests for this situation, or pointers to what I have done wrong!
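One way to see that the random effect is not the culprit: the same thing happens with an ordinary logistic regression. Because conditions A and B perfectly predict the outcome (complete separation), the maximum-likelihood estimates diverge, which is why the coefficients and standard errors blow up. A minimal base-R sketch, with the data rebuilt inline (the variable names here are assumptions for illustration):

```r
# Rebuild the example data inline so this snippet is self-contained
c_out <- c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE,
           FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)  # condition C outcomes, IDs 1-15
data <- data.frame(
  Condition = rep(c("A", "B", "C"), times = 15),
  Outcome   = as.vector(rbind(rep(FALSE, 15), rep(TRUE, 15), c_out))
)

# Plain glm, no random effect at all: R warns
# "fitted probabilities numerically 0 or 1 occurred"
fit <- glm(Outcome ~ Condition, data = data, family = binomial)
coef(fit)  # ConditionB estimate is enormous -- the MLE does not exist
```

So the singular fit and the huge, useless Wald statistics are a symptom of separation in the data, not a coding mistake in the `glmer()` call.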
I made a suggestion about using a Bayesian model in the comments. You could use brm() from the brms package. The model is specified similarly to the glmer() model; the addition is a prior on the coefficients, which you can specify as normal(0, 2.5). This is a bit more informative than we might normally use, but it will help prevent the coefficients from getting too big. Here's an example:
dat <- read.table(textConnection("ID,Condition,Outcome
1,A,FALSE
1,B,TRUE
1,C,FALSE
2,A,FALSE
2,B,TRUE
2,C,FALSE
3,A,FALSE
3,B,TRUE
3,C,TRUE
4,A,FALSE
4,B,TRUE
4,C,TRUE
5,A,FALSE
5,B,TRUE
5,C,FALSE
6,A,FALSE
6,B,TRUE
6,C,TRUE
7,A,FALSE
7,B,TRUE
7,C,TRUE
8,A,FALSE
8,B,TRUE
8,C,TRUE
9,A,FALSE
9,B,TRUE
9,C,FALSE
10,A,FALSE
10,B,TRUE
10,C,FALSE
11,A,FALSE
11,B,TRUE
11,C,TRUE
12,A,FALSE
12,B,TRUE
12,C,TRUE
13,A,FALSE
13,B,TRUE
13,C,FALSE
14,A,FALSE
14,B,TRUE
14,C,TRUE
15,A,FALSE
15,B,TRUE
15,C,TRUE"), sep=",", header=TRUE)
dat$Outcome <- as.numeric(dat$Outcome)
library(brms)
bp <- prior(normal(0, 2.5), class="b")
model <- brm(Outcome ~ Condition + (1 | ID),
data = dat,
family = bernoulli,
backend="cmdstanr",
prior=bp)
summary(model)
#> Family: bernoulli
#> Links: mu = logit
#> Formula: Outcome ~ Condition + (1 | ID)
#> Data: dat (Number of observations: 45)
#> Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#> total post-warmup draws = 4000
#>
#> Group-Level Effects:
#> ~ID (Number of levels: 15)
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sd(Intercept) 0.65 0.52 0.03 1.96 1.00 2107 2275
#>
#> Population-Level Effects:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept -2.71 0.92 -4.78 -1.16 1.00 4155 2366
#> ConditionB 6.00 1.38 3.57 9.01 1.00 3302 2295
#> ConditionC 3.01 1.05 1.11 5.23 1.00 4759 2477
#>
#> Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
Created on 2023-09-12 with reprex v2.0.2