I am attempting to build a nonlinear mixed effects model
for COVID-19 data that fits a bell curve to daily case numbers from different countries (random effects being at the country level).
The data table is too large to include in the post but here is the structure of the dataframe:
> str(dat)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2642 obs. of 7 variables:
$ Country.Region: Factor w/ 181 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ date : Date, format: "2020-03-24" "2020-03-25" "2020-03-26" "2020-03-27" ...
$ day : num 0 1 2 3 4 5 6 7 8 9 ...
$ total_cases : int 74 84 94 110 110 120 170 174 237 273 ...
$ new_cases : int 34 10 10 16 0 10 50 4 63 36 ...
$ total_deaths : int 1 2 4 4 4 4 4 4 4 6 ...
$ new_deaths : int 0 1 2 0 0 0 0 0 0 2 ...
Attempt to fit the model:
library(nlme)
dat <- readRDS("dat.rds")
# Bell curve function defined by three parameters
bellcurve.model <- function(d, mu, var, x) {
f <- d*exp(-((x - mu)^2) / (2*var))
return(f)
}
# NLME Model
baseModel <- nlme(new_cases ~ bellcurve.model(d, mu, var, x = day),
data = dat,
fixed = list(d ~ 1, mu ~ 1, var ~ 1),
random = d + mu + var ~ 1|Country.Region,
start = list(fixed = c(1000, 20, 20)),
na.action = na.omit)
However, this is the resulting error:
Error in nlme.formula(new_cases ~ bellcurve.model(d, mu, var, x = day), : Singularity in backsolve at level 0, block 1
I have read other StackOverflow posts suggesting that certain covariates may be confounded in the model, but the only covariate I am using here to predict new_cases
is day
. Any advice on what this means or tips on debugging would be greatly appreciated, especially any advice on how this may be fixed.
Briefly in overview:
I started tinkering by just blindly changing starting values. Changing c(1000, 20, 20)
to c(20000, 10, 10)
I got the model to run error/warning free as baseModel.naive
with a Log-likelihood: -23451.37
. Using increased maxima for iteration and function calls and acquiring informed starting values enabled the model to improve substantially with baseModel.bellcurve
reporting a Log-likelihood: -20487.86
.
I adjusted the starting values. Also, I showed how to increase the maximum number of functional calls and iterations and use the verbose option. More can be found using ?nlme
and ?nlmeControl
. In my experience, these types of models can be sensitive to starting values and maximum iterations and function calls.
baseModel.naive <- nlme(new_cases ~ bellcurve.model(d, mu, var, x = day),
data = dat,
fixed = list(d ~ 1, mu ~ 1, var ~ 1),
random = d + mu + var ~ 1|Country.Region,
start = list(fixed = c(20000, 10, 10)),
na.action = na.omit,
control=list(maxIter=1e2, msMaxIter = 1e6, msVerbose = TRUE))
baseModel.naive
Output:
> baseModel.naive <- nlme(new_cases ~ bellcurve.model(d, mu, var, x = day),
+ data = dat,
+ fixed = list(d ~ 1, mu ~ 1, var ~ 1),
+ random = d + mu + var ~ 1|Country.Region,
+ start = list(fixed = c(20000, 10, 10)),
+ na.action = na.omit,
+ control=list(maxIter=1e2, msMaxIter = 1e6, msVerbose = TRUE))
0: 30455.774: 2.23045 10.6880 8.80935 -11177.6 3277.46 -1877.96
1: 30450.763: 2.80826 11.2726 9.37889 -11177.6 3277.46 -1877.96
2: 30449.789: 2.94771 11.5283 9.80691 -11177.6 3277.46 -1877.96
3: 30449.150: 3.30454 11.8277 10.0329 -11177.6 3277.46 -1877.96
4: 30448.727: 3.64209 12.2237 10.4571 -11177.6 3277.46 -1877.96
5: 30448.540: 3.97875 12.5637 10.8217 -11177.6 3277.46 -1877.96
6: 30448.445: 4.31754 12.9055 11.1826 -11177.6 3277.46 -1877.96
7: 30448.397: 4.65727 13.2480 11.5420 -11177.6 3277.46 -1877.96
8: 30448.372: 4.99746 13.5908 11.9008 -11177.6 3277.46 -1877.96
9: 30448.360: 5.33789 13.9337 12.2591 -11177.6 3277.46 -1877.96
10: 30448.354: 5.67844 14.2767 12.6173 -11177.6 3277.46 -1877.96
11: 30448.351: 6.01905 14.6197 12.9753 -11177.6 3277.46 -1877.96
12: 30448.349: 6.35969 14.9627 13.3333 -11177.6 3277.46 -1877.96
13: 30448.349: 6.70035 15.3058 13.6913 -11177.6 3277.46 -1877.96
14: 30448.348: 7.04102 15.6489 14.0493 -11177.6 3277.46 -1877.96
15: 30448.348: 7.38170 15.9919 14.4073 -11177.6 3277.46 -1877.96
16: 30448.348: 7.72237 16.3350 14.7652 -11177.6 3277.46 -1877.96
17: 30448.348: 8.06305 16.6781 15.1232 -11177.6 3277.46 -1877.96
18: 30448.348: 8.40373 17.0211 15.4811 -11177.6 3277.46 -1877.96
19: 30448.348: 8.74441 17.3642 15.8391 -11177.6 3277.46 -1877.96
20: 30448.348: 9.08508 17.7073 16.1971 -11177.6 3277.46 -1877.96
21: 30448.348: 9.42576 18.0503 16.5550 -11177.6 3277.46 -1877.96
0: 30108.384: 9.42573 18.0503 16.5550 -11183.6 3279.09 -1879.43
1: 30108.384: 9.42568 18.0503 16.5550 -11183.6 3279.09 -1879.43
2: 30108.384: 9.42544 18.0502 16.5548 -11183.6 3279.09 -1879.43
0: 30108.056: 9.42539 18.0502 16.5548 -11168.0 3239.13 -1879.51
1: 30108.056: 9.42526 18.0502 16.5549 -11168.0 3239.13 -1879.51
0: 30108.055: 9.42523 18.0502 16.5549 -11150.0 3223.70 -1879.28
1: 30108.055: 9.42523 18.0502 16.5549 -11150.0 3223.70 -1879.28
> baseModel.naive
Nonlinear mixed-effects model fit by maximum likelihood
Model: new_cases ~ bellcurve.model(d, mu, var, x = day)
Data: dat
Log-likelihood: -23451.37
Fixed: list(d ~ 1, mu ~ 1, var ~ 1)
d mu var
4290.47415 35.32178 38.44048
Random effects:
Formula: list(d ~ 1, mu ~ 1, var ~ 1)
Level: Country.Region
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
d 1.402353e-01 d mu
mu 2.518250e-05 0
var 1.123208e-04 0 0
Residual 1.738523e+03
Number of Observations: 2641
Number of Groups: 126
But then maybe that isn't enough. See the Corr
with the 0's? Something seems off. So I took a page from Roland ( https://stackoverflow.com/a/27549369/2727349) and decided to get a self starting function that was similar to your bellcurve.model
.
I also tried subsetting the data to Country.Region
s that have a minimum number of days in case that was an issue. I am detailing my results/code here in sections and at the end (in the Appendix) it is all together for a quick copy and paste.
To explore things, I tried limiting the data to countries that had a minimum number of days. I chose 45 days (5 countries) and slowly increased it up to 1 day (full dataset). I used data.table
to calculate and display pertinent things.
library(nlme)
library(nlraa)
library(data.table)
dat <- readRDS("dat.rds")
str(dat)
setDT(dat)
# Bell curve function defined by three parameters
bellcurve.model <- function(d, mu, var, x) {
f <- d*exp(-((x - mu)^2) / (2*var))
return(f)
}
dim(dat)
head(dat)
## how many countries?
dat[,uniqueN(Country.Region)]
## number of days/rows per country.region:
print(dat[,.N,by=Country.Region][order(N)], nrows=200)
dat[,N:=.N, by=Country.Region]
minimum_days <- 45
## model restricted to following countries/regions:
data.frame(Country.Region=(as.character(unique(dat[N>=minimum_days]$Country.Region))))
This is where Roland's answer in a related post comes into play. We need an automated way to get informed starting values, and one way of doing that is using nls()
and a self starting function. You can make a self starting function out of your bellcurve.model
but I found SSbell
that was similar and decided to make use of it. I run nls()
with SSbell
and get starting values for the SSbell
formulation which has ymax
(corresponds to d
of bellcurve.model
) and xc
(corresponds to mu
in bellcurve.model
). For the var
starting value in bellcurve.model
I first calculate each and every Country.Region
's variance and then selected one of the smaller ones, settling on the 20%-tile because that worked for minimum_days<-45
and minimum_days<-1
(full data).
## use this approach https://stackoverflow.com/a/27549369/2727349
## but instead of SSlogis
## use nlraa::SSbell
## we can get starting values for d and mu.
plot(new_cases~day, data=dat[N>=minimum_days])
nls_starting_values <- nls(new_cases ~ SSbell(day, ymax, a, b, xc), data=dat[N>=minimum_days])
summary(nls_starting_values)
## calculate a starting value for var: the median of Country.Region specific variances
variances <- sort(dat[N>=minimum_days, var(new_cases, na.rm=T), by=Country.Region]$V1)
range(variances)
quantile(variances, seq(0,1,0.05))
##var.start <- median(dat[N>=minimum_days, var(new_cases, na.rm=T), by=Country.Region]$V1, na.rm=T)
##var.start <- min(dat[N>=minimum_days, var(new_cases, na.rm=T), by=Country.Region]$V1, na.rm=T)
var.start <- quantile(variances, 0.20)
var.start
Again -- I had to play around a bit -- but if you take the 20%-tile of empirical variances the nlme
call for bellcurve.model
the code will run for minimum_days <- 45
as well as minimum_days <- 1
(full dataset). With our starting values calculated and potentially our dataset restricted, we fit two nlme
models: one using SSbell
and the other bellcurve.model
. Both will run for minimum_days<-45
and only bellcurve.model
will converge for minimum_days<-1
(full dataset). Lucky!
SSbell
, the other for bellcurve.model
# NLME Model: using ssbell and nls_starting values for ymax, a, b, xc
baseModel.ssbell <- nlme(new_cases ~ SSbell(day, ymax, a, b, xc),
data = dat[N>=minimum_days],
fixed = list(ymax ~ 1, a ~ 1, b~1, xc~1),
random = ymax + a + b + xc ~ 1|Country.Region,
start = list(fixed = c( coef(nls_starting_values) )),
na.action = na.omit,
control=list(maxIter=1e6, msMaxIter = 1e6, msVerbose = TRUE)
)
# NLME Model: using bellcurve.model and nls_starting values for ymax for d;
# NLME Model: using bellcurve.model and nls_starting values for xc for mu;
# NLME Model: using bellcurve.model and var.start for var
baseModel.bellcurve <- nlme(new_cases ~ bellcurve.model(d, mu, var, x = day),
data = dat[N>=minimum_days],
fixed = list(d ~ 1, mu ~ 1, var ~ 1),
random = d + mu + var ~ 1|Country.Region,
start = list(fixed = c(coef(nls_starting_values)[c(1,4)], var.start )),
na.action = na.omit,
control=list(maxIter=1e6,
msMaxIter = 1e6,
msVerbose = TRUE)
)
baseModel.ssbell
baseModel.bellcurve
Showing the output for minimum_days<-45
shows similar fits (look at likelihood).
## 5 countries DATA minimum_days <- 45
> baseModel.ssbell
Nonlinear mixed-effects model fit by maximum likelihood
Model: new_cases ~ SSbell(day, ymax, a, b, xc)
Data: dat[N >= minimum_days]
Log-likelihood: -2264.706
Fixed: list(ymax ~ 1, a ~ 1, b ~ 1, xc ~ 1)
ymax a b xc
1.026599e+03 -1.164625e-02 -1.626079e-04 1.288924e+01
Random effects:
Formula: list(ymax ~ 1, a ~ 1, b ~ 1, xc ~ 1)
Level: Country.Region
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
ymax 1.731582e+03 ymax a b
a 4.467475e-06 -0.999
b 1.016493e-04 -0.998 0.999
xc 3.528238e+00 1.000 -0.999 -0.999
Residual 8.045770e+02
Number of Observations: 278
Number of Groups: 5
## 5 countries DATA minimum_days <- 45
> baseModel.bellcurve
Nonlinear mixed-effects model fit by maximum likelihood
Model: new_cases ~ bellcurve.model(d, mu, var, x = day)
Data: dat[N >= minimum_days]
Log-likelihood: -2267.406
Fixed: list(d ~ 1, mu ~ 1, var ~ 1)
d mu var
965.81525 12.73549 58.22049
Random effects:
Formula: list(d ~ 1, mu ~ 1, var ~ 1)
Level: Country.Region
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
d 1.633432e+03 d mu
mu 2.815230e+00 1.00
var 5.582379e-03 -0.01 -0.01
Residual 8.127932e+02
Number of Observations: 278
Number of Groups: 5
Showing output for baseModel.bellcurve
for minimum_days<-1
(full dataset) shows an improvement from baseModel.naive
where I blindly and arbitrarily fiddled with the starting values for the sole purpose of getting rid of the errors and warnings.
## FULL DATA minimum_days <- 1
> baseModel.bellcurve
Nonlinear mixed-effects model fit by maximum likelihood
Model: new_cases ~ bellcurve.model(d, mu, var, x = day)
Data: dat[N >= minimum_days]
Log-likelihood: -20487.86
Fixed: list(d ~ 1, mu ~ 1, var ~ 1)
d mu var
1044.52101 25.00288 81.79004
Random effects:
Formula: list(d ~ 1, mu ~ 1, var ~ 1)
Level: Country.Region
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
d 4.285702e+03 d mu
mu 5.423452e+00 0.545
var 3.008379e-03 0.028 0.050
Residual 4.985698e+02
Number of Observations: 2641
Number of Groups: 126
Log-likelihood: -20487.86
for baseModel.bellcurve
vs Log-likelihood: -23451.37
for baseModel.naive
The Corr matrix in baseModel.bellcurve
looks better, too.
library(nlme)
library(nlraa)
library(data.table)
dat <- readRDS("dat.rds")
str(dat)
setDT(dat)
# Bell curve function defined by three parameters
bellcurve.model <- function(d, mu, var, x) {
f <- d*exp(-((x - mu)^2) / (2*var))
return(f)
}
dim(dat)
head(dat)
## how many countries?
dat[,uniqueN(Country.Region)]
## number of days/rows per country.region:
print(dat[,.N,by=Country.Region][order(N)], nrows=200)
dat[,N:=.N, by=Country.Region]
minimum_days <- 45
## model restricted to following countries/regions:
data.frame(Country.Region=(as.character(unique(dat[N>=minimum_days]$Country.Region))))
## use this approach https://stackoverflow.com/a/27549369/2727349
## but instead of SSlogis
## use nlraa::SSbell
## we can get starting values for d and mu.
plot(new_cases~day, data=dat[N>=minimum_days])
nls_starting_values <- nls(new_cases ~ SSbell(day, ymax, a, b, xc), data=dat[N>=minimum_days])
summary(nls_starting_values)
## calculate a starting value for var: the median of Country.Region specific variances
variances <- sort(dat[N>=minimum_days, var(new_cases, na.rm=T), by=Country.Region]$V1)
range(variances)
quantile(variances, seq(0,1,0.05))
##var.start <- median(dat[N>=minimum_days, var(new_cases, na.rm=T), by=Country.Region]$V1, na.rm=T)
##var.start <- min(dat[N>=minimum_days, var(new_cases, na.rm=T), by=Country.Region]$V1, na.rm=T)
var.start <- quantile(variances, 0.20)
var.start
# NLME Model: using ssbell and nls_starting values for ymax, a, b, xc
baseModel.ssbell <- nlme(new_cases ~ SSbell(day, ymax, a, b, xc),
data = dat[N>=minimum_days],
fixed = list(ymax ~ 1, a ~ 1, b~1, xc~1),
random = ymax + a + b + xc ~ 1|Country.Region,
start = list(fixed = c( coef(nls_starting_values) )),
na.action = na.omit,
control=list(maxIter=1e6, msMaxIter = 1e6, msVerbose = TRUE)
)
# NLME Model: using bellcurve.model and nls_starting values for ymax for d;
# NLME Model: using bellcurve.model and nls_starting values for xc for mu;
# NLME Model: using bellcurve.model and var.start for var
baseModel.bellcurve <- nlme(new_cases ~ bellcurve.model(d, mu, var, x = day),
data = dat[N>=minimum_days],
fixed = list(d ~ 1, mu ~ 1, var ~ 1),
random = d + mu + var ~ 1|Country.Region,
start = list(fixed = c(coef(nls_starting_values)[c(1,4)], var.start )),
na.action = na.omit,
control=list(maxIter=1e6, msMaxIter = 1e6, msVerbose = TRUE)
)
baseModel.ssbell
baseModel.bellcurve