I'm trying to implement a Bayesian model for football score prediction (English Premier League). I assume the number of scores follows the Poisson distribution with parameter lambda and this is the general formula:
log(home_lambda) = mu + hfa + home_attack - away_def
log(away_lambda) = mu + away_attack - home_def
where mu and hfa are the intercept and home field advantage respectively. And this is my data (380 x 8)
So basically I have 380x2 equations and 42 parameters (mu, hfa, attack ability of 20 teams and defence ability of 20 teams). Take the first match as an example:
FTHG[1] ~ dpois(home_lambda[1]) # FTHG is full time home goal
log(home_lambda[1]) = mu + hfa + att_Crystal_Palace - att_Arsenal
FTAG[1] ~ dpois(lam_a[1]) # FTAG is full time away goal
log(away_lambda[1]) = mu + att_Arsenal - def_Crystal_Palace
Below is my entire code
bayes_mod_string = " model{
for (i in 1:380){
FTHG[i] ~ dpois(lam_h[i])
log(lam_h[i]) = mu + hfa + h_att[i] - a_def[i]
FTAG[i] ~ dpois(lam_a[i])
log(lam_a[i]) = mu + a_att[i] - h_def[i]
h_att[i] ~ dnorm(0, 1/1e6) # the attack and defence ability of each team follows a Normal distribution
a_def[i] ~ dnorm(0, 1/1e6)
a_att[i] ~ dnorm(0, 1/1e6)
h_def[i] ~ dnorm(0, 1/1e6)
}
mu ~ dnorm(0, 1/1e6)
hfa ~ dnorm(0, 1/1e6)
}"
data_jags = as.list(test_set)
params = c('mu', 'hfa', unique(test_set$h_att), unique(test_set$a_def)) # 42 in total
bayes_mod_string = jags.model(textConnection(bayes_mod_string),
data = data_jags, n.chains = 3)
update(bayes_mod_string, 1e3)
bayes_mod_sim = coda.samples(model = bayes_mod_string,
variable.names = params, n.iter = 1e3)
bayes_mod_csim = as.mcmc(do.call(rbind, bayes_mod_sim))
The error message is
Warning: Failed to set trace monitor for att_Crystal_Palace Variable att_Crystal_Palace not found
Not only with the "att_Crystal_Palace" parameter but also with all parameters.
I tried typing out each team attack and defence ability separately, putting them in and out the for loop
att_Crystal_Palace ~ dnorm(0, 1/1e6)
att_Fulham ~ dnorm(0, 1/1e6)
att_Bournemouth ~ dnorm(0, 1/1e6)
att_Leeds ~ dnorm(0, 1/1e6)
att_Newcastle ~ dnorm(0, 1/1e6)
att_Tottenham ~ dnorm(0, 1/1e6)
att_Everton ~ dnorm(0, 1/1e6)
att_Leicester ~ dnorm(0, 1/1e6)
att_Man_United ~ dnorm(0, 1/1e6)
att_West_Ham ~ dnorm(0, 1/1e6)
att_Aston_Villa ~ dnorm(0, 1/1e6)
att_Arsenal ~ dnorm(0, 1/1e6)
att_Brighton ~ dnorm(0, 1/1e6)
att_Man_City ~ dnorm(0, 1/1e6)
att_Southampton ~ dnorm(0, 1/1e6)
att_Wolves ~ dnorm(0, 1/1e6)
att_Brentford ~ dnorm(0, 1/1e6)
att_Nottm_Forest ~ dnorm(0, 1/1e6)
att_Chelsea ~ dnorm(0, 1/1e6)
att_Liverpool ~ dnorm(0, 1/1e6)
def_Crystal_Palace ~ dnorm(0, 1/1e6)
def_Fulham ~ dnorm(0, 1/1e6)
def_Bournemouth ~ dnorm(0, 1/1e6)
def_Leeds ~ dnorm(0, 1/1e6)
def_Newcastle ~ dnorm(0, 1/1e6)
def_Tottenham ~ dnorm(0, 1/1e6)
def_Everton ~ dnorm(0, 1/1e6)
def_Leicester ~ dnorm(0, 1/1e6)
def_Man_United ~ dnorm(0, 1/1e6)
def_West_Ham ~ dnorm(0, 1/1e6)
def_Aston_Villa ~ dnorm(0, 1/1e6)
def_Arsenal ~ dnorm(0, 1/1e6)
def_Brighton ~ dnorm(0, 1/1e6)
def_Man_City ~ dnorm(0, 1/1e6)
def_Southampton ~ dnorm(0, 1/1e6)
def_Wolves ~ dnorm(0, 1/1e6)
def_Brentford ~ dnorm(0, 1/1e6)
def_Nottm_Forest ~ dnorm(0, 1/1e6)
def_Chelsea ~ dnorm(0, 1/1e6)
def_Liverpool ~ dnorm(0, 1/1e6)
This time the error is
RUNTIME ERROR: Compilation error on line 50. Attempt to redefine node def_Liverpool[1]
What the problem actually is and how can I tackle it? If you'd like to get the data, it's here https://www.football-data.co.uk/englandm.php, season 2022/2023
Your primary problem is here
log(lam_h[i]) = mu + hfa + h_att[i] - a_def[i]
h_att[i]
resolves to, say, att_Arsenal
, as you say, but you don't define a node (variable) called att_Arsenal
. You have similar problems for h_def
, a_att
and a_def
.
The easiest way to deal with this number of predictor variables is to use arrays rather than stand-alone scalars. I've used data from the 2023/24 season from the link you provided. My input data looks like this:
df
# A tibble: 39 × 7
Div Date Time HomeTeam AwayTeam FTHG FTAG
<chr> <chr> <chr> <int> <int> <int> <int>
1 E0 11/08/2023 20:00 6 13 0 3
2 E0 12/08/2023 12:30 1 16 2 1
3 E0 12/08/2023 15:00 3 19 1 1
4 E0 12/08/2023 15:00 5 12 4 1
5 E0 12/08/2023 15:00 9 10 0 1
6 E0 12/08/2023 15:00 17 8 0 1
7 E0 12/08/2023 17:30 15 2 5 1
8 E0 13/08/2023 14:00 4 18 2 2
9 E0 13/08/2023 16:30 7 11 1 1
10 E0 14/08/2023 20:00 14 20 1 0
# … with 29 more rows
where I have indexed both HomeTeam
and AwayTeam
alphabetically. So, Arsenal are 1
and Wolves are 20
.
Then I can write my model string as
bayes_mod_string = "
data {
mu ~ dnorm(0, 1/1e6)
hfa ~ dnorm(0, 1/1e6)
for (j in 1:nTeams) {
h_att[j] ~ dnorm(0, 1/1e6)
a_def[j] ~ dnorm(0, 1/1e6)
a_att[j] ~ dnorm(0, 1/1e6)
h_def[j] ~ dnorm(0, 1/1e6)
log(lam_h[j]) = mu + hfa + h_att[j] - a_def[j]
log(lam_a[j]) = mu + a_att[j] - h_def[j]
}
}
model {
for (i in 1:nMatches) {
FTHG[i] ~ dpois(lam_h[home[i]])
FTAG[i] ~ dpois(lam_a[away[i]])
}
}
"
Note the double-indexing of lam_a
and lam_h
in the model block.
I update the model and obtain samples with
library(rjags)
bayes_model = jags.model(textConnection(bayes_mod_string),
data = data_jags, n.chains = 3)
data_jags <- list(
nTeams = 20,
nMatches = nrow(df),
FTHG = df$FTHG,
FTAG = df$FTAG,
home = df$HomeTeam,
away = df$AwayTeam
)
update(bayes_model, 1e3)
samples <- jags.samples(
model = bayes_model,
variable.names = c('mu', 'hfa', paste0("h_att[", 1:20, "]"), paste0("a_def[", 1:20, "]")),
n.iter = 1e3
)
Note that I've changed coda.samples
to jags.samples
.
I omit printing samples
because the output is rather long.
Personally, I prefer runjags
to rjags
. You can use the model string directly rather than passing through a text connection or writing to a file, the model fitting code is more compact and I feel the output is both more compact and more informative.
library(runjags)
run.jags(
bayes_mod_string,
data = data_jags,
monitor = c('mu', 'hfa', paste0("h_att[", 1:20, "]"), paste0("a_def[", 1:20, "]"))
)
Compiling rjags model...
Calling the simulation using the rjags method...
Note: the model did not require adaptation
Burning in the model for 4000 iterations...
|**************************************************| 100%
Running the model for 10000 iterations...
|**************************************************| 100%
Simulation complete
Calculating summary statistics...
Note: The monitored variables 'mu', 'hfa', 'h_att[1]', 'h_att[2]', 'h_att[3]', 'h_att[4]', 'h_att[5]', 'h_att[6]', 'h_att[7]', 'h_att[8]', 'h_att[9]',
'h_att[10]', 'h_att[11]', 'h_att[12]', 'h_att[13]', 'h_att[14]', 'h_att[15]', 'h_att[16]', 'h_att[17]', 'h_att[18]', 'h_att[19]', 'h_att[20]', 'a_def[1]',
'a_def[2]', 'a_def[3]', 'a_def[4]', 'a_def[5]', 'a_def[6]', 'a_def[7]', 'a_def[8]', 'a_def[9]', 'a_def[10]', 'a_def[11]', 'a_def[12]', 'a_def[13]',
'a_def[14]', 'a_def[15]', 'a_def[16]', 'a_def[17]', 'a_def[18]', 'a_def[19]' and 'a_def[20]' appear to be non-stochastic; they will not be included in the
convergence diagnostic
Finished running the simulation
JAGS model summary statistics from 20000 samples (chains = 2; adapt+burnin = 5000):
Lower95 Upper95 Mean SD Mode MCerr MC%ofSD SSeff Lag.10 psrf
mu 827.99 827.99 827.99 0 -- -- -- -- -- --
hfa 1122.6 1122.6 1122.6 0 -- -- -- -- -- --
h_att[1] 287.62 287.62 287.62 0 -- -- -- -- -- --
h_att[2] -696.89 -696.89 -696.89 0 -- -- -- -- -- --
h_att[3] 332.89 332.89 332.89 0 -- -- -- -- -- --
h_att[4] 941.39 941.39 941.39 0 -- -- -- -- -- --
h_att[5] 795.37 795.37 795.37 0 -- -- -- -- -- --
<further output omitted>
IMPORTANT The lack of variability in the output indicates a problem with either the model specification or my implementation. Personally, I'd have fitted a hierarchical model. I don't have time to explore further right now, but at least I've given you clues about how to proceed.