I'm using the predict.hurdle function from the pscl package to estimate the probabilities of observing 0, 1, 2, ..., N events in a data set. Using the example in ?predict.hurdle:
data("bioChemists", package = "pscl")
fm_hp1 <- hurdle(art ~ ., data = bioChemists)
summary(fm_hp1)
head(predict(fm_hp1, newdata = bioChemists, type = "prob"))
# returns a matrix of probabilities too large to show here
Each row of this matrix is an observation and each column gives the probability of observing that count, 0 through 19 in this case.
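As a quick sanity check on the structure (a sketch based on my reading of the data, which has one row per observation and one column per count from 0 to 19):
p <- predict(fm_hp1, newdata = bioChemists, type = "prob")
dim(p)  # one row per observation, one column per count (0-19 here)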
summary(rowSums(predict(fm_hp1, newdata = bioChemists, type = "prob")))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9998 1.0000 1.0000 1.0000 1.0000 1.0000
But some rows don't sum to 1 as they should. They're close, though, so maybe it isn't an issue...
But I need to calibrate the intercept terms. "Calibration" in my industry is an acceptable way of saying "change the estimated parameters". Yes, I know there are many reasons why this is not a good idea statistically (it intentionally biases the estimates). However, I would still expect the code to work and the predictions to adhere to the rules of probability.
# Change the count model intercept
fm_hp1$coefficients$count["(Intercept)"] <- 3
summary(rowSums(predict(fm_hp1, newdata = bioChemists, type = "prob")))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.001521 0.434300 0.647400 0.602000 0.818400 0.983900
Now we see some major problems with the resultant probabilities.
I'm tempted to simply re-normalize these probabilities so that each row sums to 1 via:
old.p <- predict(fm_hp1, newdata = bioChemists, type = "prob")
new.p <- t(apply(X = old.p, MARGIN = 1, FUN = function(x) x/sum(x)))
summary(rowSums(new.p))
But I worry that whatever is causing the probabilities not to sum to 1 also makes this re-normalization inappropriate. Is my worry founded? Is there another element of fm_hp1 that I need to modify in order to change the intercept term but still get correct probability predictions?
The count distributions supported in hurdle() all have support on the non-negative integers 0, 1, 2, ... (up to infinity). Thus, for the predicted probabilities to sum exactly to 1, you would have to sum them over all of these integers.
As infinitely many values are not useful in practice, the predict() method just provides the probabilities for a finite number of integers, by default 0, 1, 2, ..., max(y), i.e., up to the maximum response observed. In the case of the bioChemists data this is 0, 1, ..., 19.
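You can confirm the default support directly from the response variable:
max(bioChemists$art)  # 19, matching the 0-19 columns above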
Thus, by summing only over these probabilities you ignore the probabilities of all higher counts. Usually this probability weight is small, as your first summary illustrates. However, by increasing the intercept(s) you make higher counts much more likely than they were in the original data set (the expectations increase by a factor of about 10!), so you need to sum over a much larger support. You can do this by supplying the at argument:
summary(rowSums(predict(fm_hp1, type = "prob", at = 0:50)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001789 1.000000 1.000000 0.994000 1.000000 1.000000
summary(rowSums(predict(fm_hp1, type = "prob", at = 0:100)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9889 1.0000 1.0000 1.0000 1.0000 1.0000
summary(rowSums(predict(fm_hp1, type = "prob", at = 0:200)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 1 1 1 1
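If you prefer not to pick the cutoff by trial and error, here is a small sketch (my own suggestion, not part of predict.hurdle) that keeps doubling the support until every row sums to 1 within a tolerance:
upper <- max(bioChemists$art)  # start from the default support, 0:19
repeat {
  p <- predict(fm_hp1, type = "prob", at = 0:upper)
  if (all(abs(rowSums(p) - 1) < 1e-8)) break  # all rows effectively sum to 1
  upper <- 2 * upper  # otherwise, double the support and try again
}
upper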
As already brought up in the comments above, I doubt that this change of the intercept is really a good strategy here, but that is a different debate...