I have a dataframe with 1M rows in which one column holds the log posterior probability. Many of these values are extremely small when exponentiated, so I have to use log_sum_exp() rather than log(sum(exp(x))) to obtain accurate results. However, I would like to obtain the cumulative probability implied by these posterior probabilities. If simple summing worked, I could use log(cumsum(exp(x))), but for the reasons above this does not give accurate results. Currently I am running a for-loop:
for (k in 1:nrow(y)) {
  y$cumlogsum[k] <- log_sum_exp(y$log_pp[1:k])
}
This is hopelessly inefficient, since it recomputes the sum from scratch at every step (quadratic in the number of rows), and takes hours to run. Is there an efficient way to do this in R, e.g. by vectorizing?
The log-sum-exp trick is:
log_sum_exp <- function(x) {
  # Shift by the maximum so that exp() cannot overflow
  max.x <- max(x)
  max.x + log(sum(exp(x - max.x)))
}
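To illustrate why the trick is needed (toy values, not from my actual data), the naive form overflows where the shifted form stays finite:

```r
log_sum_exp <- function(x) {
  # Shift by the maximum so that exp() cannot overflow
  max.x <- max(x)
  max.x + log(sum(exp(x - max.x)))
}

x <- c(1000, 1000, 1000)
log(sum(exp(x)))  # Inf: exp(1000) overflows a double
log_sum_exp(x)    # 1000 + log(3), computed stably
```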
And a cumsum version would simply be:
log_cumsum_exp <- function(x) {
  # Same idea, but with a running sum; uses one global shift
  max.x <- max(x)
  max.x + log(cumsum(exp(x - max.x)))
}
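As a quick sanity check (toy data I made up, not my real posterior), this single-pass version can be compared element by element against the per-prefix loop:

```r
log_sum_exp <- function(x) {
  max.x <- max(x)
  max.x + log(sum(exp(x - max.x)))
}

log_cumsum_exp <- function(x) {
  # One global shift, one cumsum: O(n) instead of O(n^2)
  max.x <- max(x)
  max.x + log(cumsum(exp(x - max.x)))
}

set.seed(1)
lp <- log(runif(5))  # small toy log-probabilities

# Per-prefix loop, as in the question
loop <- vapply(seq_along(lp),
               function(k) log_sum_exp(lp[1:k]),
               numeric(1))

all.equal(log_cumsum_exp(lp), loop)  # agrees on this example
```

Note that the loop re-shifts by the prefix maximum at each step, while log_cumsum_exp() shifts once by the global maximum, so the two are mathematically identical but may differ slightly in accuracy when the values span a very wide range.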