I need to estimate entropy for many variables in vocabulary data, some of which have only small samples. I have previously done this using the Chao-Shen entropy estimator in R, but now I would like to be able to do it in Python.
Does anyone know of an implementation in Python for a "coverage-adjusted" entropy estimator, Chao-Shen or similar?
I've looked at scipy.stats.entropy, and it doesn't seem to offer any coverage-adjusted estimator (though I've used it plenty for empirical entropy calculations).
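For reference, what I've been computing with scipy.stats.entropy so far is just the plug-in (maximum-likelihood) estimate, along these lines (the counts here are made up for illustration):

```python
import numpy as np
from scipy.stats import entropy

counts = np.array([4, 3, 2, 1, 1])  # made-up word counts for illustration

# scipy.stats.entropy normalizes the counts to probabilities
# and uses the natural log by default (a `base` argument is available)
H_plugin = entropy(counts)
```

This plug-in estimate is known to be biased downward for small samples, which is why I'm after a coverage-adjusted estimator.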
Here is a Python translation of the R source code:
import numpy as np

# Python translation of https://github.com/cran/entropy/blob/master/R/entropy.ChaoShen.R
def CAE_entropy(counts):
    counts = np.asarray(counts)
    counts = counts[counts > 0]          # drop unobserved categories
    n = np.sum(counts)                   # sample size
    p = counts / n                       # empirical (ML) probabilities
    f1 = np.count_nonzero(counts == 1)   # number of singletons
    if f1 == n:                          # avoid zero coverage when every count is 1
        f1 = n - 1
    C = 1 - f1 / n                       # estimated sample coverage (Good-Turing)
    pa = C * p                           # coverage-adjusted probabilities
    la = 1 - (1 - pa)**n                 # prob. each category is observed at least once
    return -np.sum(pa * np.log(pa) / la) # Horvitz-Thompson-corrected entropy (nats)
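As a quick sanity check on made-up counts (the function is repeated here so the snippet runs on its own): when singletons are present, the Chao-Shen estimate should come out higher than the plug-in estimate, since the plug-in estimator is biased downward.

```python
import numpy as np

def CAE_entropy(counts):  # the translation from above, repeated for a standalone snippet
    counts = np.asarray(counts)
    counts = counts[counts > 0]
    n = np.sum(counts)
    p = counts / n
    f1 = np.count_nonzero(counts == 1)
    if f1 == n:
        f1 = n - 1
    C = 1 - f1 / n
    pa = C * p
    la = 1 - (1 - pa)**n
    return -np.sum(pa * np.log(pa) / la)

counts = np.array([4, 3, 2, 1, 1])   # made-up counts with two singletons

p = counts / counts.sum()            # plug-in (ML) estimate for comparison
H_plugin = -np.sum(p * np.log(p))

H_cs = CAE_entropy(counts)
# H_cs comes out higher than H_plugin on this sample, as expected
print(f"plug-in: {H_plugin:.4f}, Chao-Shen: {H_cs:.4f}")
```

Note that the result is in nats (natural log); divide by np.log(2) if you want bits.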