I'm trying to approximate Beta distribution using a library pomegranate. However, when I try to approximate parameters from the generated data, I got very different parameters. The code to reproduce such error is as follows
import numpy as np
from pomegranate import *
X = np.random.beta(1, 5, size=10000).reshape(-1, 1) # sample from beta distribution with alpha = 1, beta = 5
print(BetaDistribution.from_samples(X).parameters) # approximate beta parameters
>>> [0.0, 10000.0] # error here
I'm not sure where the error comes from. It seems like the test file test_distributions.py produces the right answer. If there is any suggestion on how to fix pomegranate
or creating custom model in pomegranate
would be highly appreciated.
Note I'm using Python 3.6.8
Answer according to this issue, BetaDistribution
provided in the current library is beta-binomial distribution not beta distribution. That's why the model couldn't fit on the sample of beta distribution.
Workaround solution
I got the workaround solution using BayesianOptimization
library. Basically, I try to maximize log likelihood of the distribution from the given data using Bayesian Optimization library. The following code generalizes quite fine with mixture of distributions as well.
from bayes_opt import BayesianOptimization
data = np.random.beta(1, 5, size=10000) # create data
def beta_loss(a, b):
beta_loss = BetaDistribution(a, b).probability(data)
return np.log(beta_loss).sum()
optimizer = BayesianOptimization(
f=beta_loss,
pbounds={'a': (0.5, 5),
'b': (0.5, 20)},
random_state=10
)
# optimize the parameters
optimizer.maximize(
init_points=5,
n_iter=100
)
# plot approximated distribution vs. distribution of the data
x = np.arange(0, 1, 0.01)
plt.hist(data, density=True, bins=100, alpha=0.1)
a, b = [v for k, v in optimizer.max['params'].items()]
plt.plot(x, BetaDistribution(a, b).probability(x))
plt.show()
Here, I just give an example of how to optimize parameters of mixture of Beta distribution and Gaussian distribution:
from bayes_opt import BayesianOptimization
# example data of beta/gaussian distribution
data = np.hstack((np.random.beta(1, 10, size=2000),
np.random.randn(1000) * 0.2 + 0.6))
data = data[np.logical_and(data >= 0.0, data <= 1.0)]
def loss_bimodal(a, b, mu, sigma, w1):
beta_loss = BetaDistribution(a, b).probability(data)
norm_loss = NormalDistribution(mu, sigma).probability(data)
return np.log(w1 * beta_loss + (1 - w1) * norm_loss).sum()
def pdf_bimodal(a, b, mu, sigma, w1, x=np.arange(0, 1, 0.01)):
return w1 * BetaDistribution(a, b).probability(x) + \
(1 - w1) * NormalDistribution(mu, sigma).probability(x)
optimizer = BayesianOptimization(
f=loss_bimodal,
pbounds={'mu': (0., 1.),
'sigma': (0., 1.),
'a': (0.5, 5),
'b': (1, 25),
'w1': (0., 1.)},
random_state=1
)
optimizer.maximize(
init_points=5,
n_iter=100
)
Using the optimized parameters to plot the distribution as follows:
a, b, mu, sigma, w1 = [v for k, v in optimizer.max['params'].items()]
x = np.arange(0, 1, 0.01)
plt.plot(x, pdf(a, b, mu, sigma, w1, x))
plt.hist(data, density=True, bins=100)
plt.show()