I have several distributions that come from biological data.
These distributions are expected to follow either a unimodal distribution (mean = 0.5), a bimodal distribution (0.33 and 0.66) or a trimodal distribution (0.25, 0.5, 0.75).
I want to simulate these "theoretical" distributions in Python or R so that I can compare them with the ones I got from the biological data. How can I do this?
Moreover, I wonder which parameters should be used to compare them: shape, standard deviation, skewness and kurtosis?
Data that appear to follow a unimodal distribution can often be modelled as a mixture of one or two Gaussians. Likewise, data that appear to follow a bimodal distribution may sometimes be best modelled as a mixture of two or three. If you still have the raw data from which the histograms were created, you could use the facilities of sklearn to identify the 'best' Gaussian mixture for your data. The code at http://www.astroml.org/book_figures/chapter4/fig_GMM_1D.html shows how. Once you have such a model, you can use the technique shown in that code to generate pseudo-random samples.
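If you only need samples from the "theoretical" distributions themselves, a minimal NumPy sketch could look like the one below. The common standard deviation (0.05), the equal component weights, the sample size and the helper name `sample_mixture` are all assumptions made for illustration; the question does not specify them.

```python
import numpy as np


def sample_mixture(means, sd, n, rng):
    """Draw n values from an equal-weight Gaussian mixture with a common sd.

    Hypothetical helper: equal weights and a shared sd are assumptions.
    """
    means = np.asarray(means, dtype=float)
    # Pick a component for every draw, then sample from that component.
    component = rng.integers(0, len(means), size=n)
    return rng.normal(loc=means[component], scale=sd)


rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
SD = 0.05    # assumed spread of each peak; not given in the question
N = 10_000   # assumed sample size

theoretical = {
    "unimodal": sample_mixture([0.5], SD, N, rng),
    "bimodal": sample_mixture([0.33, 0.66], SD, N, rng),
    "trimodal": sample_mixture([0.25, 0.5, 0.75], SD, N, rng),
}

# Print the mean and standard deviation of each simulated sample.
for name, x in theoretical.items():
    print(name, round(x.mean(), 3), round(x.std(), 3))
```

The simulated arrays can then be plotted as histograms next to the biological data, or passed to whatever comparison you settle on.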
I see that the code is:
gmm = GMM(3, n_iter=1)
gmm.means_ = np.array([[-1], [0], [3]])
gmm.covars_ = np.array([[1.5], [1], [0.5]]) ** 2
gmm.weights_ = np.array([0.3, 0.5, 0.2])
Thus it requires the number of Gaussians in the mixture, together with their means, their covariances (here the squared standard deviations) and a set of weights, which are presumably the relative frequencies with which each of the Gaussians is sampled.
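That reading of the weights can be checked with a small NumPy sketch that reuses the means, standard deviations and weights from the code quoted above. It does not use the GMM class at all; the sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

means = np.array([-1.0, 0.0, 3.0])
sds = np.array([1.5, 1.0, 0.5])       # covars_ in the quoted code are these squared
weights = np.array([0.3, 0.5, 0.2])   # mixing proportions; they sum to 1

n = 20_000  # arbitrary sample size
# Each draw first picks a component with probability equal to its weight...
labels = rng.choice(len(weights), size=n, p=weights)
# ...and is then sampled from that component's Gaussian.
x = rng.normal(means[labels], sds[labels])

# Observed share of draws per component; should be close to `weights`.
shares = np.bincount(labels, minlength=len(weights)) / n
print(shares)
```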
The idea is to fit a GMM several times, with from one to (say) four Gaussians in the mixture, and then compare the measures of quality available for these models given the sample, known as AIC and BIC, in order to judge the best number of components. The parameters set as above are only needed to generate a sample; with real data you would fit to the data directly.
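A rough sketch of that comparison follows. It uses `GaussianMixture`, which replaces the older `GMM` class in current scikit-learn releases (`GMM` was removed in version 0.20). The bimodal stand-in data, its standard deviation and the sample size are assumptions; with real measurements you would put your own values in `data`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Stand-in for the biological measurements: an assumed bimodal sample.
data = np.concatenate([
    rng.normal(0.33, 0.05, 500),
    rng.normal(0.66, 0.05, 500),
])
X = data.reshape(-1, 1)  # scikit-learn expects shape (n_samples, n_features)

results = {}
for k in range(1, 5):
    # n_init restarts the fit a few times to reduce the risk of a poor optimum.
    model = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    results[k] = (model.aic(X), model.bic(X))
    print(k, round(results[k][0], 1), round(results[k][1], 1))

# Lower is better for both criteria; BIC penalises extra components more.
best_k = min(results, key=lambda k: results[k][1])
print("components preferred by BIC:", best_k)
```

The fitted means, weights and covariances of the preferred model (`means_`, `weights_`, `covariances_`) can then be compared with the expected values such as 0.33 and 0.66.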