I am trying to group ~11,000 genes of a species of plant (Arabidopsis thaliana), by data I have of their change in expression levels, in response to light exposure.
The raw values, per gene, are continuous random variables, but I wish to discretise the values to have, say, 20 discrete classes.
So instead of:
change_in_expression = array([-2.2, -1.1, -1.2, ..., 0.6, -1. , -0.9])
I have class outputs:
change_in_expression = array(["3_to_4","2_to_3","1_to_2",...])
I plot the distribution with seaborn's distplot()
, which I believe uses a KDE:
import seaborn as sns
d = array([-2.2, -1.1, -1.2, ..., 0.6, -1. , -0.9]) # = change_in_expression
dis = sns.distplot(d, fit=stats.laplace, kde=False)
plt.title("Distribution of Differential Expression")
plt.xlabel("Log2FoldChange in expression")
plt.ylabel("Frequency")
plt.show()
And I am aware that matplotlib.pyplot's
hist()
allows you to extract bins, when the default settings "automatically" generate these groupings...
The question is, how can I group my genes? This is a broader question than just asking for the seaborn
version of matplotlib's
hist()
... since seaborn's
distplot
uses a KDE.
Doesn't really seem like I can get the bins from the ax
object created by seaborn, by looking at the methods available from:
dir(sns.distplot(d, fit=stats.laplace, kde=False)
I guess, one way would be to inspect the guts of seaborn's distplot
source code, to figure out how they bin the data prior to plotting... but that is way beyond my unicorn skillset...
Seaborn calls pyplot.hist
, which in turn calls numpy.histogram
. So one can check what seaborn uses as bins
argument if none is specified. I.e. with a
being the data,
bins = min(_freedman_diaconis_bins(a), 50)
where 1
def _freedman_diaconis_bins(a):
"""Calculate number of hist bins using Freedman-Diaconis rule."""
# From https://stats.stackexchange.com/questions/798/
a = np.asarray(a)
if len(a) < 2:
return 1
h = 2 * iqr(a) / (len(a) ** (1 / 3))
# fall back to sqrt(a) bins if iqr is 0
if h == 0:
return int(np.sqrt(a.size))
else:
return int(np.ceil((a.max() - a.min()) / h))
and iqr
2
def iqr(a):
"""Calculate the IQR for an array of numbers."""
a = np.asarray(a)
q1 = stats.scoreatpercentile(a, 25)
q3 = stats.scoreatpercentile(a, 75)
return q3 - q1
All of this should roughly be the same as
bins = min(len(numpy.histogram_bin_edges(a, bins="fd")), 50)
or
bins = 50 if len(numpy.histogram_bin_edges(a, bins="fd")) > 50 else "fd"
then passing bins
to pyplot.hist
,
plt.hist(a, bins=bins)