pythongroupinghistogramseaborndiscrete-mathematics

How can I extract the bins from seaborn's KDE distplot object?


Background

I am trying to group ~11,000 genes of a species of plant (Arabidopsis thaliana), by data I have of their change in expression levels, in response to light exposure.

The raw values, per gene, are continuous random variables, but I wish to discretise the values to have, say, 20 discrete classes.

So instead of:

change_in_expression = array([-2.2, -1.1, -1.2, ...,  0.6, -1. , -0.9])

I have class outputs:

change_in_expression = array(["3_to_4","2_to_3","1_to_2",...])

What I have tried

I plot the distribution with seaborn's distplot(), which I believe uses a KDE:

import seaborn as sns

d = array([-2.2, -1.1, -1.2, ...,  0.6, -1. , -0.9]) # = change_in_expression

dis = sns.distplot(d, fit=stats.laplace, kde=False)

plt.title("Distribution of Differential Expression")
plt.xlabel("Log2FoldChange in expression")
plt.ylabel("Frequency")
plt.show()

enter image description here

And I am aware that matplotlib.pyplot's hist() allows you to extract bins, when the default settings "automatically" generate these groupings...


Summary Question

The question is, how can I group my genes? This is a broader question than just asking for the seaborn version of matplotlib's hist()... since seaborn's distplot uses a KDE.

Doesn't really seem like I can get the bins from the ax object created by seaborn, by looking at the methods available from:

dir(sns.distplot(d, fit=stats.laplace, kde=False)

I guess, one way would be to inspect the guts of seaborn's distplot source code, to figure out how they bin the data prior to plotting... but that is way beyond my unicorn skillset...


Solution

  • Seaborn calls pyplot.hist, which in turn calls numpy.histogram. So one can check what seaborn uses as bins argument if none is specified. I.e. with a being the data,

    bins = min(_freedman_diaconis_bins(a), 50)
    

    where 1

    def _freedman_diaconis_bins(a):
        """Calculate number of hist bins using Freedman-Diaconis rule."""
        # From https://stats.stackexchange.com/questions/798/
        a = np.asarray(a)
        if len(a) < 2:
            return 1
        h = 2 * iqr(a) / (len(a) ** (1 / 3))
        # fall back to sqrt(a) bins if iqr is 0
        if h == 0:
            return int(np.sqrt(a.size))
        else:
            return int(np.ceil((a.max() - a.min()) / h))
    

    and iqr 2

    def iqr(a):
        """Calculate the IQR for an array of numbers."""
        a = np.asarray(a)
        q1 = stats.scoreatpercentile(a, 25)
        q3 = stats.scoreatpercentile(a, 75)
        return q3 - q1
    

    All of this should roughly be the same as

    bins = min(len(numpy.histogram_bin_edges(a, bins="fd")), 50)
    

    or

    bins = 50 if len(numpy.histogram_bin_edges(a, bins="fd")) > 50 else "fd"
    

    then passing bins to pyplot.hist,

    plt.hist(a, bins=bins)