python pandas accumulate histplot

X values in a cumulative histogram in Python


I have a dataframe with values representing one item's correspondence with another item (as a percentage), for example, the number of characters in one string that match another string. Here is some sample data:

**pident**
100
100
51.515
55.405
20
91.667
86.207
58.621
77.778

I would like to show this in a cumulative histogram of the number of matching items. Because the values are percentages, each bin needs to include the bins above it rather than the bins below it, which is the opposite of what the cumulative option does by default. For example, the bin for 90 percent should contain all of the items above 90.

To work around the fact that the accumulation runs from the lower values to the higher values, I manipulated the values themselves:

df1['abs_pident'] = np.abs(100 - df1['pident'])
sns.histplot(data=df1, x='abs_pident', cumulative=True, bins=10)

This way the bars show correctly, but the x-axis values are wrong, and this is where I am stuck. The values need to go down from 100 (highest) to 20 (lowest), and I can't find a way to do that. Any idea how to do that, or how to create the chart the way I want without having to manipulate the data, would be highly appreciated :)

This is what my chart looks like right now:

Chart


Solution

  • You can use a formatter to reverse the 100-pident operation and invert the axis:

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    import matplotlib.ticker as tkr  # classes for tick locating and formatting

    df1 = pd.DataFrame()
    df1["pident"] = [100, 100, 51.515, 55.405, 20, 91.667, 86.207, 58.621, 77.778]
    # Flip the values so that a high pident lands in the low bins
    df1["abs_pident"] = np.abs(100 - df1["pident"])
    sns.histplot(data=df1, x="abs_pident", cumulative=True, bins=10)

    def numfmt(x, pos):
        # Custom formatter: map the tick position back to the original pident value
        return "{}".format(100 - x)

    xfmt = tkr.FuncFormatter(numfmt)  # wrap the function in a formatter object

    plt.gca().xaxis.set_major_formatter(xfmt)
    plt.gca().invert_xaxis()
    


    Note

    Not sure if it is intended, but it might be better to set the bin edges explicitly instead of the bin count. Setting bins=range(0, 110, 10) gives bin edges at multiples of 10 up to 100 (note how the bin edges in the plot above fall at values that cannot be read directly from the plot).
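
As a rough sketch of the other option raised in the question (building the chart without transforming the data), the reverse accumulation can be computed directly over explicit bin edges. The helper names reverse_cumulative_counts and plot_reverse_cumulative are hypothetical, the bins are assumed to be half-open with the last bin including its right edge, and the plotting part uses plain matplotlib bars rather than seaborn:

```python
from bisect import bisect_right

pident = [100, 100, 51.515, 55.405, 20, 91.667, 86.207, 58.621, 77.778]
edges = list(range(0, 110, 10))  # bin edges at 0, 10, ..., 100


def reverse_cumulative_counts(values, edges):
    """For each bin, count the values that fall in that bin or any higher bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # assumption: values outside the edges are ignored
        # Bins are half-open [lo, hi); the last bin also takes its right edge.
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    # Accumulate from the highest bin down to the lowest.
    total = 0
    for i in range(len(counts) - 1, -1, -1):
        total += counts[i]
        counts[i] = total
    return counts


def plot_reverse_cumulative(values, edges):
    """Draw the bars; matplotlib is imported here because only plotting needs it."""
    import matplotlib.pyplot as plt

    heights = reverse_cumulative_counts(values, edges)
    widths = [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]
    fig, ax = plt.subplots()
    ax.bar(edges[:-1], heights, width=widths, align="edge", edgecolor="black")
    ax.set_xlabel("pident")
    ax.set_ylabel("items in this bin or above")
    ax.invert_xaxis()  # optional: put 100 on the left and 20 on the right
    return ax


counts = reverse_cumulative_counts(pident, edges)
```

With the sample data, counts comes out as [9, 9, 9, 8, 8, 8, 5, 5, 4, 3], so the 90 to 100 bin holds 3 items and every lower bin includes the ones above it. Calling plot_reverse_cumulative(pident, edges) followed by plt.show() should draw the bars with the real pident values on the x axis, with no formatter needed. If seaborn is not a requirement, matplotlib's own hist documents cumulative=-1 for reversing the direction of accumulation, which may be simpler still.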