Tags: python, numpy, matplotlib, seaborn, histplot

How to plot the difference between two histograms


I'm plotting two distributions as histplots, and would like to visualize the difference between them. The distributions are rather similar:

[Image: my plots]

The code I am using to generate one of these plots looks like this:

sns.histplot(
    data=dfs_downvoted_percentages["only_pro"],
    ax=axes[0],
    x="percentage_downvoted",
    bins=30,
    stat="percent",
)

My supervisor suggested plotting the difference between the normalized distributions, i.e. subtracting one plot from the other bin by bin. The result should be a plot where some bins go below 0 (wherever a bin in plot 2 is larger than the same bin in plot 1). That way, similarities between the plots cancel out and differences are highlighted.

  1. Does this make sense? The plots are part of a paper which will hopefully be published; I haven't seen such a plot before, but as he explained it, it makes sense to me. Are there better ways to visualize what I want to express? I already have another plot where I filter out all values with x=0, so that the other ones become more visible.
  2. Is there an easy way to achieve this utilizing seaborn?

If not: I know how I can normalize the data and calculate percentage for each bin by hand. But what I couldn't find is a kind of plot that consists of bins and offers the possibility to have negative bins. I know how I could create a lineplot with 30 data points showing the calculated difference, but I'd rather have it visually similar to the original plots with bins instead of a line. What kind of plot could I use for that?


Solution

  • import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    import numpy as np
    
    # sample data
    np.random.seed(2023)
    a = np.random.normal(50, 15, (100,))
    b = np.random.normal(30, 8, (100,))
    
    # dataframe from sample distributions
    df = pd.DataFrame({'a': a, 'b': b})
    
    # calculate the histogram for each distribution
    bin_edges = np.arange(10, 91, 10)
    
    a_hist, _ = np.histogram(df.a, bins=bin_edges) 
    b_hist, _ = np.histogram(df.b, bins=bin_edges) 
    
    # calculate the difference
    h_diff = a_hist - b_hist
    
    # plot
    fig, ax = plt.subplots(figsize=(7, 5))
    sns.barplot(x=bin_edges[:-1], y=h_diff, color='tab:blue', ec='k', width=1, alpha=0.8, ax=ax)
    ax.set_xticks(ticks=np.arange(0, 9)-0.5, labels=bin_edges)
    ax.margins(x=0.1)
    _ = ax.set(title='Difference between Sample A and B: hist(a) - hist(b)', ylabel='Difference', xlabel='Bin Ranges')
    

    [Image: bar plot of the per-bin difference hist(a) - hist(b)]

    fig, ax = plt.subplots(figsize=(7, 5))
    sns.histplot(data=df, multiple='dodge', common_bins=True, ax=ax, bins=bin_edges)
    

    [Image: dodged histograms of a and b on the shared bins]
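The code above subtracts raw counts, while the question asks about normalized distributions (`stat="percent"`). Below is a minimal sketch of that variant, assuming the two samples can have different sizes and that 30 shared bins over the combined data range are acceptable. The sample data and the helper `percent_hist` are made up for illustration. It uses `ax.bar` from matplotlib directly, since that places bars at the real bin edges and handles negative heights.

```python
import numpy as np
import matplotlib.pyplot as plt

# sample data standing in for the two "percentage_downvoted" columns (assumption)
rng = np.random.default_rng(2023)
a = rng.normal(50, 15, 100)
b = rng.normal(30, 8, 300)  # different size, so raw counts are not comparable

# shared bin edges over the combined range, so bins line up one-to-one
bin_edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=30)


def percent_hist(values, edges):
    """Hypothetical helper: per-bin share in percent (bins sum to 100)."""
    counts, _ = np.histogram(values, bins=edges)
    return 100 * counts / counts.sum()


a_pct = percent_hist(a, bin_edges)
b_pct = percent_hist(b, bin_edges)

# difference in percentage points; negative where b has the larger share
pct_diff = a_pct - b_pct

# bars anchored at the left bin edge, each as wide as its bin
fig, ax = plt.subplots(figsize=(7, 5))
ax.bar(bin_edges[:-1], pct_diff, width=np.diff(bin_edges), align='edge',
       color='tab:blue', ec='k', alpha=0.8)
ax.axhline(0, color='k', lw=1)  # zero line makes the sign of each bin easy to read
_ = ax.set(title='Difference of normalized histograms: a - b',
           xlabel='Value', ylabel='Difference (percentage points)')
```

Because both histograms sum to 100, the differences sum to zero, so the positive and negative bars balance out.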