I'm plotting two distributions as histplots, and would like to visualize the difference between them. The distributions are rather similar:
The code I am using to generate one of these plots looks like this:
sns.histplot(
data=dfs_downvoted_percentages["only_pro"],
ax=axes[0],
x="percentage_downvoted",
bins=30,
stat="percent",
)
My supervisor suggested plotting the difference between the normalized distributions, i.e. displaying the result of subtracting one plot from the other. The end result should be a plot where some bins go below 0 (wherever a bin in plot 2 is larger than the same bin in plot 1). That way, similarities between the plots are erased and differences are highlighted.
If seaborn cannot do this directly: I know how to normalize the data and calculate the percentage for each bin by hand. What I couldn't find is a kind of plot that consists of bins and allows negative bins. I know how to create a lineplot with 30 data points showing the calculated difference, but I'd rather have it look visually similar to the original plots, with bins instead of a line. What kind of plot could I use for that?
Calculate the histogram of each distribution with np.histogram, which returns hist and bin_edges.
The same bin_edges must be used for both function calls.
Compute the difference between the two hist arrays (h_diff in the code below), and plot it against bin_edges.
Plot h_diff as a bar plot.
There is one more bin_edge than there are bars, so select all but the last value, bin_edges[:-1], for the x-axis labels passed to x=.
The ticks of sns.barplot are 0-indexed, so reset the ticks with an extra tick, offset them by -0.5, and relabel the ticks with all the bin_edges.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# sample data
np.random.seed(2023)
a = np.random.normal(50, 15, (100,))
b = np.random.normal(30, 8, (100,))
# dataframe from sample distributions
df = pd.DataFrame({'a': a, 'b': b})
# calculate the histogram for each distribution
bin_edges = np.arange(10, 91, 10)
a_hist, _ = np.histogram(df.a, bins=bin_edges)
b_hist, _ = np.histogram(df.b, bins=bin_edges)
# calculate the difference
h_diff = a_hist - b_hist
# plot
fig, ax = plt.subplots(figsize=(7, 5))
sns.barplot(x=bin_edges[:-1], y=h_diff, color='tab:blue', ec='k', width=1, alpha=0.8, ax=ax)
ax.set_xticks(ticks=np.arange(0, 9)-0.5, labels=bin_edges)
ax.margins(x=0.1)
_ = ax.set(title='Difference between Sample A and B: hist(a) - hist(b)', ylabel='Difference', xlabel='Bin Ranges')
# plot the two original distributions with the same bins, for comparison
fig, ax = plt.subplots(figsize=(7, 5))
sns.histplot(data=df, multiple='dodge', common_bins=True, ax=ax, bins=bin_edges)
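The example above subtracts raw counts, while the question uses stat="percent" with bins=30. A minimal sketch of the same idea on percentages follows, assuming each sample is normalized by its own size and that the bin edges are shared across both samples. The helper name percent_hist_diff and the sample data are made up for illustration. It draws with matplotlib's ax.bar and align='edge', so the bars sit on the real bin edges and no tick relabeling is needed.

```python
import numpy as np
import matplotlib.pyplot as plt


def percent_hist_diff(first, second, bins=30):
    """Return (diff, edges): per-bin percentage of `first` minus that of `second`.

    Hypothetical helper, not part of numpy or seaborn.
    """
    # shared edges spanning both samples, so the bins line up
    edges = np.histogram_bin_edges(np.concatenate([first, second]), bins=bins)
    first_counts, _ = np.histogram(first, bins=edges)
    second_counts, _ = np.histogram(second, bins=edges)
    # convert counts to percent of each sample (assumed equivalent of stat="percent")
    first_pct = 100 * first_counts / first_counts.sum()
    second_pct = 100 * second_counts / second_counts.sum()
    return first_pct - second_pct, edges


# made-up sample data; the two samples deliberately differ in size
rng = np.random.default_rng(2023)
a = rng.normal(50, 15, 300)
b = rng.normal(45, 15, 200)

diff, edges = percent_hist_diff(a, b, bins=30)

# bars start at each left edge and span the full bin width
fig, ax = plt.subplots(figsize=(7, 5))
ax.bar(edges[:-1], diff, width=np.diff(edges), align='edge',
       color='tab:blue', ec='k', alpha=0.8)
ax.axhline(0, color='k', lw=1)  # reference line at zero difference
ax.set(title='Difference in percent per bin: a - b',
       ylabel='Difference (percentage points)', xlabel='Value')
```

Because both samples are normalized to 100 percent over the same edges, the bar heights sum to zero, and the result does not depend on the two samples having the same number of rows. Call plt.show() to display the figure in a script.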