python matplotlib seaborn boxplot

Boxplot with scatter points: preventing overlap when categories are similar


I am trying to create a boxplot with different categories and overlay scatter points on top of it. The problem I am encountering is that when the results across categories are very similar, the boxplots appear to overlap (see the figure, specifically panel B, TSS metric).

I have tried adjusting the width of the boxes, but this ends up misaligning the scatter points relative to the boxes, which is not ideal.

Could anyone suggest a better way to prevent the boxplots from overlapping while keeping the scatter points properly aligned?

Thank you so much!

[Figure: boxplots with overlaid scatter points, panels A to D; panel B shows the TSS metric]

Here is the code I have been using:

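import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# simulated_RF is a long-format DataFrame with the columns
# "Prevalence", "Metric", "Value" and "Type"
map_levels = {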
    0.5: "High",
    0.2: "Medium",
    0.1: "Low",
    0.01: "Extremely low"
}

simulated_RF["Prevalence_level"] = simulated_RF["Prevalence"].map(map_levels)

order_levels = ["High", "Medium", "Low", "Extremely low"]
simulated_RF["Prevalence_level"] = pd.Categorical(
    simulated_RF["Prevalence_level"], categories=order_levels, ordered=True
)


metrics_to_plot = ["AUC", "TSS", "BrierScore", "LogLoss"]

palette_custom  = ["cornflowerblue", "orange"]

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

panel_labels = ["A)", "B)", "C)", "D)"]

for i, metric in enumerate(metrics_to_plot):
    
    ax = axes[i]
    df_m = simulated_RF[simulated_RF["Metric"] == metric]
    
    
    sns.boxplot(
        data=df_m,
        x="Prevalence_level",
        y="Value",
        hue="Type",
        dodge=True,
        ax=ax,
        palette=palette_custom
    )
    
    
    sns.stripplot(
        data=df_m,
        x="Prevalence_level",
        y="Value",
        hue="Type",
        dodge=True,
        palette=["black", "black"],
        size=4,
        jitter=True,
        alpha=0.5,
        ax=ax
    )
    
   
    ax.set_title(metric, fontsize=16)
    ax.set_xlabel("Prevalence level", fontsize=13)
    ax.set_ylabel(metric, fontsize=13)
    ax.set_xticks(range(len(order_levels)))
    ax.set_xticklabels(order_levels, fontsize=11)
    ax.tick_params(axis="y", labelsize=11)
    
    ax.text(-0.1, 1.05, panel_labels[i],
            transform=ax.transAxes, fontsize=16, fontweight="bold")
    
    handles, labels = ax.get_legend_handles_labels()
    if i == 0:
        legend_handles = handles[:2]
        legend_labels = labels[:2]
    ax.get_legend().remove()

fig.legend(
    legend_handles, legend_labels,
    loc="lower center", ncol=2,
    fontsize=13, title_fontsize=13
)

fig.tight_layout(rect=[0, 0.05, 1, 0.95])
plt.show() 

Solution

  • sns.boxplot() has a gap= parameter (added in seaborn 0.13) that puts space between dodged boxes. gap is a shrink factor along the x-axis: gap=0.1 makes each dodged box 10% narrower, and the default is 0. The boxes stay centered on the same positions, so they still line up with the stripplot.

    You can also set showfliers=False to hide the boxplot's outlier markers, since the stripplot already draws every point.

    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # load a test dataframe
    df = sns.load_dataset('iris')
    
    fig, ax = plt.subplots(figsize=(14, 5))
    
    # convert to long format
    df_long = df.melt(id_vars='species', var_name='measurement', value_name='value',
                      value_vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
    
    sns.boxplot(data=df_long, hue='species', x='measurement', y='value', palette='turbo',
                showfliers=False, gap=0.1, ax=ax)
    sns.stripplot(data=df_long, hue='species', x='measurement', y='value', dodge=True, legend=False,
                  palette=['pink']*3, edgecolor='blue', linewidth=1, ax=ax)
    
    ax.set(xlabel='', ylabel='')
    sns.despine()
    plt.tight_layout()
    plt.show()
    

    [Figure: dodged boxplot with stripplot]
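
To see why gap= keeps the points aligned while changing width= does not, here is a rough sketch of the dodge arithmetic in plain Python. It reflects my understanding of how seaborn places dodged elements, not a public API. The assumptions are that both plots start from a total width of 0.8 per category, that stripplot always dodges over that 0.8 regardless of the boxplot's width=, and that gap shrinks each box by a factor of (1 - gap). The function names dodged_boxes and strip_offsets are made up for illustration.

```python
# Sketch of the dodge arithmetic (assumed behavior, not seaborn's actual code).

def dodged_boxes(n_hue, width=0.8, gap=0.0):
    """Return (center_offset, box_width) per hue level, relative to one x tick."""
    slot = width / n_hue  # horizontal space reserved for each hue level
    boxes = []
    for i in range(n_hue):
        center = (i - (n_hue - 1) / 2) * slot  # center of the i-th slot
        # gap narrows the box inside its slot; the slot center does not move
        boxes.append((center, slot * (1 - gap)))
    return boxes


def strip_offsets(n_hue, width=0.8):
    """Offsets of dodged strip points; assumes a fixed total width of 0.8."""
    slot = width / n_hue
    return [(i - (n_hue - 1) / 2) * slot for i in range(n_hue)]


if __name__ == "__main__":
    strips = [round(s, 3) for s in strip_offsets(2)]
    settings = [("default", {}), ("width=0.5", {"width": 0.5}), ("gap=0.1", {"gap": 0.1})]
    for label, kwargs in settings:
        boxes = [(round(c, 3), round(w, 3)) for c, w in dodged_boxes(2, **kwargs)]
        print(f"{label:10s} boxes (center, width): {boxes}  strip offsets: {strips}")
```

Under these assumptions, width=0.5 pulls the box centers toward the tick while the strip points stay where they were, which matches the misalignment described in the question. gap=0.1 only narrows each box around an unchanged center.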