python pandas matplotlib seaborn plot-annotations

How does one insert statistical annotations (stars or p-values)


This seems like a trivial question, but I've been searching for a while and can't seem to find an answer. It also seems like something that should be a standard part of these packages. Does anyone know if there is a standard way to include statistical annotation between distribution plots in seaborn?

For example, between two box or swarmplots?

Example: the yellow distribution is significantly different from the others (by Wilcoxon). How can I display that visually?


Solution

  • A brace / bracket can be plotted directly with matplotlib.pyplot.plot or matplotlib.axes.Axes.plot, and annotations can be added with matplotlib.pyplot.text or matplotlib.axes.Axes.text.

    seaborn categorical plots are 0-indexed, whereas box plots drawn with matplotlib and pandas are placed at range(1, N+1) by default. This can be changed with the positions parameter, so that the same x1 and x2 values work in all three examples.

    seaborn is a high-level API for matplotlib, and pandas.DataFrame.plot uses matplotlib as the default backend.

    Imports and DataFrame

    import seaborn as sns
    import matplotlib.pyplot as plt
    
    # dataframe in long form for seaborn
    tips = sns.load_dataset("tips")
    
    # dataframe in wide form for plotting with pandas.DataFrame.plot
    df = tips.pivot(columns='day', values='total_bill')
    
    # data as a list of lists for plotting directly with matplotlib (no nan values allowed)
    data = [df[c].dropna().tolist() for c in df.columns]
    

    seaborn

    sns.boxplot(x="day", y="total_bill", data=tips, palette="PRGn")
    
    # statistical annotation
    x1, x2 = 2, 3   # columns 'Sat' and 'Sun' (first column: 0, see plt.xticks())
    y, h, col = tips['total_bill'].max() + 2, 2, 'k'
    
    plt.plot([x1, x1, x2, x2], [y, y+h, y+h, y], lw=1.5, c=col)
    plt.text((x1+x2)*.5, y+h, "ns", ha='center', va='bottom', color=col)
    
    plt.show()
    

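The example above hard-codes the label "ns". A minimal sketch of turning a p-value into a star label follows. The function name p_to_label and the cutoffs (0.05, 0.01, 0.001, 0.0001) are assumptions that reflect a common convention; they are not part of seaborn or matplotlib. The p-value itself would come from a test of your choice, for example scipy.stats.mannwhitneyu for two independent groups.

```python
def p_to_label(p, thresholds=((1e-4, "****"), (1e-3, "***"), (1e-2, "**"), (0.05, "*"))):
    """Map a p-value to a star label.

    The cutoffs are an assumed, commonly used convention, not a standard.
    Thresholds must be ordered from the smallest cutoff to the largest.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"p-value must be in [0, 1], got {p!r}")
    for cutoff, label in thresholds:
        if p <= cutoff:
            return label
    return "ns"  # not significant at the largest cutoff


# illustrative p-values only, not computed from the tips data
for p in (0.2, 0.03, 0.005, 0.0005, 0.00001):
    print(p, p_to_label(p))
```

The returned string can be passed as the text argument of plt.text or ax.text in place of "ns".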

    pandas.DataFrame.plot

    ax = df.plot(kind='box', positions=range(len(df.columns)))
    
    x1, x2 = 2, 3
    y, h, col = df.max().max() + 2, 2, 'k'
    
    ax.plot([x1, x1, x2, x2], [y, y+h, y+h, y], lw=1.5, c=col)
    ax.text((x1+x2)*.5, y+h, "ns", ha='center', va='bottom', color=col)
    


    matplotlib

    plt.boxplot(data, positions=range(len(data)))
    
    x1, x2 = 2, 3
    
    y, h, col = max(map(max, data)) + 2, 2, 'k'
    
    plt.plot([x1, x1, x2, x2], [y, y+h, y+h, y], lw=1.5, c=col)
    plt.text((x1+x2)*.5, y+h, "ns", ha='center', va='bottom', color=col)
    



    tips.head()

       total_bill   tip     sex smoker  day    time  size
    0       16.99  1.01  Female     No  Sun  Dinner     2
    1       10.34  1.66    Male     No  Sun  Dinner     3
    2       21.01  3.50    Male     No  Sun  Dinner     3
    3       23.68  3.31    Male     No  Sun  Dinner     2
    4       24.59  3.61  Female     No  Sun  Dinner     4
    

    df.head()

    day  Thur  Fri  Sat    Sun
    0     NaN  NaN  NaN  16.99
    1     NaN  NaN  NaN  10.34
    2     NaN  NaN  NaN  21.01
    3     NaN  NaN  NaN  23.68
    4     NaN  NaN  NaN  24.59
    

    data

    [[27.2, 22.76, 17.29, ..., 20.53, 16.47, 18.78],
     [28.97, 22.49, 5.75, ..., 13.42, 16.27, 10.09],
     [20.65, 17.92, 20.29, ..., 29.03, 27.18, 22.67, 17.82],
     [16.99, 10.34, 21.01, ..., 18.15, 23.1, 15.69]]
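
Since the same two plot and text calls appear in all three examples, they can be wrapped in a small helper. This is only a sketch: add_stat_annotation is a hypothetical name, not a matplotlib or seaborn function, and the data values are made up for illustration. It also shows one way to stack a second bracket above the first so they don't overlap.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt


def add_stat_annotation(ax, x1, x2, y, h, text, color="k", lw=1.5):
    """Draw a bracket from x1 to x2 at height y, with legs of height h,
    and center `text` just above it. Returns the line and text artists."""
    line, = ax.plot([x1, x1, x2, x2], [y, y + h, y + h, y], lw=lw, c=color)
    label = ax.text((x1 + x2) * 0.5, y + h, text,
                    ha="center", va="bottom", color=color)
    return line, label


# made-up values, one list per box
data = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 5, 7, 9], [4, 6, 8, 10]]

fig, ax = plt.subplots()
ax.boxplot(data, positions=range(len(data)))  # 0-indexed, like seaborn

top = max(map(max, data))
step = 2  # vertical gap between the data and each bracket

# bracket between the boxes at positions 2 and 3
line1, label1 = add_stat_annotation(ax, 2, 3, top + step, 1, "ns")
# second bracket between positions 0 and 3, placed higher to avoid overlap
line2, label2 = add_stat_annotation(ax, 0, 3, top + 2 * step + 1, 1, "*")
```

Because the helper takes an Axes, it works unchanged with the Axes returned by sns.boxplot or df.plot(kind='box', ...).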