numpymatplotlibscipyanova

Tukey-Test Grouping and plotting in SciPy


I'm trying to plot results from a Tukey test, but I am struggling with putting data into groups based on a P-Value. This is the equivalent in R which I am trying to replicate. I have been using the SciPy one-way ANOVA tests and the Tukey test statsmodel but can't get these groups done in the same way.

Any help is greatly appreciated

I've also just found this another example in R of what I want to do in python


Solution

  • I have been struggling to do the same thing. I found a paper that tells you how to code the letters.
    Hans-Peter Piepho (2004) An Algorithm for a Letter-Based Representation of All-Pairwise Comparisons, Journal of Computational and Graphical Statistics, 13:2, 456-466, DOI: 10.1198/1061860043515

    Doing the coding was a little tricky as you need to check and replicate columns and then combine columns. I tried to add some comments to the colde. I figured out a method where you can run tukeyhsd and then from the results compute the letters. It should be possible to turn this into a function. Or hopefully part of tukeyhsd. My data is not posted but it is a column of data and then a column describing the groups. The groups for me are the five boroughs of NYC. You can also just change the comments and use random data the first time.

    # Read data.  Comment out the next ones to use random data.  
    df=pd.read_excel('anova_test.xlsx')
    #n=1000
    #df = pd.DataFrame(columns=['Groups','Data'],index=np.arange(n))
    #df['Groups']=np.random.randint(1, 4,size=n)
    #df['Data']=df['Groups']*np.random.random_sample(size=n)
    
    
    # define columns for data and then grouping
    col_to_group='Groups'
    col_for_data='Data'
    
    #Now take teh data and regroup for anova
    samples = [cols[1] for cols in df.groupby(col_to_group)[col_for_data]]    #I am not sure how this works but it makes an numpy array for each group 
    f_val, p_val = stats.f_oneway(*samples)  # I am not sure what this star does but this passes all the numpy arrays correctly
    #print('F value: {:.3f}, p value: {:.3f}\n'.format(f_val, p_val))
    
    # this if statement can be uncommmented if you don't won't to go furhter with out p<0.05
    #if p_val<0.05:    #If the p value is less than 0.05 it then does the tukey
    mod = MultiComparison(df[col_for_data], df[col_to_group])
    thsd=mod.tukeyhsd()
    #print(mod.tukeyhsd())
    
    #this is a function to do Piepho method.  AN Alogrithm for a letter based representation of al-pairwise comparisons.  
    tot=len(thsd.groupsunique)
    #make an empty dataframe that is a square matrix of size of the groups. #set first column to 1
    df_ltr=pd.DataFrame(np.nan, index=np.arange(tot),columns=np.arange(tot))
    df_ltr.iloc[:,0]=1
    count=0
    df_nms = pd.DataFrame('', index=np.arange(tot), columns=['names'])  # I make a dummy dataframe to put axis labels into.  sd stands for signifcant difference
    
    for i in np.arange(tot):   #I loop through and make all pairwise comparisons. 
        for j in np.arange(i+1,tot):
            #print('i=',i,'j=',j,thsd.reject[count])
            if thsd.reject[count]==True:
                for cn in np.arange(tot):
                    if df_ltr.iloc[i,cn]==1 and df_ltr.iloc[j,cn]==1: #If the column contains both i and j shift and duplicat
                        df_ltr=pd.concat([df_ltr.iloc[:,:cn+1],df_ltr.iloc[:,cn+1:].T.shift().T],axis=1)
                        df_ltr.iloc[:,cn+1]=df_ltr.iloc[:,cn]
                        df_ltr.iloc[i,cn]=0
                        df_ltr.iloc[j,cn+1]=0
                    #Now we need to check all columns for abosortpion.
                    for cleft in np.arange(len(df_ltr.columns)-1):
                        for cright in np.arange(cleft+1,len(df_ltr.columns)):
                            if (df_ltr.iloc[:,cleft].isna()).all()==False and (df_ltr.iloc[:,cright].isna()).all()==False: 
                                if (df_ltr.iloc[:,cleft]>=df_ltr.iloc[:,cright]).all()==True:  
                                    df_ltr.iloc[:,cright]=0
                                    df_ltr=pd.concat([df_ltr.iloc[:,:cright],df_ltr.iloc[:,cright:].T.shift(-1).T],axis=1)
                                if (df_ltr.iloc[:,cleft]<=df_ltr.iloc[:,cright]).all()==True:
                                    df_ltr.iloc[:,cleft]=0
                                    df_ltr=pd.concat([df_ltr.iloc[:,:cleft],df_ltr.iloc[:,cleft:].T.shift(-1).T],axis=1)
    
            count+=1
    
    #I sort so that the first column becomes A        
    df_ltr=df_ltr.sort_values(by=list(df_ltr.columns),axis=1,ascending=False)
    
    # I assign letters to each column
    for cn in np.arange(len(df_ltr.columns)):
        df_ltr.iloc[:,cn]=df_ltr.iloc[:,cn].replace(1,chr(97+cn)) 
        df_ltr.iloc[:,cn]=df_ltr.iloc[:,cn].replace(0,'')
        df_ltr.iloc[:,cn]=df_ltr.iloc[:,cn].replace(np.nan,'') 
    
    #I put all the letters into one string
    df_ltr=df_ltr.astype(str)
    df_ltr.sum(axis=1)
    #print(df_ltr)
    #print('\n')
    #print(df_ltr.sum(axis=1))
    
    #Now to plot like R with a violing plot
    fig,ax=plt.subplots()
    df.boxplot(column=col_for_data, by=col_to_group,ax=ax,fontsize=16,showmeans=True
                        ,boxprops=dict(linewidth=2.0),whiskerprops=dict(linewidth=2.0))  #This makes the boxplot
    
    ax.set_ylim([-10,20])
    
    grps=pd.unique(df[col_to_group].values)   #Finds the group names
    grps.sort() # This is critical!  Puts the groups in alphabeical order to make it match the plotting
    
    props=dict(facecolor='white',alpha=1)
    for i,grp in enumerate(grps):   #I loop through the groups to make the scatters and figure out the axis labels. 
    
        x = np.random.normal(i+1, 0.15, size=len(df[df[col_to_group]==grp][col_for_data]))
        ax.scatter(x,df[df[col_to_group]==grp][col_for_data],alpha=0.5,s=2)
        name="{}\navg={:0.2f}\n(n={})".format(grp
                                ,df[df[col_to_group]==grp][col_for_data].mean()
                                ,df[df[col_to_group]==grp][col_for_data].count())
        df_nms['names'][i]=name 
        ax.text(i+1,ax.get_ylim()[1]*1.1,df_ltr.sum(axis=1)[i],fontsize=10,verticalalignment='top',horizontalalignment='center',bbox=props)
    
    
    ax.set_xticklabels(df_nms['names'],rotation=0,fontsize=10)
    ax.set_title('')
    fig.suptitle('')
    
    fig.savefig('anovatest.jpg',dpi=600,bbox_inches='tight')
    

    Results showing the letters above plots using the tukeyhsd