python statsmodels anova scipy.stats

One-way ANOVA using statsmodels


I am trying to perform a one-way ANOVA between three groups. I have been able to get the F-statistic and the p-value from the F-distribution using scipy.stats. However, I would prefer an R-like ANOVA table that includes the sums of squares. My code for the scipy.stats one-way ANOVA is given below. All of the statsmodels ANOVA documentation uses pandas DataFrames. Any help on how I can adapt my existing code for statsmodels would be greatly appreciated.

import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Read each whitespace-delimited sample file into a NumPy array
data1 = pd.read_table('/Users/Hrihaan/Desktop/Sample_A.txt', dtype=float, header=None, sep=r'\s+').values
data2 = pd.read_table('/Users/Hrihaan/Desktop/Sample_B.txt', dtype=float, header=None, sep=r'\s+').values
data3 = pd.read_table('/Users/Hrihaan/Desktop/Sample_C.txt', dtype=float, header=None, sep=r'\s+').values

# Use the first column of each file as that group's observations
Param_1 = data1[:, 0]
Param_2 = data2[:, 0]
Param_3 = data3[:, 0]

# One-way ANOVA across the three groups
f_oneway(Param_1, Param_2, Param_3)

Solution

  • Put your data in long format, with one column for the values and one for the group label. First, I generate something that looks like your data:

    import numpy as np
    import pandas as pd
    from scipy.stats import f_oneway
    
    # Simulated stand-ins for the three samples, with unequal group sizes
    np.random.seed(111)
    
    Param_1 = np.random.normal(0, 1, 50)
    Param_2 = np.random.normal(0, 1, 40)
    Param_3 = np.random.normal(0, 1, 30)
    
    f_oneway(Param_1, Param_2, Param_3)
    
    F_onewayResult(statistic=0.43761348608371037, pvalue=0.6466275522246159)
    

    You can build the long data frame as shown below, or build it right after reading in the files by labelling each one with its group and combining them with pd.concat:

    # 'val' holds every observation; 'data' holds the matching group label
    df = pd.DataFrame({'val': np.concatenate([Param_1, Param_2, Param_3]),
                       'data': np.repeat(['A', 'B', 'C'], [len(Param_1), len(Param_2), len(Param_3)])})
    
    df.head()
    
            val data
    0 -1.133838    A
    1  0.384319    A
    2  1.496554    A
    3 -0.355382    A
    4 -0.787534    A
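
The pd.concat route mentioned above could look roughly like the sketch below. The helper name `load_group` and the in-memory `io.StringIO` stand-ins for the three sample files are assumptions made so the example runs on its own; with real data you would pass the file paths instead.

```python
import io

import pandas as pd


def load_group(source, label):
    """Hypothetical helper: read the first column of a whitespace-delimited
    file and tag every row with its group label."""
    raw = pd.read_csv(source, sep=r"\s+", header=None, dtype=float)
    return pd.DataFrame({"val": raw.iloc[:, 0], "data": label})


# Made-up stand-ins for Sample_A.txt, Sample_B.txt and Sample_C.txt
sample_a = io.StringIO("1.0 9.0\n2.0 9.0\n3.0 9.0\n")
sample_b = io.StringIO("2.0 9.0\n4.0 9.0\n")
sample_c = io.StringIO("5.0 9.0\n6.0 9.0\n7.0 9.0\n8.0 9.0\n")

# Stack the three labelled frames into one long-format frame
df = pd.concat(
    [
        load_group(sample_a, "A"),
        load_group(sample_b, "B"),
        load_group(sample_c, "C"),
    ],
    ignore_index=True,
)

print(df)
```

The resulting frame has the same two columns, `val` and `data`, so it can be passed to the model formula unchanged.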
    

    Now fit a linear model and run an ANOVA on it:

    import statsmodels.api as sm
    from statsmodels.formula.api import ols
    
    mod = ols('val ~ data',data=df).fit()
    
    sm.stats.anova_lm(mod, typ=1) 
    
                 df      sum_sq   mean_sq         F    PR(>F)
    data        2.0    0.794858  0.397429  0.437613  0.646628
    Residual  117.0  106.256352  0.908174       NaN       NaN
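
To see where the entries of such a table come from, the sums of squares can also be computed by hand with NumPy. This is only a sketch for cross-checking: the helper `one_way_anova_table` and its dictionary keys are hypothetical names, not part of statsmodels or scipy, and the data are simulated with the same seed and group sizes as above.

```python
import numpy as np
from scipy.stats import f, f_oneway


def one_way_anova_table(groups):
    """Hypothetical helper: one-way ANOVA quantities computed by hand."""
    values = np.concatenate(groups)
    grand_mean = values.mean()

    # Between-group sum of squares: squared distance of each group mean
    # from the grand mean, weighted by group size
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group (residual) sum of squares: spread around each group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between = len(groups) - 1
    df_within = len(values) - len(groups)

    # F is the ratio of the two mean squares
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    # Upper-tail probability of the F distribution
    p_value = f.sf(f_stat, df_between, df_within)

    return {
        "df_between": df_between,
        "df_within": df_within,
        "ss_between": ss_between,
        "ss_within": ss_within,
        "F": f_stat,
        "PR(>F)": p_value,
    }


np.random.seed(111)
groups = [np.random.normal(0, 1, n) for n in (50, 40, 30)]

table = one_way_anova_table(groups)
reference = f_oneway(*groups)

for key, value in table.items():
    print(key, value)
print("matches f_oneway:", np.isclose(table["F"], reference.statistic))
```

For a single factor like this one, the hand-computed F and p-value should agree with `f_oneway`, and the between and within rows can be compared against the `data` and `Residual` rows printed by `anova_lm`.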