python scipy chi-squared

Chi-squared hypothesis test with several binary variables


I have data about plants grown in a nursery, with one variable for plant health and several binary factors.

I want to test whether any of the factors influences plant health, so I thought a chi-squared test would be the best method.

My approach is below, but I get stuck after the cross-tab.

import pandas as pd

# Example data
df = pd.DataFrame({'plant_health': ['a','b','c','a','b','b'],
                   'factor_1': ['yes','no','no','no','yes','yes'],
                   'factor_2': ['yes','yes','no','no','yes','yes'],
                   'factor_3': ['yes','no','no','yes','yes','yes'],
                   'factor_4': ['yes','yes','no','no','yes','yes'],
                   'factor_5': ['yes','no','yes','no','yes','yes'],
                   'factor_6': ['yes','no','no','no','yes','yes'],
                   'factor_7': ['yes','yes','no','yes','yes','yes'],
                   'factor_8': ['yes','no','yes','no','yes','yes'],
                   'factor_9': ['yes','yes','yes','yes','yes','yes'],
                   })

# Melt dataframe to long format (one row per plant/factor pair)
df = df.melt(id_vars='plant_health',
             value_vars=['factor_1', 'factor_2', 'factor_3', 'factor_4', 'factor_5',
                         'factor_6', 'factor_7', 'factor_8', 'factor_9'])

# Create cross tab
pd.crosstab(df.plant_health, columns=[df.variable, df.value])

I can do the test with one factor but don't know how to expand that to all factors.

from scipy.stats import chi2_contingency

# Example with only the first factor
# (rows: plant_health a, b, c; columns: factor_1 no, yes)
tab_data = [[1, 1], [1, 2], [1, 0]]
chi2_contingency(tab_data)

Solution

  • Try this please and let me know if it's what you expect:

    tab = pd.crosstab(df.plant_health, columns=[df.variable, df.value])
    
    chi2_contingency(tab)
    
    

    Output

    (20.666666666666668,
     0.9387023859836788,
     32,
     array([[1.        , 1.        , 0.66666667, 1.33333333, 0.66666667,
             1.33333333, 0.66666667, 1.33333333, 0.66666667, 1.33333333,
             1.        , 1.        , 0.33333333, 1.66666667, 0.66666667,
             1.33333333, 2.        ],
            [1.5       , 1.5       , 1.        , 2.        , 1.        ,
             2.        , 1.        , 2.        , 1.        , 2.        ,
             1.5       , 1.5       , 0.5       , 2.5       , 1.        ,
             2.        , 3.        ],
            [0.5       , 0.5       , 0.33333333, 0.66666667, 0.33333333,
             0.66666667, 0.33333333, 0.66666667, 0.33333333, 0.66666667,
             0.5       , 0.5       , 0.16666667, 0.83333333, 0.33333333,
             0.66666667, 1.        ]]))
    

    EDIT

    You can also run an individual chi-squared test for each factor with a function like:

    # apply this to the original df (before the melt)
    
    def chi_squared_test(plant_health, factor_n):
    
        tab = pd.crosstab(plant_health, factor_n)
    
        return chi2_contingency(tab)
    
    chi_squared_test(df.plant_health, df.factor_1)
    

    Output

    (1.3333333333333333,
     0.5134171190325922,
     2,
     array([[1. , 1. ],
            [1.5, 1.5],
            [0.5, 0.5]]))
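
To extend the single-factor test to every factor, one option is to loop over the factor columns of the original (un-melted) frame and collect the results in a table. This is only a sketch: the helper name `chi_squared_by_factor` and the result column names are my own and not part of pandas or scipy, and I am assuming that a factor with a single level (such as `factor_9` in the example data) should be skipped, since it has nothing to compare.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Example data from the question (original frame, before melt)
df = pd.DataFrame({'plant_health': ['a', 'b', 'c', 'a', 'b', 'b'],
                   'factor_1': ['yes', 'no', 'no', 'no', 'yes', 'yes'],
                   'factor_2': ['yes', 'yes', 'no', 'no', 'yes', 'yes'],
                   'factor_3': ['yes', 'no', 'no', 'yes', 'yes', 'yes'],
                   'factor_4': ['yes', 'yes', 'no', 'no', 'yes', 'yes'],
                   'factor_5': ['yes', 'no', 'yes', 'no', 'yes', 'yes'],
                   'factor_6': ['yes', 'no', 'no', 'no', 'yes', 'yes'],
                   'factor_7': ['yes', 'yes', 'no', 'yes', 'yes', 'yes'],
                   'factor_8': ['yes', 'no', 'yes', 'no', 'yes', 'yes'],
                   'factor_9': ['yes', 'yes', 'yes', 'yes', 'yes', 'yes'],
                   })


def chi_squared_by_factor(data, target='plant_health'):
    """Run one chi-squared test of independence per factor column.

    Hypothetical helper: returns a DataFrame indexed by factor name.
    """
    rows = []
    for col in data.columns.drop(target):
        tab = pd.crosstab(data[target], data[col])
        # Assumption: skip factors with only one observed level,
        # because the table then has zero degrees of freedom.
        if tab.shape[1] < 2:
            continue
        chi2, p, dof, expected = chi2_contingency(tab)
        rows.append({'factor': col, 'chi2': chi2, 'p_value': p, 'dof': dof})
    return pd.DataFrame(rows).set_index('factor')


results = chi_squared_by_factor(df)
print(results)
```

Each row of `results` is a separate test, so with many factors you may want to adjust the p-values for multiple comparisons. Also keep in mind that with counts as small as in the example data, the chi-squared approximation is not reliable.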