
Getting inaccurate statistics using scipy


I am conducting a simple independent t-test. I am pulling my data from a CSV file that looks something like this after melting:

Region  Group         variable  value
IPN     Control       MMP-2     123
IPN     Control       MMP-2     456
IPN     Experimental  MMP-2     789
IPN     Experimental  MMP-2     111
IPN     Control       MMP-9     222
IPN     Control       MMP-9     333
IPN     Experimental  MMP-9     444
IPN     Experimental  MMP-9     555

To run a t-test on both the MMP-2 and MMP-9 variables, I first split the original DataFrame into two separate DataFrames and then ran the stats.

mmp2_only = df.query('variable == "MMP-2" ').reset_index(drop=True)
mmp9_only = df.query('variable == "MMP-9" ').reset_index(drop=True)

stat_results = [
stats.ttest_ind(mmp2_only['Group'] == 'Control' , mmp2_only['Group'] == 'STZ'),
stats.ttest_ind(mmp9_only['Group'] == 'Control' , mmp9_only['Group'] == 'STZ ')
]

pvalues = [result.pvalue for result in stat_results]
print("MMP2 \n", stat_results[0], "\n")
print("MMP9 \n", stat_results[1], "\n")

This results in the following output (from my actual data):

MMP2 Ttest_indResult(statistic=2.6457513110645907, pvalue=0.019187621399825557)

MMP9 Ttest_indResult(statistic=0.0, pvalue=1.0)

The issues I'm running into: the p-value and statistic for the MMP-2 variable are inaccurate. I was suspicious, so I double-checked them using Prism (a statistics package) and got completely different values. Additionally, I don't understand why I get a p-value of 1 for the MMP-9 variable.

I have also tried running the stats individually like this:

mmp9_only = new_df.query('variable == "MMP-9" ').reset_index(drop=True)
display(mmp9_only)

stats.ttest_ind(mmp9_only['Group'] == 'Control' , mmp9_only['Group'] == 'STZ')

And I got the following output for BOTH groups:

Ttest_indResult(statistic=2.6457513110645907, pvalue=0.019187621399825557)

I am especially confused because I get a different p-value when I run the tests together than when I run them individually. It also makes no sense to me that I get the same output for both variables when I run them individually, since the values going into the t-test are not the same for MMP-2 and MMP-9. I have verified in Prism that the p-values should be completely different.

I am stuck as to what the issue might be. I have gone through and made sure that I am calling all of my variables correctly and that there are not any syntax errors.


Solution

  • Here's how SciPy is arriving at those values. Let's focus on the first test:

    stats.ttest_ind(mmp2_only['Group'] == 'Control', mmp2_only['Group'] == 'STZ')
    

    Before SciPy is even called, Python evaluates mmp2_only['Group'] == 'Control'. The result is a boolean Series with one True/False entry per row, not the measurements themselves.

    0     True
    1     True
    2    False
    3    False
    Name: Group, dtype: bool
    

    SciPy then converts those booleans to numbers: True becomes 1.0 and False becomes 0.0.

    0    1.0
    1    1.0
    2    0.0
    3    0.0
    Name: Group, dtype: float64
    

    So [1, 1, 0, 0] is treated as the first sample. The second sample is built the same way and comes out as [0, 0, 0, 0], because no row in the example data has the group 'STZ'. In other words, the test compares how many rows match each group label and never looks at the numbers in the value column.
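The difference between testing the masks and testing the measurements can be checked on the MMP-2 sample rows from the question. The following is a minimal sketch, not the asker's real data: the DataFrame is rebuilt by hand from the sample table, the variable names are illustrative, and the masks are cast to float explicitly so the result does not depend on how a particular SciPy version handles boolean input.

```python
import pandas as pd
from scipy import stats

# MMP-2 rows copied by hand from the sample table in the question.
mmp2_only = pd.DataFrame({
    "Group": ["Control", "Control", "Experimental", "Experimental"],
    "value": [123, 456, 789, 111],
})

# Boolean masks cast to float: one 0/1 entry per row, no measurements involved.
control_mask = (mmp2_only["Group"] == "Control").astype(float)            # 1, 1, 0, 0
stz_mask = (mmp2_only["Group"] == "STZ").astype(float)                    # 0, 0, 0, 0
experimental_mask = (mmp2_only["Group"] == "Experimental").astype(float)  # 0, 0, 1, 1

# A label that matches no rows: [1, 1, 0, 0] is compared with all zeros.
mask_vs_missing = stats.ttest_ind(control_mask, stz_mask)

# Two labels that match the same number of rows: the means are equal.
mask_vs_balanced = stats.ttest_ind(control_mask, experimental_mask)

# The intended test: compare the measurements of the two groups.
control_values = mmp2_only.loc[mmp2_only["Group"] == "Control", "value"]
experimental_values = mmp2_only.loc[mmp2_only["Group"] == "Experimental", "value"]
values_result = stats.ttest_ind(control_values, experimental_values)

print("mask vs. missing label: ", mask_vs_missing)
print("mask vs. balanced label:", mask_vs_balanced)
print("actual values:          ", values_result)
```

The first result depends only on how many rows match each label, so it stays the same whatever the value column holds. The second shows one way an output like statistic=0.0, pvalue=1.0 can arise: two masks with the same number of matching rows have identical means. Only the third result uses the measurements.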

    To get the measurements for a specific group, select the value column for the rows where Group matches, for example:

    mmp2_only.loc[mmp2_only['Group'] == 'Control', 'value']
    

    You could also wrap this in a function to make it less error-prone.

    from scipy import stats

    def get_group(df, group_name):
        # Select the measurements (not the boolean mask) for one group.
        subset = df.loc[df['Group'] == group_name, 'value']
        if len(subset) == 0:
            # Fail loudly on a misspelled or missing group label.
            raise ValueError(f"{group_name} not found")
        return subset

    # df is the melted DataFrame from the question.
    mmp2_only = df.query('variable == "MMP-2"').reset_index(drop=True)
    mmp9_only = df.query('variable == "MMP-9"').reset_index(drop=True)

    stat_results = [
        stats.ttest_ind(get_group(mmp2_only, 'Control'), get_group(mmp2_only, 'Experimental')),
        stats.ttest_ind(get_group(mmp9_only, 'Control'), get_group(mmp9_only, 'Experimental')),
    ]

    pvalues = [result.pvalue for result in stat_results]
    print("MMP2 \n", stat_results[0], "\n")
    print("MMP9 \n", stat_results[1], "\n")
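If more variables are added later, the split-and-test steps can be looped instead of written out once per variable. This is a rough sketch, assuming the column names from the question (Group, variable, value). The function name ttest_by_variable and its parameters are made up for illustration, and the data are the sample rows from the question, not real measurements.

```python
import pandas as pd
from scipy import stats

# Sample rows from the question, already melted.
df = pd.DataFrame({
    "Region": ["IPN"] * 8,
    "Group": ["Control", "Control", "Experimental", "Experimental"] * 2,
    "variable": ["MMP-2"] * 4 + ["MMP-9"] * 4,
    "value": [123, 456, 789, 111, 222, 333, 444, 555],
})


def ttest_by_variable(df, group_a="Control", group_b="Experimental"):
    """Run an independent t-test between two groups for each variable (hypothetical helper)."""
    results = {}
    for name, subset in df.groupby("variable"):
        a = subset.loc[subset["Group"] == group_a, "value"]
        b = subset.loc[subset["Group"] == group_b, "value"]
        # An empty sample usually means a misspelled group label.
        if a.empty or b.empty:
            raise ValueError(f"{name}: no rows for {group_a!r} or {group_b!r}")
        results[name] = stats.ttest_ind(a, b)
    return results


results = ttest_by_variable(df)
for name, result in results.items():
    print(name, result.statistic, result.pvalue)
```

When comparing against other software, note that stats.ttest_ind defaults to equal_var=True (Student's t-test with pooled variance). If the other program is set to Welch's test, passing equal_var=False should be the matching option.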