python scipy normal-distribution unspecified-behavior

Understanding scipy's shapiro behavior when passing a two-column DataFrame instead of a single column


I have a DataFrame similar to this:

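import pandas as pd
from scipy import stats

# 15 observations, alternating between group 1 and group 3
codes = [1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1]
values = [702, 713, 701, 721, 705, 715, 703, 712, 706, 710, 702, 715, 698, 718, 704]
df = pd.DataFrame(list(zip(codes, values)), columns=['code', 'val'])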

>>>

   code val
0   1   702
1   3   713
2   1   701
3   3   721
4   1   705
5   3   715
6   1   703
7   3   712
8   1   706
9   3   710
10  1   702
11  3   715
12  1   698
13  3   718
14  1   704

I want to check whether there is a significant difference between the values of group 1 and group 3. As a first step, I used scipy's Shapiro-Wilk test to check whether the data in each group is normally distributed.

In my original code I did something that I believe is a mistake:

shapiro1 = stats.shapiro(df[df['code'] == 1])
>>>
ShapiroResult(statistic=0.6468859314918518, pvalue=4.644487489713356e-05)

shapiro3 = stats.shapiro(df[df['code'] == 3])
>>>
ShapiroResult(statistic=0.6508359909057617, pvalue=0.00011963312863372266)

As you can see, I filtered the rows by code but did not select the 'val' column, so I passed a DataFrame with a single code value and two columns to the test.

Then I did something that I believe is the fix:

stats.shapiro(df[df['code'] == 3]['val'])
>>>
ShapiroResult(statistic=0.967737078666687, pvalue=0.8816877007484436)

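With a p-value of about 0.88 the test does not reject normality, so the values of group 3 look consistent with a normal distribution, unlike what the first results suggested.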

When I print what I originally passed to shapiro:

df[df['code'] == 3]

I get a DataFrame with two columns. What does the test check in that case? The distribution of the 'code' column? Some mix of both columns?

My question:
What does the Shapiro test actually check when I pass it a two-column DataFrame?

Edit: I was also able to add more columns (filled with random numbers) and run the Shapiro test on the resulting DataFrame without an error.


Solution

  • From the source on GitHub, the first thing that happens when you call stats.shapiro() is that the input is passed to numpy.ravel(). This returns a view (if possible) or a copy of your data as a flattened, contiguous, 1-D array.

    Basically, it puts all the columns into one big, long bucket and then calculates the Shapiro-Wilk test on that. In your case the 'code' and 'val' entries were mixed together into a single sample, which is why the test rejected normality so strongly.
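
The flattening described above can be checked directly. The following is a minimal sketch, assuming the 15-row data from the question and a SciPy version in which shapiro flattens its input by default. The variable names (group3, flattened, res_frame, res_flat, res_val) are only illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Data from the question (15 rows, alternating group 1 and group 3)
codes = [1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1]
values = [702, 713, 701, 721, 705, 715, 703, 712, 706, 710, 702, 715, 698, 718, 704]
df = pd.DataFrame({"code": codes, "val": values})

# Rows of group 3, still carrying both columns ('code' and 'val')
group3 = df[df["code"] == 3]

# Flatten the two columns row by row into one 1-D array:
# code, val, code, val, ...
flattened = np.ravel(group3.to_numpy())
print(flattened)

# Passing the two-column frame gives the same statistic as passing
# the flattened array, i.e. codes and values are tested as one sample.
res_frame = stats.shapiro(group3)
res_flat = stats.shapiro(flattened)
print(np.isclose(res_frame.statistic, res_flat.statistic))

# Selecting the 'val' column first tests only the measurements.
res_val = stats.shapiro(group3["val"])
print(res_frame.pvalue, res_val.pvalue)
```

If the two statistics agree, the two-column call was testing the interleaved codes and values rather than either column on its own, so selecting the column before calling shapiro is the safer habit.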