I have a dataframe similar to this:
codes = [1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1]
values = [702, 713, 701, 721, 705, 715, 703, 712, 706, 710, 702, 715, 698, 718, 704]
df = pd.DataFrame(list(zip(codes, values)),
                  columns=['code', 'val'])
>>>
code val
0 1 702
1 3 713
2 1 701
3 3 721
4 1 705
5 3 715
6 1 703
7 3 712
8 1 706
9 3 710
10 1 702
11 3 715
12 1 698
13 3 718
14 1 704
I want to check whether there is a significant difference between the values of group 1 and group 3. For that I used scipy's Shapiro test to check if the data is normally distributed.
In my original code I did something I believe is a mistake:
shapiro1 = stats.shapiro(df[df['code'] == 1])
>>>
ShapiroResult(statistic=0.6468859314918518, pvalue=4.644487489713356e-05)
shapiro3 = stats.shapiro(df[df['code'] == 3])
>>>
ShapiroResult(statistic=0.6508359909057617, pvalue=0.00011963312863372266)
As you can see, I only filtered the dataframe by code and did not select the values, so I passed Shapiro a dataframe with a single code value but both columns.
Then I did what I believe is the fix:
stats.shapiro(df[df['code'] == 3]['val'])
>>>
ShapiroResult(statistic=0.967737078666687, pvalue=0.8816877007484436)
so the data does appear to be normally distributed (p > 0.05, so we fail to reject normality).
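For reference, here is a minimal sketch of the fixed workflow for both groups, selecting only the `val` column before testing. The final `ttest_ind` step is an assumption about the eventual group comparison, not something from the original code:

```python
import pandas as pd
from scipy import stats

codes = [1, 3] * 7 + [1]
values = [702, 713, 701, 721, 705, 715, 703, 712, 706, 710, 702, 715, 698, 718, 704]
df = pd.DataFrame({'code': codes, 'val': values})

# Test each group's values (a single column) for normality
shapiro1 = stats.shapiro(df.loc[df['code'] == 1, 'val'])
shapiro3 = stats.shapiro(df.loc[df['code'] == 3, 'val'])
print(shapiro1.pvalue, shapiro3.pvalue)  # large p-values: no evidence against normality

# If both groups look normal, a two-sample t-test compares their means
t = stats.ttest_ind(df.loc[df['code'] == 1, 'val'],
                    df.loc[df['code'] == 3, 'val'])
print(t.pvalue)
```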
When I print the part I passed to Shapiro in the original calls:
df[df['code'] == 3]
I get a dataframe with two columns. What does Shapiro check in that case? The distribution of the codes? Some mix of the two columns?
My question:
What does the Shapiro test check when I pass it a two-column dataframe?
Edit: I was also able to add more columns (filled with random numbers) and run the Shapiro test on them.
From the source on GitHub, the first thing that happens on calling stats.shapiro() is that the input is passed to numpy.ravel(), which returns a view (if possible) or a copy of the data as a flattened, contiguous, 1-D array. Basically, it puts all the columns into one big, long bucket and then calculates the Shapiro-Wilk test on that.
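You can verify this flattening directly: calling shapiro on the two-column selection gives the same result as calling it on the explicitly raveled array, where codes and values are interleaved (a minimal sketch reproducing the data above):

```python
import numpy as np
import pandas as pd
from scipy import stats

codes = [1, 3] * 7 + [1]
values = [702, 713, 701, 721, 705, 715, 703, 712, 706, 710, 702, 715, 698, 718, 704]
df = pd.DataFrame({'code': codes, 'val': values})

sub = df[df['code'] == 3]            # two-column DataFrame (code, val)
two_col = stats.shapiro(sub)         # shapiro flattens the input internally
flat = stats.shapiro(np.ravel(sub))  # explicit flatten: [3, 713, 3, 721, ...]

# Both calls test the same interleaved 1-D array, so the results match
print(two_col.statistic, flat.statistic)
print(two_col.pvalue, flat.pvalue)
```

The mix of small codes (3) and large values (~700s) in one array is why the mistaken calls produced such extreme statistics and tiny p-values.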