I have some problems with my result:
dataCorr = data.corr(method='pearson')
dataCorr = dataCorr[abs(dataCorr) >= 0.7].stack().reset_index()
dataCorr = dataCorr[dataCorr.level_0!=dataCorr.level_1]
From my correlation matrix:
dataCorr = data.corr(method='pearson')
I convert this matrix to columns:
dataCorr = dataCorr[abs(dataCorr) >= 0.7].stack().reset_index()
Then I remove the diagonal of the matrix:
dataCorr = dataCorr[dataCorr.level_0!=dataCorr.level_1]
But I still have duplicate pairs:
level_0 level_1 0
LiftPushSpeed RT1EntranceSpeed 0.881714
RT1EntranceSpeed LiftPushSpeed 0.881714
How can I avoid this problem?
You can convert the lower triangle of values (including the diagonal) to NaNs, and stack then removes them:
import numpy as np
import pandas as pd

np.random.seed(12)
data = pd.DataFrame(np.random.randint(20, size=(5,6)))
print (data)
0 1 2 3 4 5
0 11 6 17 2 3 3
1 12 16 17 5 13 2
2 11 10 0 8 12 13
3 18 3 4 3 1 0
4 18 18 16 6 13 9
dataCorr = data.corr(method='pearson')
# np.bool was removed from recent NumPy releases; the builtin bool works everywhere
dataCorr = dataCorr.mask(np.tril(np.ones(dataCorr.shape)).astype(bool))
print (dataCorr)
0 1 2 3 4 5
0 NaN 0.042609 -0.041656 -0.113998 -0.173011 -0.201122
1 NaN NaN 0.486901 0.567216 0.914260 0.403469
2 NaN NaN NaN -0.412853 0.157747 -0.354012
3 NaN NaN NaN NaN 0.823628 0.858918
4 NaN NaN NaN NaN NaN 0.635730
5 NaN NaN NaN NaN NaN NaN
# for your data, change 0.5 to 0.7
dataCorr = dataCorr[abs(dataCorr) >= 0.5].stack().reset_index()
print (dataCorr)
level_0 level_1 0
0 1 3 0.567216
1 1 4 0.914260
2 3 4 0.823628
3 3 5 0.858918
4 4 5 0.635730
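The steps above can be wrapped in one helper. This is only a minimal sketch: the function name high_corr_pairs, the column names col_a, col_b and corr, and the toy frame demo are made up for illustration and are not part of the original data. The explicit dropna() is there on the assumption that some pandas versions keep NaN rows after stack().

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.7):
    """Return each column pair once, keeping pairs with |Pearson r| >= threshold."""
    corr = df.corr(method='pearson')
    # True on and below the diagonal: these cells get replaced by NaN
    hide = np.tril(np.ones(corr.shape, dtype=bool))
    upper = corr.mask(hide)
    # Comparisons against NaN are False, so hidden cells stay NaN after filtering
    strong = upper[upper.abs() >= threshold]
    # dropna() is explicit in case stack() keeps NaN rows (assumption about pandas version)
    pairs = strong.stack().dropna().reset_index()
    pairs.columns = ['col_a', 'col_b', 'corr']
    return pairs


# Hypothetical toy data: b is a multiple of a, c is only weakly related to both
demo = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                     'b': [2, 4, 6, 8, 10],
                     'c': [3, 1, 5, 2, 4]})
result = high_corr_pairs(demo, threshold=0.7)
print(result)
```

With this toy data only the pair (a, b) should survive, and it appears once rather than twice.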
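Detail (the mask is built from the shape of the 6x6 correlation matrix, because dataCorr now holds the stacked result):
print (np.tril(np.ones(data.corr().shape)))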
[[ 1. 0. 0. 0. 0. 0.]
[ 1. 1. 0. 0. 0. 0.]
[ 1. 1. 1. 0. 0. 0.]
[ 1. 1. 1. 1. 0. 0.]
[ 1. 1. 1. 1. 1. 0.]
[ 1. 1. 1. 1. 1. 1.]]
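To make the mask easier to read, here is a small sketch that builds it directly as booleans and compares it with its complement, the strict upper triangle. The size n and the names hide and keep are chosen only for this illustration.

```python
import numpy as np

n = 4  # hypothetical matrix size, for illustration only

# True on and below the diagonal: the cells that mask() turns into NaN
hide = np.tril(np.ones((n, n), dtype=bool))

# True strictly above the diagonal (k=1 skips the diagonal): the cells that are kept
keep = np.triu(np.ones((n, n), dtype=bool), k=1)

print(hide)
print(np.array_equal(keep, ~hide))
```

Because keep is the exact complement of hide, dataCorr.where(keep) should give the same result as dataCorr.mask(hide), so either form can be used.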