python pandas machine-learning pearson-correlation

.corr() on a DataFrame returns only -1 or 1 instead of values in between


Ideally, it should return values between -1 and 1 for every cell, except for the cells where the row name and the column name are the same; those should be 1.

I tried replacing the NaN values with 0 before calling corr(). That returns values in the expected range, but they are inaccurate for the purpose of the program.

# df
            MovieA    MovieB    MovieC    MovieD  MovieE
Angee     0.000000       NaN -0.500000  0.500000     NaN
Anirvesh  1.166667 -0.333333 -0.833333       NaN     NaN
Jay       1.166667 -0.333333       NaN -0.833333     NaN
Karthik   0.000000 -1.500000       NaN       NaN     1.5
Naman          NaN  0.250000       NaN -0.250000     NaN

# df.T.corr()
          Angee  Anirvesh  Jay  Karthik  Naman
Angee       1.0       1.0 -1.0      NaN    NaN
Anirvesh    1.0       1.0  1.0      1.0    NaN
Jay        -1.0       1.0  1.0      1.0    1.0
Karthik     NaN       1.0  1.0      1.0    NaN
Naman       NaN       NaN  1.0      NaN    1.0

Solution

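  • The NaNs are correct: they are returned when the correlation cannot be computed because a pair of rows does not share at least two non-NaN values. Every other pair here shares exactly two values, and the Pearson correlation of two points is necessarily 1 or -1, because two points always lie on a straight line. That is why no intermediate values appear.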

    Filling the NaNs before the computation indeed doesn't make sense, as this adds fake data points that are then used to compute the correlation.

    What you could do is fillna with 0 after the computation if you really don't want NaNs:

    out = df.T.corr().fillna(0)
    

    Output:

              Angee  Anirvesh  Jay  Karthik  Naman
    Angee       1.0       1.0 -1.0      0.0    0.0
    Anirvesh    1.0       1.0  1.0      1.0    0.0
    Jay        -1.0       1.0  1.0      1.0    1.0
    Karthik     0.0       1.0  1.0      1.0    0.0
    Naman       0.0       0.0  1.0      0.0    1.0
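
To see where the 1, -1, and NaN values come from, it can help to count how many movies each pair of users has both rated. The following is only a sketch: it rebuilds the DataFrame by hand from the values printed in the question (so they are rounded), and the names `rated`, `overlap`, and `strict` are illustrative, not part of the original code. It also shows the `min_periods` argument of `DataFrame.corr`, which may be a reasonable way to mask correlations that rest on too few shared ratings; the threshold of 3 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Rebuild the DataFrame from the question (values as printed there, i.e. rounded).
df = pd.DataFrame(
    {
        "MovieA": [0.0, 1.166667, 1.166667, 0.0, np.nan],
        "MovieB": [np.nan, -0.333333, -0.333333, -1.5, 0.25],
        "MovieC": [-0.5, -0.833333, np.nan, np.nan, np.nan],
        "MovieD": [0.5, np.nan, -0.833333, np.nan, -0.25],
        "MovieE": [np.nan, np.nan, np.nan, 1.5, np.nan],
    },
    index=["Angee", "Anirvesh", "Jay", "Karthik", "Naman"],
)

# 1 where a user rated a movie, 0 where the rating is missing.
rated = df.notna().astype(int)

# users x users matrix: number of movies both users have rated.
overlap = rated @ rated.T
print(overlap)

# Same correlation as in the question; pairs with two shared ratings give +/-1,
# pairs with fewer than two give NaN.
base = df.T.corr()

# Illustrative threshold: only report a correlation when a pair shares
# at least 3 ratings; everything below that becomes NaN.
strict = df.T.corr(min_periods=3)
print(strict)
```

In this data no pair of different users shares more than two ratings, so with a threshold of 3 every off-diagonal entry of `strict` is NaN. Whether to then fill those entries with 0 or leave them as NaN depends on how the downstream code treats "unknown" similarity.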