I have a dataframe book_matrix
with users as rows, books as columns, and ratings as values. When I use corrwith()
to compute the correlation between 'The Lord of the Rings' and 'The Silmarillion' the result is 1.0
, but the values are clearly different.
The non-null values [10, 3] and [10, 9] have correlation 1.0
. I would expect them to be exactly the same when the correlation is equal to one. How can this happen?
Correlation means the values have a certain relationship with one another, for example linear combination of factors. Here's an illustration:
import pandas as pd
df1 = pd.DataFrame({"A":[1, 2, 3, 4],
"B":[5, 8, 4, 3],
"C":[10, 4, 9, 3]})
df2 = pd.DataFrame({"A":[2, 4, 6, 8],
"B":[-5, -8, -4, -3],
"C":[4, 3, 8, 5]})
df1.corrwith(df2, axis=0)
A 1.000000
B -1.000000
C 0.395437
dtype: float64
So you can see that [1, 2, 3, 4]
and [2, 4, 6, 8]
have correlation 1.0
The next column [5, 8, 4, 3]
and [-5, -8, -4, -3]
have extreme negative correlation -1.0
In the last column, [10, 4, 9, 3]
and [4, 3, 8, 5]
are somewhat correlated 0.395437
, because both exhibits high-low-high-low sequence but with varying vertical scaling factors.
So in your case both books 'The Lord of the Rings' and 'The Silmarillion' only has 2 ratings each, and both ratings are having high-low sequence. Even if I illustrate with more data points, they have the same vertical scaling factor.
df1 = pd.DataFrame({"A": [10, 3, 10, 3, 10, 3],
"B": [10, 3, 10, 3, 10, 3]})
df2 = pd.DataFrame({"A": [10, 9, 10, 9, 10, 9],
"B": [10, 10, 10, 9, 9, 9]})
df1.corrwith(df2, axis=0)
A 1.000000
B 0.333333
dtype: float64
So you can see that [10, 3, 10, 3, 10, 3]
and [10, 9, 10, 9, 10, 9]
are also correlated perfectly at 1.0
.
But if I rearrange the sequence a little, [10, 3, 10, 3, 10, 3]
and [10, 10, 10, 9, 9, 9]
are not perfectly correlated anymore at 0.333333
So going forward, you need more data, and more variations in the data! Hope that helps 😎