python, scipy, pearson, pearson-correlation

Homemade Pearson correlation implementation returning 0.999...2 when passed the same set of data twice


I was getting fed up with scipy and numpy, so I decided to write my own implementation, based on an SO answer I found somewhere.

from statistics import pstdev, mean

def pearson(x, y):
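    # standardized (z-scored) copies of x and y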
    sx = []
    sy = []

    mx = mean(x)
    my = mean(y)

    stdx = pstdev(x)
    stdy = pstdev(y)

    for i in x:
        sx.append((i - mx) / stdx)

    for j in y:
        sy.append((j - my) / stdy)

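    # Pearson's r: the mean of the products of the paired z-scores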
    return sum([i * j for i, j in zip(sx, sy)]) / len(x)

I passed a few sets of numbers into it to check that it gave the same results as scipy.stats.pearsonr, and it seemed to be fine. A digit or so towards the end was different, but nothing groundbreaking...

Until I tried passing the same set of data as both x and y. When I did, it returned 0.9999999999999992, while scipy and numpy both say it's 1.0.

Is there something wrong with this implementation? I'm using the population standard deviation rather than the sample one, and as far as I'm aware, both numpy and scipy do the same. I'm wondering why this isn't returning 1.0 as it should. Could it be floating-point issues in Python itself? I've tried it in both Python 2 and 3, and I get 0.999... in both.

If needed, the set of data I passed into it was:

[7, 1, 5, 1, 8, 5, 9, 8, 5, 10, 5, 8, 1, 8, 8, 8, 10, 4, 8, 9, 9, 6, 8, 7, 8, 5, 10, 5, 6, 8, 8, 7, 9, 4, 6, 10, 7, 10, 4, 5, 4, 7, 4, 8, 9, 10, 9, 8, 7, 8, 6, 8, 6, 6, 5, 7, 7, 7, 7, 3, 7, 8, 6, 8, 5, 7, 8, 7, 8, 6, 8, 6, 9, 6, 6, 6, 8, 9, 5, 7, 9, 2, 9, 6, 7, 6, 7, 7, 5, 5, 7, 7, 8, 6, 9, 1, 3, 6, 7, 9, 7, 7, 6, 9, 9, 4, 9, 9, 7, 9, 6, 2, 2, 8, 4, 7, 7, 6, 3, 7, 3, 5, 10, 9, 8, 10, 8, 7, 4, 7, 8, 9, 8, 4, 7, 9, 7, 7, 6, 8, 8, 9, 9, 7, 4, 4, 7, 3, 9, 3, 1, 8, 3, 9, 4, 8, 3, 9, 8, 8, 7, 9, 9, 8, 10, 8, 3, 10, 4, 7, 7, 10, 8, 7, 8, 7, 1, 8, 9, 5, 7, 5, 5, 3, 5, 7, 7, 7, 2, 4, 1, 6, 9, 9, 7, 7, 10, 9, 2, 9, 8, 2, 5, 1, 2, 5, 9, 1, 4, 8, 9, 6, 4, 4, 7, 3, 7, 9, 4, 3, 7, 8, 7, 6, 8, 8, 7]
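
For completeness, this is roughly how I'm comparing the two implementations (with data holding the list above):

from scipy.stats import pearsonr

# `data` holds the full list shown above
print(pearson(data, data))   # 0.9999999999999992 for me
r, _ = pearsonr(data, data)
print(r)                     # 1.0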


Solution

  • Your expectations about floating-point behavior are way too optimistic. With experience, you wouldn't be surprised a bit that the result isn't exactly 1.0. For example, try this much smaller input instead:

    [7, 1, 5]
    

    On my box, your function returns 1.0000000000000002. "Close to" 1.0, but not exactly 1.0. That's the best you can hope for, in general.

    For a good clue about why, think about what this "should" compute:

    math.sqrt(x)**2 == x
    

    Mathematically (working in infinite precision), that should always return True. But in floating-point (no matter how much precision is used, provided only that the precision is bounded), it's impossible for it to always be true. In fact, counterexamples are very easy to find; like, on my box just now:

    >>> math.sqrt(2)**2
    2.0000000000000004
    

    The problem is that, with finite precision, sqrt() is necessarily a many-to-one function. It squashes the domain 1..N into the range 1..sqrt(N), and with finite precision the cardinality of the domain is larger than the cardinality of the range. Therefore there must exist distinct x and y in the domain that map to the same value in the range, so there is no exact functional inverse.
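
    If you want to see a concrete collision, here's a minimal sketch (it uses math.nextafter, so it needs Python 3.9+; on older versions you can hard-code the neighbouring float). On a typical IEEE-754 double build:

    import math

    x = 4.0
    y = math.nextafter(x, math.inf)  # the very next representable double above 4.0

    print(x == y)                        # False: two distinct floats...
    print(math.sqrt(x) == math.sqrt(y))  # True: ...with the same square root (both 2.0)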

    Your function is more complicated than a plain sqrt, but the same principles are at work.
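
    So rather than testing your result with ==, compare it with a tolerance; math.isclose() is the simplest tool for that. A minimal sketch, assuming pearson and data are the function and list from your question:

    import math

    r = pearson(data, data)

    print(r == 1.0)              # False: you got 0.9999999999999992
    print(math.isclose(r, 1.0))  # True: equal within isclose's default 1e-9 relative tolerance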