pythonmathcorrelationpearson-correlationpearson

Estimate Pearson correlation coefficient from stream of data


Is there a way to estimate the correlation of two variables if the data is received in chunks without storing the received pairs?

For example, we receive the pairs:

  1. [(x1, y1), (x2, y2), (x3, y3)]

  2. [(x4, y4)]

  3. [(x5, y5), (x6, y6)]

and we have to estimate the correlation between x1:6 and y1:6.

Non-optimal solution:

Even though this definition works: Correlation

it is suboptimal since if we have large values on the stream the squared values will easily overflow.


Solution

  • Yes, this can be computed incrementally. The method is a small generalisation of Welford's algorithm, see here, for example

    You maintain a number of variables, updating them each time data comes in. At each stage these are the mean etc of the data seen so far

    Initialisation:

    int n = 0; // number of points
    double mx = 0.0; // mean of x's
    double my = 0.0; // mean of y's
    double vx = 0.0; // variance of x's
    double vy = 0.0; // variance of y's
    double cxy = 0.0; // covariance of x and y
    

    Update (new values x,y in )

      n += 1;
    double f = 1.0/n;
    double dx = x - mx;
    double dy = y - my;
      mx += f*dx;
      my += f*dy;
      vx = (1.0-f)*(vx + f*dx*dx);
      vy = (1.0-f)*(vy + f*dy*dy);
      cxy= (1.0-f)*(cxy+ f*dx*dy);
    

    In terms of these variables we have

    rxy = cxy/sqrt( vx*vy)
    

    Note though that vx and vy will be zero after just one pair as been seen.

    Don't be surprised if the stream of estimates for rxy is noisy. Estimates of correlation tend to be so.