Is there a way to estimate the correlation of two variables if the data is received in chunks without storing the received pairs?
For example, we receive the pairs:
[(x1, y1), (x2, y2), (x3, y3)]
[(x4, y4)]
[(x5, y5), (x6, y6)]
and we have to estimate the correlation between x1:6 and y1:6.
Non-optimal solution:
Even though this definition works:
it is suboptimal since if we have large values on the stream the squared values will easily overflow.
Yes, this can be computed incrementally. The method is a small generalisation of Welford's algorithm, see here, for example
You maintain a number of variables, updating them each time data comes in. At each stage these are the mean etc of the data seen so far
Initialisation:
int n = 0; // number of points
double mx = 0.0; // mean of x's
double my = 0.0; // mean of y's
double vx = 0.0; // variance of x's
double vy = 0.0; // variance of y's
double cxy = 0.0; // covariance of x and y
Update (new values x,y in )
n += 1;
double f = 1.0/n;
double dx = x - mx;
double dy = y - my;
mx += f*dx;
my += f*dy;
vx = (1.0-f)*(vx + f*dx*dx);
vy = (1.0-f)*(vy + f*dy*dy);
cxy= (1.0-f)*(cxy+ f*dx*dy);
In terms of these variables we have
rxy = cxy/sqrt( vx*vy)
Note though that vx and vy will be zero after just one pair as been seen.
Don't be surprised if the stream of estimates for rxy is noisy. Estimates of correlation tend to be so.