After reading several topics such as this, this and this, I am still confused about how to tackle the following issue. I have a bunch of files I would like to filter based on the correlation (Pearson) of their rows (eg. using a threshold greater than +0.9). For instance, two files:
file1.txt -> 15 columns and 4 rows
file2.txt -> 16 columns and 10 rows
So, the first row in file1.txt
is correlated with all rows in file2.txt
, and then it continues sequentially with the next rows in file1.txt
. One caveat here is that only values between column 1 and column 14 are considered for correlation (ie. column 15 in file1.txt
and columns 15 and 16 in file2.txt
are not taken into account). If correlated rows give a correlation greater than +0.9, then it prints (or saves in a new file) the value from column 15 in file1.txt
and values from columns 15 and 16 in file2.txt
, so:
output file -> 3 columns and number of rows depend on correlation values > +0.9
I was playing with this code to understand how to do the correlation but no luck so far:
awk '{
a = 0; for (i = 1; i <= NF; ++i) a += $i; a /= NF-1
b = 0; for (i = 1; i <= NF; ++i) b += ($i - a) ^ 2; b = sqrt(b)
if (b <= 0) next
for (i = 1; i <= NF; ++i) x[NR, i] = ($i - a) / b
n[NR] = $1
for (i = 1; i <= NR; ++i) {
if (!(i in n)) continue
a = 0
for (k = 1; k <= NF; ++k)
a += x[NR, k] * x[i, k]
print n[NR], n[i], a
}}' << eof
r1 1 2 3 4 5 6 7 8 9 10 11 12 13 14
r2 1 2 3 4 5 6 7 8 9 10 11 12 13 14
r3 1 2 3 4 5 6 7 8 9 10 11 12 13 14
eof
it obviously gives 1 indicating the correlated rows:
r1 r1 1
r2 r1 1
r2 r2 1
r3 r1 1
r3 r2 1
r3 r3 1
Not sure how to adapt/modify/improve this piece of code to use two separated files as well as achieving what I would like to obtain as result, any help is much appreciated.
Some real values of the files I have are posted below:
real_file1.txt (15 columns and 4 rows):
17336.2 38352.3 23.3161 16838.3 4127.13 16173 20.6492 17695.3 16673.1 85.8239 26.0635 17247.9 24.639 0.0138022 110.0MN
16684.4 38376.4 24.5148 16838.3 4127.13 16173 20.6492 17695.3 16673.1 85.8239 26.0635 17247.9 24.639 0.0138022 66.21622MN
17336.2 38352.3 23.3161 16838.3 4127.13 16173 20.6492 17695.3 16673.1 85.8239 26.0635 17247.9 24.639 0.0138022 60.0MN
17336.2 38352.3 23.3161 16838.3 4127.13 16173 20.6492 17695.3 16673.1 85.8239 26.0635 17247.9 24.639 0.0138022 20.0MN
real_file2.txt (16 columns and 10 rows):
20985.4 38466.8 57.9789 27394.6 941.422 10109.7 11.326 6077.4 3729.51 87.5948 25.9344 4477.65 23.4117 0.00158122 -84.4166666667 -33.33333333333
21020 38438.3 60.0672 27570.8 1268.44 9737.09 10.7524 7888.51 3609.84 87.7387 25.9116 5785.01 23.3214 0.00191128 -84.3333333333 -33.33333333333
21138.4 38396.4 62.0438 27570.8 1268.44 9737.09 10.7524 7888.51 3609.84 87.7387 25.9116 5785.01 23.3214 0.00191128 -84.25 -33.33333333333
21250.5 38333.5 61.3351 27570.8 1268.44 9737.09 10.7524 7888.51 3609.84 87.7387 25.9116 5785.01 23.3214 0.00191128 -84.1666666667 -33.33333333333
21294.3 38261 59.1823 27703.4 1589.07 9294.17 10.2936 9471.09 3462.93 87.8499 25.8908 7098.47 23.2781 0.00219644 -84.0833333333 -33.33333333333
21263.2 38208.1 60.0892 27703.4 1589.07 9294.17 10.2936 9471.09 3462.93 87.8499 25.8908 7098.47 23.2781 0.00219644 -84.0 -33.33333333333
21208.9 38186.3 63.612 27703.4 1589.07 9294.17 10.2936 9471.09 3462.93 87.8499 25.8908 7098.47 23.2781 0.00219644 -83.9166666667 -33.33333333333
21165 38178.8 67.6346 27192.5 1842.54 9394.1 10.2802 10428.4 3771.67 87.8443 25.8791 8149.04 23.2957 0.00247546 -83.8333333333 -33.33333333333
21131.9 38183.1 72.5871 27192.5 1842.54 9394.1 10.2802 10428.4 3771.67 87.8443 25.8791 8149.04 23.2957 0.00247546 -83.75 -33.33333333333
21096.6 38187.6 72.6827 27192.5 1842.54 9394.1 10.2802 10428.4 3771.67 87.8443 25.8791 8149.04 23.2957 0.00247546 -83.6666666667 -33.33333333333
for instance, expected result based on both files (not tested with correlation, just file structure as example):
110.0MN -83.6666666667 -3.33333333333
20.0MN -84.25 -3.33333333333
Here a MWE composed of rows from the examples previously mentioned, where:
file1.txt
is correlated with the 2 rows of file2.txt
and the result.txt
is created (ie. correlation > 0.9 for this case, because it is the same row for testing purposes, so XOR is 1).file1.txt
is correlated with the 2 rows of file2.txt
and no output is generated (ie. I am assuming here that correlation is < 0.9 for all correlated rows).file1.txt:
17336.2 38352.3 23.3161 16838.3 4127.13 16173 20.6492 17695.3 16673.1 85.8239 26.0635 17247.9 24.639 0.0138022 110.0MN
16684.4 38376.4 24.5148 16838.3 4127.13 16173 20.6492 17695.3 16673.1 85.8239 26.0635 17247.9 24.639 0.0138022 66.21622MN
file2.txt
21096.6 38187.6 72.6827 27192.5 1842.54 9394.1 10.2802 10428.4 3771.67 87.8443 25.8791 8149.04 23.2957 0.00247546 -83.6666666667 -33.33333333333
17336.2 38352.3 23.3161 16838.3 4127.13 16173 20.6492 17695.3 16673.1 85.8239 26.0635 17247.9 24.639 0.0138022 -80.7777777777 -30.22222222222
result.txt
110.0MN -80.7777777777 -30.22222222222
With any POSIX awk
:
$ cat foo.awk
function stat( sum, sum2, i) {
sum = sum2 = 0
for(i = 1; i <= max; i++) {
sum += $i
sum2 += $i * $i
}
avg = sum / max
std = sqrt(sum2 / max - avg * avg)
}
function pcc(i, sum_ab, j) {
sum_ab = 0;
for(j = 1; j <= max; j++) sum_ab += a[i,j] * $j
return (sum_ab / max - avg_a[i] * avg) / (std_a[i] * std)
}
BEGIN {
max = 14
}
NR == FNR {
cnt += 1
stat()
for(i = 1; i <= NF; i++) a[cnt,i] = $i
avg_a[cnt] = avg
std_a[cnt] = std
next
}
{
stat()
if(std == 0) next
for(i = 1; i <= cnt; i++) {
if(std_a[i] == 0) continue
if(pcc(i) > 0.9) print a[i,max+1], $(max+1), $(max+2);
}
}
$ cat file1.txt
17336.2 38352.3 23.3161 16838.3 4127.13 16173 20.6492 17695.3 16673.1 85.8239 26.0635 17247.9 24.639 0.0138022 110.0TN
16684.4 38376.4 24.5148 16838.3 4127.13 16173 20.6492 17695.3 16673.1 85.8239 26.0635 17247.9 24.639 0.0138022 66.21622TN
$ cat file2.txt
21096.6 38187.6 72.6827 27192.5 1842.54 9394.1 10.2802 10428.4 3771.67 87.8443 25.8791 8149.04 23.2957 0.00247546 -83.6666666667 -33.33333333333
17336.2 38352.3 23.3161 16838.3 4127.13 16173 20.6492 17695.3 16673.1 85.8239 26.0635 17247.9 24.639 0.0138022 -80.7777777777 -30.22222222222
$ awk -f foo.awk file1.txt file2.txt
110.0TN -80.7777777777 -30.22222222222
66.21622TN -80.7777777777 -30.22222222222
max
is the size of your samples. It is set to 14 in the BEGIN
block.
Pseudo 2D matrix a
stores the values of the first file: a[i,j]
is column j
of row i
. It is populated by the second block (NR == FNR
) while parsing the first file. The second block also uses function stat
to compute the average (avg
) and the standard deviation (std
) of the sample (the max
first fields of each row). They are stored in arrays avg_a
and std_a
, respectively.
The third block parses the rows of the second file. For each row, function stat
computes the average (avg
) and the standard deviation (std
) of the sample, and function pcc
is called to compute the Pearson Correlation Coefficient (PCC) of the row with each row of matrix a
. If it is strictly larger than 0.9 the max+1
-th column of the correlated row of a
, and columns max+1
and max+2
of current row are printed.