bashawktext

Filtering rows between two files based on correlation in awk


After reading several topics such as this, this and this, I am still confused about how to tackle the following issue. I have a bunch of files I would like to filter based on the correlation (Pearson) of their rows (eg. using a threshold greater than +0.9). For instance, two files:

file1.txt -> 15 columns and 4 rows

file2.txt -> 16 columns and 10 rows

So, the first row in file1.txt is correlated with all rows in file2.txt, and then it continues sequentially with the next rows in file1.txt. One caveat here is that only values between column 1 and column 14 are considered for correlation (ie. column 15 in file1.txt and columns 15 and 16 in file2.txt are not taken into account). If correlated rows give a correlation greater than +0.9, then it prints (or saves in a new file) the value from column 15 in file1.txt and values from columns 15 and 16 in file2.txt, so:

output file -> 3 columns and number of rows depend on correlation values > +0.9

I was playing with this code to understand how to do the correlation but no luck so far:

 awk '{
  a = 0; for (i = 1; i <= NF; ++i) a += $i; a /= NF-1
  b = 0; for (i = 1; i <= NF; ++i) b += ($i - a) ^ 2; b = sqrt(b)
  if (b <= 0) next
  for (i = 1; i <= NF; ++i) x[NR, i] = ($i - a) / b
  n[NR] = $1
  for (i = 1; i <= NR; ++i) {
    if (!(i in n)) continue
    a = 0
    for (k = 1; k <= NF; ++k)
      a += x[NR, k] * x[i, k]
    print n[NR], n[i], a
  }}' << eof
r1 1 2 3 4 5 6 7 8 9 10 11 12 13 14
r2 1 2 3 4 5 6 7 8 9 10 11 12 13 14
r3 1 2 3 4 5 6 7 8 9 10 11 12 13 14
eof

it obviously gives 1 indicating the correlated rows:

r1 r1 1
r2 r1 1
r2 r2 1
r3 r1 1
r3 r2 1
r3 r3 1

Not sure how to adapt/modify/improve this piece of code to use two separated files as well as achieving what I would like to obtain as result, any help is much appreciated.

Some real values of the files I have are posted below:

real_file1.txt (15 columns and 4 rows):

17336.2  38352.3  23.3161  16838.3  4127.13  16173  20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639  0.0138022  110.0MN
16684.4  38376.4  24.5148  16838.3  4127.13  16173  20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639  0.0138022  66.21622MN
17336.2  38352.3  23.3161  16838.3  4127.13  16173  20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639  0.0138022  60.0MN
17336.2  38352.3  23.3161  16838.3  4127.13  16173  20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639  0.0138022  20.0MN

real_file2.txt (16 columns and 10 rows):

20985.4  38466.8  57.9789  27394.6  941.422  10109.7  11.326   6077.4   3729.51  87.5948  25.9344  4477.65  23.4117  0.00158122  -84.4166666667  -33.33333333333
21020    38438.3  60.0672  27570.8  1268.44  9737.09  10.7524  7888.51  3609.84  87.7387  25.9116  5785.01  23.3214  0.00191128  -84.3333333333  -33.33333333333
21138.4  38396.4  62.0438  27570.8  1268.44  9737.09  10.7524  7888.51  3609.84  87.7387  25.9116  5785.01  23.3214  0.00191128  -84.25          -33.33333333333
21250.5  38333.5  61.3351  27570.8  1268.44  9737.09  10.7524  7888.51  3609.84  87.7387  25.9116  5785.01  23.3214  0.00191128  -84.1666666667  -33.33333333333
21294.3  38261    59.1823  27703.4  1589.07  9294.17  10.2936  9471.09  3462.93  87.8499  25.8908  7098.47  23.2781  0.00219644  -84.0833333333  -33.33333333333
21263.2  38208.1  60.0892  27703.4  1589.07  9294.17  10.2936  9471.09  3462.93  87.8499  25.8908  7098.47  23.2781  0.00219644  -84.0           -33.33333333333
21208.9  38186.3  63.612   27703.4  1589.07  9294.17  10.2936  9471.09  3462.93  87.8499  25.8908  7098.47  23.2781  0.00219644  -83.9166666667  -33.33333333333
21165    38178.8  67.6346  27192.5  1842.54  9394.1   10.2802  10428.4  3771.67  87.8443  25.8791  8149.04  23.2957  0.00247546  -83.8333333333  -33.33333333333
21131.9  38183.1  72.5871  27192.5  1842.54  9394.1   10.2802  10428.4  3771.67  87.8443  25.8791  8149.04  23.2957  0.00247546  -83.75          -33.33333333333
21096.6  38187.6  72.6827  27192.5  1842.54  9394.1   10.2802  10428.4  3771.67  87.8443  25.8791  8149.04  23.2957  0.00247546  -83.6666666667  -33.33333333333

for instance, expected result based on both files (not tested with correlation, just file structure as example):

110.0MN -83.6666666667  -3.33333333333
20.0MN  -84.25          -3.33333333333

Here a MWE composed of rows from the examples previously mentioned, where:

file1.txt:

17336.2  38352.3  23.3161  16838.3  4127.13  16173  20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639  0.0138022  110.0MN
16684.4  38376.4  24.5148  16838.3  4127.13  16173  20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639  0.0138022  66.21622MN

file2.txt

21096.6  38187.6  72.6827  27192.5  1842.54  9394.1   10.2802  10428.4  3771.67  87.8443  25.8791  8149.04  23.2957  0.00247546  -83.6666666667  -33.33333333333
17336.2  38352.3  23.3161  16838.3  4127.13  16173    20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639   0.0138022  -80.7777777777  -30.22222222222

result.txt

110.0MN -80.7777777777  -30.22222222222

Solution

  • With any POSIX awk:

    $ cat foo.awk
    function stat(    sum, sum2, i) {
      sum = sum2 = 0
      for(i = 1; i <= max; i++) {
        sum += $i
        sum2 += $i * $i
      }
      avg = sum / max
      std = sqrt(sum2 / max - avg * avg)
    }
    
    function pcc(i,    sum_ab, j) {
      sum_ab = 0;
      for(j = 1; j <= max; j++) sum_ab += a[i,j] * $j
      return (sum_ab / max - avg_a[i] * avg) / (std_a[i] * std)
    }
    
    BEGIN {
      max = 14
    }
    
    NR == FNR {
      cnt += 1
      stat()
      for(i = 1; i <= NF; i++) a[cnt,i] = $i
      avg_a[cnt] = avg
      std_a[cnt] = std
      next
    }
    
    {
      stat()
      if(std == 0) next
      for(i = 1; i <= cnt; i++) {
        if(std_a[i] == 0) continue
        if(pcc(i) > 0.9) print a[i,max+1], $(max+1), $(max+2);
      }
    }
    
    $ cat file1.txt
    17336.2  38352.3  23.3161  16838.3  4127.13  16173  20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639  0.0138022  110.0TN
    16684.4  38376.4  24.5148  16838.3  4127.13  16173  20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639  0.0138022  66.21622TN
    
    $ cat file2.txt
    21096.6  38187.6  72.6827  27192.5  1842.54  9394.1   10.2802  10428.4  3771.67  87.8443  25.8791  8149.04  23.2957  0.00247546  -83.6666666667  -33.33333333333
    17336.2  38352.3  23.3161  16838.3  4127.13  16173    20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639   0.0138022  -80.7777777777  -30.22222222222
    
    $ awk -f foo.awk file1.txt file2.txt 
    110.0TN -80.7777777777 -30.22222222222
    66.21622TN -80.7777777777 -30.22222222222
    

    max is the size of your samples. It is set to 14 in the BEGIN block.

    Pseudo 2D matrix a stores the values of the first file: a[i,j] is column j of row i. It is populated by the second block (NR == FNR) while parsing the first file. The second block also uses function stat to compute the average (avg) and the standard deviation (std) of the sample (the max first fields of each row). They are stored in arrays avg_a and std_a, respectively.

    The third block parses the rows of the second file. For each row, function stat computes the average (avg) and the standard deviation (std) of the sample, and function pcc is called to compute the Pearson Correlation Coefficient (PCC) of the row with each row of matrix a. If it is strictly larger than 0.9 the max+1-th column of the correlated row of a, and columns max+1 and max+2 of current row are printed.