Get common lines, for only specific fields, from multiple files

I am trying to understand the following code used to pull out overlapping lines over multiple files using BASH.

awk 'END {
  # the END block is executed after
  # all the input has been read
  # loop over the rec array
  # and build the dup array indxed by the nuber of
  # filenames containing a given record
  for (R in rec) {
    n = split(rec[R], t, "/")
    if (n > 1) 
      dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s", rec[R], R) : \
        sprintf("\t%-20s -->\t%s", rec[R], R)
    }
  # loop over the dup array
  # and report the number and the names of the files 
  # containing the record   
  for (D in dup) {
    printf "records found in %d files:\n\n", D
    printf "%s\n\n", dup[D]
    }  
  }
{  
  # build an array named rec (short for record), indexed by 
  # the content of the current record ($0), concatenating 
  # the filenames separated by / as values
  rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
  }' file[a-d]

After understanding what each sub-block of code is doing, I would like to extend this code to find specific fields with overlap, rather than the entire line. For example, I have tried changing the line:

n = split(rec[R], t, "/")

n = split(rec[R$1], t, "/")

to find the lines where the first field is the same across all files but this did not work. Eventually I would like to extend this to check that a line has fields 1, 2, and 4 the same, and then print the line.

Specifically, for the files mentioned in the example in the link: if file 1 is:

chr1    31237964    NP_055491.1    PUM1    M340L
chr1    33251518    NP_037543.1    AK2    H191D

and file 2 is:

chr1    116944164    NP_001533.2    IGSF3    R671W
chr1    33251518    NP_001616.1    AK2    H191D
chr1    57027345    NP_001004303.2    C1orf168    P270S

I would like to pull out:

file1/file2 --> chr1    33251518    AK2    H191D

I found this code at the following link: http://www.unix.com/shell-programming-and-scripting/140390-get-common-lines-multiple-files.html#post302437738. Specifically, I would like to understand what R, rec, n, dup, and D represent from the files themselves. It is unclear from the comments provided and printf statements I've added within the subloops fail.

Thank you very much for any insight on this!

Solution

The script works by building an auxiliary array, the indices of which are the lines in the input files (denoted by $0 in rec[$0]), and the values are filename1/filename3/... for those filenames in which the given line $0 is present. You can hack it up to just work with $1,$2 and $4 like so:

awk 'END {
  # the END block is executed after
  # all the input has been read
  # loop over the rec array
  # and build the dup array indxed by the nuber of
  # filenames containing a given record
  for (R in rec) {
    n = split(rec[R], t, "/")
    if (n > 1) {
        split(R,R1R2R4,SUBSEP)
        dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1],R1R2R4[2],R1R2R4[3]) : \
          sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1],R1R2R4[2],R1R2R4[3])
      }
    }
  # loop over the dup array
  # and report the number and the names of the files 
  # containing the record   
  for (D in dup) {
    printf "records found in %d files:\n\n", D
    printf "%s\n\n", dup[D]
    }  
  }
{  
  # build an array named rec (short for record), indexed by 
  # the partial content of the current record
  # (special concatenation of $1, $2 and $4)
  # concatenating the filenames separated by / as values
  rec[$1,$2,$4] = rec[$1,$2,$4] ? rec[$1,$2,$4] "/" FILENAME : FILENAME
  }' file[a-d]

this solution makes use of multidimensional arrays: we create rec[$1,$2,$4] instead of rec[$0]. This special syntax of awk concatenates the indices with the SUBSEP character, which is by default non-printable ("\034" to be precise), and so it is unlikely to be part of either of the fields. In effect it does rec[$1 SUBSEP $2 SUBSEP $4]=.... Otherwise this part of the code is the same. Note that it would be more logical to move the second block to the beginning of the script, and finish with the END block.

The first part of the code also has to be changed: now for (R in rec) loops over these tricky concatenated indices, $1 SUBSEP $2 SUBSEP $4. This is good while indexing, but you need to split R at the SUBSEP characters to obtain again the printable fields $1, $2, $4. These are put into the array R1R2R4, which can be used to print the necessary output: instead of %s,...,R we now have %s\t%s\t%s,...,R1R2R4[1],R1R2R4[2],R1R2R4[3],. In effect we're doing sprintf ...%s,...,$1,$2,$4; with pre-saved fields $1, $2, $4. For your input example this will print

records found in 2 files:

    foo11.inp1/foo11.inp2 -->   chr1    33251518    AK2

Note that the output is missing H191D but rightly so: that is not in field 1, 2 or 4 (but rather in field 5), so there's no guarantee that it is the same in the printed files! You probably don't want to print that, or anyway have to specify how you should treat the columns which are not checked between files (and so may differ).

A bit of explanation for the original code:

rec is an array, the indices of which are full lines of input, and the values are the slash-separated list of files in which those lines appear. For instance, if file1 contains a line "foo bar", then rec["foo bar"]=="file1" initially. If then file2 also contains this line, then rec["foo bar"]=="file1/file2". Note that there are no checks for multiplicity, so if file1 contains this line twice, then eventually you'll get rec["foo bar"]=file1/file1/file2 and obtain 3 for the number of files containing this line.
R goes over the indices of the array rec after it has been fully built. This means that R will eventually assume each unique line of every input file, allowing us to loop over rec[R], containing the filenames in which that specific line R was present.
n is a return value from split, which splits the value of rec[R] --- that is the filename list corresponding to line R --- at each slash. Eventually the array t is filled with the list of files, but we don't make use of this, we only use the length of the array t, i.e. the number of files in which line R is present (this is saved in the variable n). If n==1, we don't do anything, only if there are multiplicities.
the loop over n creates classes according to the multiplicity of a given line. n==2 applies to lines that are present in exactly 2 files. n==3 to those which appear thrice, and so on. What this loop does is that it builds an array dup, which for every multiplicity class (i.e. for every n) creates the output string "filename1/filename2/... --> R", with each of these strings separated by RS (the record separator) for each value of R that appears n times total in the files. So eventually dup[n] for a given n will contain a given number of strings in the form of "filename1/filename2/... --> R", concatenated with the RS character (by default a newline).
The loop over D in dup will then go through multiplicity classes (i.e. valid values of n larger than 1), and print the gathered output lines which are in dup[D] for each D. Since we only defined dup[n] for n>1, D starts from 2 if there are multiplicities (or, if there aren't any, then dup is empty, and the loop over D will not do anything).