I am trying to understand the following code used to pull out overlapping lines over multiple files using BASH.
awk 'END {
# the END block is executed after
# all the input has been read
# loop over the rec array
# and build the dup array indxed by the nuber of
# filenames containing a given record
for (R in rec) {
n = split(rec[R], t, "/")
if (n > 1)
dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s", rec[R], R) : \
sprintf("\t%-20s -->\t%s", rec[R], R)
}
# loop over the dup array
# and report the number and the names of the files
# containing the record
for (D in dup) {
printf "records found in %d files:\n\n", D
printf "%s\n\n", dup[D]
}
}
{
# build an array named rec (short for record), indexed by
# the content of the current record ($0), concatenating
# the filenames separated by / as values
rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
}' file[a-d]
After understanding what each sub-block of code is doing, I would like to extend this code to find specific fields with overlap, rather than the entire line. For example, I have tried changing the line:
n = split(rec[R], t, "/")
to
n = split(rec[R$1], t, "/")
to find the lines where the first field is the same across all files but this did not work. Eventually I would like to extend this to check that a line has fields 1, 2, and 4 the same, and then print the line.
Specifically, for the files mentioned in the example in the link: if file 1 is:
chr1 31237964 NP_055491.1 PUM1 M340L
chr1 33251518 NP_037543.1 AK2 H191D
and file 2 is:
chr1 116944164 NP_001533.2 IGSF3 R671W
chr1 33251518 NP_001616.1 AK2 H191D
chr1 57027345 NP_001004303.2 C1orf168 P270S
I would like to pull out:
file1/file2 --> chr1 33251518 AK2 H191D
I found this code at the following link: http://www.unix.com/shell-programming-and-scripting/140390-get-common-lines-multiple-files.html#post302437738. Specifically, I would like to understand what R, rec, n, dup, and D represent from the files themselves. It is unclear from the comments provided and printf statements I've added within the subloops fail.
Thank you very much for any insight on this!
The script works by building an auxiliary array, the indices of which are the lines in the input files (denoted by $0
in rec[$0]
), and the values are filename1/filename3/...
for those filenames in which the given line $0
is present. You can hack it up to just work with $1
,$2
and $4
like so:
awk 'END {
# the END block is executed after
# all the input has been read
# loop over the rec array
# and build the dup array indxed by the nuber of
# filenames containing a given record
for (R in rec) {
n = split(rec[R], t, "/")
if (n > 1) {
split(R,R1R2R4,SUBSEP)
dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1],R1R2R4[2],R1R2R4[3]) : \
sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1],R1R2R4[2],R1R2R4[3])
}
}
# loop over the dup array
# and report the number and the names of the files
# containing the record
for (D in dup) {
printf "records found in %d files:\n\n", D
printf "%s\n\n", dup[D]
}
}
{
# build an array named rec (short for record), indexed by
# the partial content of the current record
# (special concatenation of $1, $2 and $4)
# concatenating the filenames separated by / as values
rec[$1,$2,$4] = rec[$1,$2,$4] ? rec[$1,$2,$4] "/" FILENAME : FILENAME
}' file[a-d]
this solution makes use of multidimensional arrays: we create rec[$1,$2,$4]
instead of rec[$0]
. This special syntax of awk
concatenates the indices with the SUBSEP
character, which is by default non-printable ("\034" to be precise), and so it is unlikely to be part of either of the fields. In effect it does rec[$1 SUBSEP $2 SUBSEP $4]=...
. Otherwise this part of the code is the same. Note that it would be more logical to move the second block to the beginning of the script, and finish with the END
block.
The first part of the code also has to be changed: now for (R in rec)
loops over these tricky concatenated indices, $1 SUBSEP $2 SUBSEP $4
. This is good while indexing, but you need to split
R
at the SUBSEP
characters to obtain again the printable fields $1
, $2
, $4
. These are put into the array R1R2R4
, which can be used to print the necessary output: instead of %s,...,R
we now have %s\t%s\t%s,...,R1R2R4[1],R1R2R4[2],R1R2R4[3],
. In effect we're doing sprintf ...%s,...,$1,$2,$4;
with pre-saved fields $1
, $2
, $4
. For your input example this will print
records found in 2 files:
foo11.inp1/foo11.inp2 --> chr1 33251518 AK2
Note that the output is missing H191D
but rightly so: that is not in field 1, 2 or 4 (but rather in field 5), so there's no guarantee that it is the same in the printed files! You probably don't want to print that, or anyway have to specify how you should treat the columns which are not checked between files (and so may differ).
A bit of explanation for the original code:
rec
is an array, the indices of which are full lines of input, and the values are the slash-separated list of files in which those lines appear. For instance, if file1
contains a line "foo bar
", then rec["foo bar"]=="file1"
initially. If then file2
also contains this line, then rec["foo bar"]=="file1/file2"
. Note that there are no checks for multiplicity, so if file1
contains this line twice, then eventually you'll get rec["foo bar"]=file1/file1/file2
and obtain 3 for the number of files containing this line.R
goes over the indices of the array rec
after it has been fully built. This means that R
will eventually assume each unique line of every input file, allowing us to loop over rec[R]
, containing the filenames in which that specific line R
was present.n
is a return value from split
, which splits the value of rec[R]
--- that is the filename list corresponding to line R
--- at each slash. Eventually the array t
is filled with the list of files, but we don't make use of this, we only use the length of the array t
, i.e. the number of files in which line R
is present (this is saved in the variable n
). If n==1
, we don't do anything, only if there are multiplicities.n
creates classes according to the multiplicity of a given line. n==2
applies to lines that are present in exactly 2 files. n==3
to those which appear thrice, and so on. What this loop does is that it builds an array dup
, which for every multiplicity class (i.e. for every n
) creates the output string "filename1/filename2/... --> R"
, with each of these strings separated by RS
(the record separator) for each value of R
that appears n
times total in the files. So eventually dup[n]
for a given n
will contain a given number of strings in the form of "filename1/filename2/... --> R"
, concatenated with the RS
character (by default a newline).D in dup
will then go through multiplicity classes (i.e. valid values of n
larger than 1), and print the gathered output lines which are in dup[D]
for each D
. Since we only defined dup[n]
for n>1
, D
starts from 2 if there are multiplicities (or, if there aren't any, then dup
is empty, and the loop over D
will not do anything).