I have over forty files with the following structure:
file1 first 21 lines
8191 M0
139559 M1
79 M10
1 M10007
1 M1006
1 M10123
file2 first 21 lines
8584 M0
119837 M1
72 M10
1 M10003
1 M10045
1 M1014
file3 first 21 lines
9090 M0
137373 M1
73 M10
1 M10046
2 M101
1 M1039
where number is the number of occurrences of an M
pattern, tab-separated. Now, the thing is these M
patterns are in part shared across all files and are in the range of 700-800 total for each one of them.
What I wish to do is to use AWK to extract only those common to all the forty-plus files (say ~600) along with their count (the column). Ideally, the final file will have forty-plus columns + 1 indicating for the shared M
patterns, with no particular order as I can then sort
on the last M-pattern column.
Something like this I imagine:
file1 | file2 | file3 | M-pattern |
---|---|---|---|
8191 | 8584 | 9090 | M0 |
139599 | 119837 | 137373 | M1 |
79 | 72 | 73 | M10 |
In theory since AWK should work sequentially I should be able to parse a header afterward that reflects the order in which files have been added using something like sed
. Any help is much appreciated, thanks in advance!
So far, I have attempted the following found in some other related answers:
awk 'FNR==NR{a[$0];next} $0 in a' one two
but it seems to not acting on column $2 where my M-patterns are, neither I think I understood how to modify it to do so eventually...
Using GNU awk for arrays of arrays, ARGIND
, and sorted_in
:
$ cat tst.awk
{ cnts[$2][ARGIND] += $1 }
END {
OFS = "\t"
for ( key in cnts ) {
if ( length(cnts[key]) == ARGIND ) {
goodKeys[key]
}
}
if ( length(goodKeys) ) {
for ( fileNr=1; fileNr<=ARGIND; fileNr++ ) {
printf "%s%s", ARGV[fileNr], OFS
}
print "M-pattern"
PROCINFO["sorted_in"] = "@ind_str_asc"
for ( key in goodKeys ) {
for ( fileNr=1; fileNr<=ARGIND; fileNr++ ) {
printf "%d%s", cnts[key][fileNr], OFS
}
print key
}
}
}
$ awk -f tst.awk file1 file2 file3
file1 file2 file3 M-pattern
8191 8584 9090 M0
139559 119837 137373 M1
79 72 73 M10
The above would work even if a given "M-pattern" could occur multiple times in an input file, it'd just summarize the counts for that M-pattern in that file, and it won't produce any output if no M-patterns exist across all files. It'll print the file names in the order provided on the command line and sort the M-patterns alphabetically in the output.