awkpattern-matchingmultiple-columns

how to select matching rows in multiple files with AWK


I have over forty files with the following structure:

file1 first 21 lines

8191 M0
139559 M1
79 M10
1 M10007
1 M1006
1 M10123

file2 first 21 lines

8584 M0
119837 M1
72 M10
1 M10003
1 M10045
1 M1014

file3 first 21 lines

9090 M0
137373 M1
73 M10
1 M10046
2 M101
1 M1039

where number is the number of occurrences of an M pattern, tab-separated. Now, the thing is these M patterns are in part shared across all files and are in the range of 700-800 total for each one of them.

What I wish to do is to use AWK to extract only those common to all the forty-plus files (say ~600) along with their count (the column). Ideally, the final file will have forty-plus columns + 1 indicating for the shared M patterns, with no particular order as I can then sort on the last M-pattern column. Something like this I imagine:

file1 file2 file3 M-pattern
8191 8584 9090 M0
139599 119837 137373 M1
79 72 73 M10

In theory since AWK should work sequentially I should be able to parse a header afterward that reflects the order in which files have been added using something like sed. Any help is much appreciated, thanks in advance!


So far, I have attempted the following found in some other related answers:

awk 'FNR==NR{a[$0];next} $0 in a'  one  two

but it seems to not acting on column $2 where my M-patterns are, neither I think I understood how to modify it to do so eventually...


Solution

  • Using GNU awk for arrays of arrays, ARGIND, and sorted_in:

    $ cat tst.awk
    { cnts[$2][ARGIND] += $1 }
    END {
        OFS = "\t"
    
        for ( key in cnts ) {
            if ( length(cnts[key]) == ARGIND ) {
                goodKeys[key]
            }
        }
    
        if ( length(goodKeys) ) {
            for ( fileNr=1; fileNr<=ARGIND; fileNr++ ) {
                printf "%s%s", ARGV[fileNr], OFS
            }
            print "M-pattern"
    
            PROCINFO["sorted_in"] = "@ind_str_asc"
            for ( key in goodKeys ) {
                for ( fileNr=1; fileNr<=ARGIND; fileNr++ ) {
                    printf "%d%s", cnts[key][fileNr], OFS
                }
                print key
            }
        }
    }
    

    $ awk -f tst.awk file1 file2 file3
    file1   file2   file3   M-pattern
    8191    8584    9090    M0
    139559  119837  137373  M1
    79      72      73      M10
    

    The above would work even if a given "M-pattern" could occur multiple times in an input file, it'd just summarize the counts for that M-pattern in that file, and it won't produce any output if no M-patterns exist across all files. It'll print the file names in the order provided on the command line and sort the M-patterns alphabetically in the output.