bash shell awk sed

Find lines common to multiple files


I have nearly 200 files and want to find the lines that are common to all of them. The lines look like this:

HISEQ1:105:C0A57ACXX:2:1101:10000:105587/1
HISEQ1:105:C0A57ACXX:2:1101:10000:105587/2
HISEQ1:105:C0A57ACXX:2:1101:10000:121322/1
HISEQ1:105:C0A57ACXX:2:1101:10000:121322/2
HISEQ1:105:C0A57ACXX:2:1101:10000:12798/1
HISEQ1:105:C0A57ACXX:2:1101:10000:12798/2

Is there a way to do this in batch?


Solution

  • awk '(NR==FNR){a[$0]=1;next}
         (FNR==1){ for(i in a) if(a[i]) {a[i]=0} else {delete a[i]} }
         ($0 in a) { a[$0]=1 }
         END{for (i in a) if (a[i]) print i}' file1 file2 file3 ... file200
    

    This method processes each file line by line. The idea is to keep track of which lines have been seen in the current file by using an associative array a[line]: 1 means the line has been seen in the current file, 0 means it has not.

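    1. (NR==FNR){a[$0]=1;next}: store the first file in an array indexed by the line, and mark each line as seen. NR==FNR is true only while the first file is being read, because the overall record number NR equals the per-file record number FNR only there; next skips the remaining rules for those lines.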
    2. (FNR==1){for(i in a) if(a[i]) {a[i]=0} else {delete a[i]} }: on the first line of each later file, check which lines were seen in the previous file. Lines that were not seen are deleted; lines that were seen are reset to not-seen (0). This frees memory and keeps duplicate lines within a single file from being counted more than once.
    3. ($0 in a) { a[$0]=1 }: for each line, check whether it is a member of the array; if it is, mark it as seen (1).
    4. END{for (i in a) if(a[i]) print i}: when all lines have been processed, print the entries still marked as seen, which are the lines present in every file.
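
To see the four steps in action, here is a small sketch that runs the same awk program on three tiny sample files. The function name common_lines, the scratch directory, the s1.txt to s3.txt file names, and the read_X/1 lines are all made up for illustration; only the awk program itself comes from the answer above.

```shell
#!/bin/sh
# Hypothetical helper: prints the lines that occur in every file passed as an argument.
common_lines() {
    awk '(NR==FNR){a[$0]=1;next}
         (FNR==1){ for(i in a) if(a[i]) {a[i]=0} else {delete a[i]} }
         ($0 in a) { a[$0]=1 }
         END{for (i in a) if (a[i]) print i}' "$@"
}

# Three small sample files (made-up read IDs) in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'read_A/1' 'read_B/1' 'read_C/1'            > "$tmpdir/s1.txt"
printf '%s\n' 'read_B/1' 'read_C/1' 'read_C/1' 'read_D/1' > "$tmpdir/s2.txt"
printf '%s\n' 'read_C/1' 'read_E/1' 'read_B/1'            > "$tmpdir/s3.txt"

# The glob expands to every file in the directory, which is how 200 files
# could be passed without typing each name. awk's for-in order is
# unspecified, so the output is sorted to make it stable.
common=$(common_lines "$tmpdir"/*.txt | sort)
printf '%s\n' "$common"

rm -rf "$tmpdir"
```

With these sample files the script prints read_B/1 and read_C/1, one per line: read_A/1 is missing from the second file, read_D/1 and read_E/1 are missing from the first, and the duplicate read_C/1 in the second file is reported only once. One caveat worth keeping in mind: the program relies on FNR==1 to detect a new file, so an empty input file is silently skipped instead of making the result empty.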