Tags: linux, file, awk, sed, cut

How do I check if there are duplicate values across files at a specific position?


I have about 2000 files in a directory on a Linux server. In each file, positions x-y hold invoice numbers. What is the best way to check for duplicates across these files and print the file names along with the duplicated values? A simplified version of the problem:

$ cat a.txt 
xyz1234
xyz1234
pqr4567
$ cat b.txt 
lon9876
lon9876
lon4567

In the above 2 files, assuming the invoice numbers are in positions 4-7, we have a duplicate across files: "4567" appears in both a.txt and b.txt. Duplicates within the same file, like "1234" in a.txt, are fine; there is no need to print those. I tried to cut the invoice numbers, but the output doesn't include the file names. My plan was to cut the invoice numbers together with the file names, run a unique on the output, and so on, as sketched below.
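
Roughly this, shown here with awk instead of cut so the file name is available as FILENAME (a rough sketch, assuming the invoice number is the four characters starting at position 4 and the files match *.txt):

    # emit one "invoice filename" pair per input line
    awk '{ print substr($0, 4, 4), FILENAME }' *.txt |
        sort -u |    # collapse repeats within the same file
        awk '{ n[$1]++; f[$1] = f[$1] " " $2 }
             END { for (i in n) if (n[i] > 1) print i f[i] }'

The sort -u step is what makes same-file repeats harmless: each invoice/file pair survives only once, so the final count is the number of distinct files an invoice appears in.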


Solution

  • Perl to the rescue!

    perl -lne '
        # remember which files each invoice number appears in;
        # substr $_, 3, 4 grabs the 4 characters at positions 4-7
        $in_file{ substr $_, 3, 4 }{$ARGV} = 1;
        END {
            # report every invoice seen in more than one file
            for $invoice (keys %in_file) {
                print join "\t", $invoice, sort keys %{ $in_file{$invoice} }
                    if keys %{ $in_file{$invoice} } > 1;
            }
        }
    ' -- *txt
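
    On the two sample files above, the only invoice that occurs in more
    than one file is 4567, so this prints a single tab-separated line:

        4567    a.txt   b.txt

    The added sort keeps the file-name order stable; without it, Perl's
    hash ordering would make the order of a.txt and b.txt arbitrary.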