I have about 2000 files in a directory on a Linux server. In each file, positions x-y hold invoice numbers. What is the best way to check for duplicates across these files and print the file names and values? A simplified version of the problem:
$ cat a.txt
xyz1234
xyz1234
pqr4567
$ cat b.txt
lon9876
lon9876
lon4567
In the above two files, assuming the invoice numbers are in positions 4-7, we have a duplicate: "4567" appears in both a.txt and b.txt. Duplicates within the same file, like "1234" in a.txt, are fine; no need to print those. I tried to cut the invoice numbers, but the output doesn't have the file names. My plan was to cut, get the file names along with the invoice numbers, run uniq on the output, and so on.
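For what it's worth, that plan can be sketched as a shell pipeline (a sketch, assuming fixed-width lines, simple file names without spaces, and GNU uniq for the -D option; cut -c4-7 picks characters 4 through 7):

$ for f in *.txt; do cut -c4-7 "$f" | sort -u | sed "s/^/$f /"; done | sort -k2 | uniq -D -f1
a.txt 4567
b.txt 4567

The per-file sort -u drops within-file duplicates, and uniq -D -f1 then prints every line whose invoice field appears under more than one file name.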
Perl to the rescue!
perl -lne '
    $in_file{ substr $_, 3, 4 }{$ARGV} = 1;
    END {
        for my $invoice (keys %in_file) {
            print join "\t", $invoice, keys %{ $in_file{$invoice} }
                if keys %{ $in_file{$invoice} } > 1;
        }
    }
' -- *.txt
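For the sample a.txt and b.txt above, this should print a single line like the one below (the order of the file names can vary between runs, since Perl hash keys are unordered):

4567    a.txt   b.txt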
-n reads the input files line by line, running the code for each;
-l removes newlines from the input and adds them to printed lines;
$ARGV contains the name of the currently open file.
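For readers less used to one-liners, here is a rough standalone equivalent (a hypothetical check_dups.pl, run as perl check_dups.pl *.txt); -n corresponds to the while (<>) loop, and -l to the chomp plus the explicit trailing newline:

#!/usr/bin/perl
use strict;
use warnings;

# invoice number => { file name => 1 }
my %in_file;

while (<>) {
    chomp;
    # Characters 4-7 (offset 3, length 4) hold the invoice number.
    $in_file{ substr $_, 3, 4 }{$ARGV} = 1;
}

# Report every invoice number that was seen in more than one file.
for my $invoice (keys %in_file) {
    my @files = keys %{ $in_file{$invoice} };
    print join("\t", $invoice, @files), "\n" if @files > 1;
}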