
comm command gives faulty output?


I have two files that each list one file name per line. I merged them, sorted the result, and then checked the comm output and noticed something strange.

$ sort -u -o list1 list1
$ sort -u -o list2 list2
$ cat list1 list2 > combined
$ wc -l list1 
  18141 list1
$ wc -l list2 
  21755 list2
$ wc -l combined 
  39896 combined
$ sort -u -o combined combined 
$ wc -l combined 
  24400 combined


$ comm -23 list1 combined | wc -l
  12889
$ comm -13 list1 combined | wc -l
  19148
$ comm -12 list1 combined | wc -l
   5252


$ comm -23 list2 combined | wc -l
      0
$ comm -13 list2 combined | wc -l
   2645
$ comm -12 list2 combined | wc -l 
  21755

(line breaks above for clarity)

What's going on with those last few calls to comm? When I compare list1 to combined the output is wacky, but when I compare list2 to combined the output seems fine.
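For reference, the expected counts can be worked out from the wc -l numbers above. This is only a sketch of the arithmetic, assuming list1 has no duplicate lines (it went through sort -u) and every line of list1 ends up in combined; the variable names are just for illustration:

```shell
# Line counts copied from the wc -l output above.
list1=18141
list2=21755
combined=24400

# What comm should report if it agreed with sort on the ordering.
echo "comm -23 list1 combined: 0"
echo "comm -12 list1 combined: $list1"
echo "comm -13 list1 combined: $((combined - list1))"

# The observed numbers are at least internally consistent:
# unique-to-list1 plus common equals the size of list1,
# and unique-to-combined plus common equals the size of combined.
echo "list1 accounted for:    $((12889 + 5252))"
echo "combined accounted for: $((19148 + 5252))"
```

So comm is not losing lines. It is misclassifying lines that are present in both files as unique to one side, which is what happens when it walks the two files in an order different from the one they were sorted in.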

I even tried to combine all three lists again and test:

$ cat list1 list2 combined > combined-again
$ wc -l combined-again 
  64296 combined-again
$ sort -u -o combined-again combined-again
$ wc -l combined-again 
  24400 combined-again
$ diff combined combined-again

The sorted unique line count of combined and combined-again match, and there is no output from diff!

$ comm combined combined-again | wc -l
  24400
$ comm -12 combined combined-again | wc -l
  24400
$ comm -3 combined combined-again | wc -l
      0

These comm outputs make sense: there shouldn't be any difference between the two files.

$ comm -23 list1 combined-again | wc -l
  12889
$ comm -13 list1 combined-again | wc -l
  19148
$ comm -12 list1 combined-again | wc -l
   5252

When comparing against list1, we see the same wonky numbers again.

$ comm -23 list2 combined-again | wc -l                     
      0
$ comm -13 list2 combined-again | wc -l
   2645
$ comm -12 list2 combined-again | wc -l
  21755

When comparing against list2, the numbers are appropriate and correct.

I even took some lines of output from comm -23 list1 combined-again and used grep to look for them in combined-again, and those lines do exist. I'm totally at a loss as to why the comm output is faulty in this case...

EDIT1:

$ locale
  LANG="en_US.UTF-8"
  LC_COLLATE="en_US.UTF-8"
  LC_CTYPE="en_US.UTF-8"
  LC_MESSAGES="en_US.UTF-8"
  LC_MONETARY="en_US.UTF-8"
  LC_NUMERIC="en_US.UTF-8"
  LC_TIME="en_US.UTF-8"
  LC_ALL=

Neither file contains any weird symbols or characters, just package names in camel case. For example:

$ head list1
  AAAAuthentication
  AAACorrelationAPI
  AAACorrespondence
  AAATestSuite
  AESDescription
  AESImplementation
  AESLogging
  AESMaster
  AESProofSystem
  AESTestSuite

EDIT2:

After some more investigation prompted by suggestions in the comments, it seems the issue could be caused by the versions of the comm and sort tools.

I ran all of the above commands on a Mac, where comm is the BSD version dated January 26, 2005, and sort is GNU coreutils sort 5.93 from November 2005.

On the Linux box, both comm and sort are from GNU coreutils 8.4 of January 2012, and the calls work perfectly.

I guess the question now is: what differs between these versions, and why does it affect the comm output as shown above?


Solution

  • For comm to work, its input needs to be sorted, and comm has to agree with your sort on which collation order to use. In the C locale (LC_ALL=C) this is easy: strings are compared one byte at a time, and the first byte that differs determines the order.

    In the en_US.UTF-8 locale, it's harder. First of all, there's no single authority describing exactly what the expected behavior is. Every vendor is free to decide what "English sort order, US variant" means, and then to document that decision or not (usually they pick "not"). And when half your tools come from BSD and half from GNU, the chance of disagreement increases (though theoretically, I think they both should defer to the local C library...)

    Running all your commands with LC_ALL=C should make them more likely to agree with each other.
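As a rough sketch of what that looks like in practice, the following rebuilds a small version of the experiment with the C locale forced for every tool. The sample names and the scratch directory are made up for illustration; only sort, comm, wc, tr, and mktemp are used, with standard options.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical sample data with mixed case, where a locale-aware
# ordering and a byte-wise ordering can disagree.
printf '%s\n' AESMaster AAATestSuite aesLogging AESLogging > list1
printf '%s\n' AESMaster Zebra aesLogging > list2

# Force byte-wise collation for every command that follows.
export LC_ALL=C

sort -u -o list1 list1
sort -u -o list2 list2
sort -u -o combined list1 list2

# sort -c exits non-zero if a file is not in order for the current locale.
sort -c list1 && sort -c combined && echo "inputs are sorted"

# tr strips the padding that some wc implementations add.
only_list1=$(comm -23 list1 combined | wc -l | tr -d ' ')
common=$(comm -12 list1 combined | wc -l | tr -d ' ')
only_combined=$(comm -13 list1 combined | wc -l | tr -d ' ')

echo "only in list1:    $only_list1"
echo "in both:          $common"
echo "only in combined: $only_combined"
```

With this sample data the three counts should come out as 0, 4, and 1. The same effect can be had per command, for example LC_ALL=C sort -u -o list1 list1 followed by LC_ALL=C comm -23 list1 combined. Note that files already sorted under en_US.UTF-8 have to be re-sorted under LC_ALL=C before comm is run on them.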