duplicates comm

comm -23 not deleting all common lines


I want to delete the lines from file 1.txt that also appear in file 2.txt and save the result to 3.txt. I am using this bash command:

comm -23 1.txt 2.txt > 3.txt

When I check the output in 3.txt, I find that some lines common to 1.txt and 2.txt are still there. The word "registry" is one example. What is the problem?

You can download the two files below:

file 1.txt : https://ufile.io/n7vn6

file 2.txt : https://ufile.io/p4s58


Solution

  • I'm not sure how you generated your text files, but the problem is that the lines in 1.txt and 2.txt don't have consistent line endings. Some lines end with a carriage return (CR, Ctrl-M) before the line feed, instead of the lone line feed that Linux expects in text files. For example, one file has registry^M, which doesn't match registry: Linux tools that compare text treat the CR as an ordinary character (or as white space) within the line, not as part of a line terminator to be ignored. Many text editors don't display the ^M, so registry looks identical in both files, but it isn't.

    You could try:

    dos2unix 1.txt 2.txt
    comm -23 <(sort 1.txt) <(sort 2.txt) > 3.txt
    

    dos2unix converts the files in place so that every line ends with a plain line feed (assuming the stray characters are DOS-style CR+LF endings). Removing the CRs can change the sort order slightly, so I'm also re-sorting both files. You can try it without re-sorting; if the input isn't in order, comm will complain that one of the files isn't sorted.
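
If dos2unix isn't installed, the same idea can be sketched with standard tools: count the lines that contain a CR to confirm the diagnosis, then strip the CRs with tr before sorting and comparing. The sample data below is hypothetical and only stands in for 1.txt and 2.txt; the temporary directory and the .clean file names are assumptions made for the illustration.

```shell
# Hypothetical sample data standing in for 1.txt and 2.txt (not the asker's files).
workdir=$(mktemp -d)
printf 'apple\r\nregistry\r\nzebra\r\n' > "$workdir/1.txt"   # CR+LF line endings
printf 'registry\n' > "$workdir/2.txt"                        # plain LF line endings

# Count the lines that contain a carriage return; non-zero means stray CRs.
cr=$(printf '\r')
cr_lines=$(grep -c "$cr" "$workdir/1.txt")

# Strip every CR, then sort, so that comm sees clean, ordered input.
tr -d '\r' < "$workdir/1.txt" | sort > "$workdir/1.clean"
tr -d '\r' < "$workdir/2.txt" | sort > "$workdir/2.clean"

# Keep only the lines unique to the first file.
result=$(comm -23 "$workdir/1.clean" "$workdir/2.clean")

echo "lines with CR: $cr_lines"
echo "$result"
rm -rf "$workdir"
```

Unlike dos2unix, this leaves the original files untouched and writes cleaned copies instead. Note that tr -d '\r' removes every CR in the file, not only those at the end of a line, which is usually what you want for plain word lists like these.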