bash character-encoding comm

Not able to compare two files with comm / diff


Long-time lurker, first-time poster.

For several days I have been trying to compare two sorted files, without success. I tried comm and diff, and even grep -v -f. When I merged the files and ran uniq -c, every line was reported with a count of 1, so the tools clearly do not consider the lines identical. comm likewise reports all 4000 lines as unique to each file. To the human eye, though, they are identical. file -i shows they have the same encoding. I checked in vi for hidden characters, and both files look absolutely identical.

[root@server tmp]# file -i master.tmp
master.tmp: text/plain; charset=us-ascii
[root@server tmp]# file -i mediaa.tmp
mediaa.tmp: text/plain; charset=us-ascii

I can't share the exact lines, but they look similar to this:

XXXXX%20(35e4df6a-48dd-43f-921-03942bd4)_1614884940

The only difference between the files is how they were created. One is the direct output of an application command. The other was assembled from the output of a different application and had to be manipulated with awk to get the same structure. Another lead: if I copy the text into Notepad++ and then paste it back into the terminal, the comparison starts working correctly. That is not an acceptable workaround, though, because the comparison will be part of a bigger script and needs to be automatic. Are there any commands I could use to clear up discrepancies in the file structure? I found iconv, but I'm not sure which other encoding to try. Any ideas what I am missing here? Thanks.


Solution

  • It's strange that vi's :set list didn't show the difference.

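    You can see the difference in vi if, immediately after loading the CR+LF file, you look at the status line: [dos] is displayed next to the file name. Vim detects the DOS line endings and hides the carriage returns, which is most likely why :set list showed nothing unusual.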

    If you just want to compare the files, you can use GNU diff with the -Z option (ignore white space at line end).

    If you want to remove the CRs from the DOS file, you can use tr -d '\r' < withCR > withoutCR.
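The detection and cleanup described above can be sketched end to end as follows. This is a minimal illustration, assuming bash and two hypothetical demo files (unix.txt and dos.txt) that stand in for the real master.tmp and mediaa.tmp, whose contents are not reproduced here. It counts the lines that carry a carriage return, strips them with tr, and shows that comm only finds common lines after the cleanup.

```shell
#!/usr/bin/env bash
# Byte-wise comparison, so the result does not depend on the locale.
export LC_ALL=C

# Hypothetical demo files; they only imitate the situation in the question.
workdir=$(mktemp -d)
printf 'alpha_1\nbeta_2\n'     > "$workdir/unix.txt"   # LF line endings
printf 'alpha_1\r\nbeta_2\r\n' > "$workdir/dos.txt"    # CR+LF line endings

# 1. Detect: count the lines that contain a carriage return.
cr_lines=$(grep -c $'\r' "$workdir/dos.txt")
echo "lines with CR: $cr_lines"

# 2. Before cleanup: comm -12 prints only lines common to both files.
common_before=$(comm -12 "$workdir/unix.txt" "$workdir/dos.txt" | wc -l | tr -d ' ')

# 3. Strip the CRs, then compare again.
tr -d '\r' < "$workdir/dos.txt" > "$workdir/dos_clean.txt"
common_after=$(comm -12 "$workdir/unix.txt" "$workdir/dos_clean.txt" | wc -l | tr -d ' ')

echo "common lines before: $common_before, after: $common_after"
rm -rf "$workdir"
```

In a real script, the tr step would run on whichever input file was produced with CR+LF line endings, before the result is handed to comm.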