I am trying to write a bash script that takes three user dictionaries from various places across my boxen, combines them, removes duplicates, and then writes them back to their respective locations.
However, when I cat the files together and run the result through either sort -u or uniq (roughly the pipeline sketched after the sample below), the duplicate lines remain:
Alastair
Alastair
Albanese
Albanese
Alberts
Alberts
Alec
Alec
Alex
Alex
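For reference, the merge-and-dedupe step itself is roughly this (a sketch; the three file names are placeholders for my actual dictionary locations):
# concatenate the three dictionaries and drop duplicate lines (placeholder paths)
cat dict1.txt dict2.txt dict3.txt | sort -u > combined.txt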
I narrowed it down to one of the files, which comes from Microsoft Outlook/Windows and is called CUSTOM.DIC. By examining it with file -i I found that it was a UTF-16LE file (it was printing oriental characters when concatenated with the UTF-8 files directly), so I ran the command
iconv -f utf-16le -t utf-8 CUSTOM.DIC -o CUSTOMUTF8.DIC
Yet, when I concatenate that file with my other UTF-8 files, it still produces duplicates that cannot be removed using sort -u or uniq.
I have found that for large files, file -i only guesses the encoding from the first portion of the file rather than reading the whole thing, so I ran the following to force it to scan the entire file:
file_to_check="CUSTOMUTF8.DIC"
bytes_to_scan=$(wc -c < "$file_to_check")
file -b --mime-encoding -P bytes="$bytes_to_scan" "$file_to_check"
with the output:
utf-8
So the conversion has happened, and the combined output file combined.txt is UTF-8 as well, so why can't I remove the duplicate lines?
I have also checked to see if there are any trailing spaces in the combined file.
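The trailing-space check was essentially this (a sketch; combined.txt stands in for my merged file):
# report any lines ending in a space or tab
grep -n '[[:blank:]]$' combined.txt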
This feels like a problem that many people would have seen before, but I can't find the answer (or I've created the wrong search string, of course)...
Many thanks to @Andrew Henle - I knew it would be something simple!
Indeed, using hexdump -c combined2.txt I saw that some lines ended with \n and some with \r\n.
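A quick way to confirm that without reading the whole hexdump is to count the lines that still carry a carriage return (a sketch against the same combined2.txt):
# count lines containing a DOS carriage return
grep -c $'\r' combined2.txt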
So I downloaded dos2unix and ran
dos2unix combined2.txt
sort -u combined2.txt > combined3.txt
and it's all good!
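For anyone without dos2unix to hand, stripping the carriage returns with tr before sorting should do the same job; something like:
# delete the carriage returns, then de-duplicate
tr -d '\r' < combined2.txt | sort -u > combined3.txt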
Thanks again, Andrew!