python list character-encoding unique text-comparison

Code to find unique elements gives duplicate elements due to different character-encoding


I have a text file with a list of repeated names, some of which contain accented letters such as é, à, î, etc.

e.g. List: Précilia, Maggie, Précilia

I need to write code that outputs the unique names.

However, my text file seems to encode the é differently in the two occurrences of Précilia (I am guessing ASCII for one and UTF-8 for the other). As a result, my code treats the two occurrences of Précilia as different unique elements. My code is below:

 seen = set()
 with open('./Desktop/input1.txt') as infile:
     with open('./Desktop/output.txt', 'w') as outfile:
         for line in infile:
             if line not in seen:
                 outfile.write(line)
                 seen.add(line)

Expected output: Précilia, Maggie

Actual (incorrect) output: Précilia, Maggie, Précilia

Update: The original file is very large. I need a way to treat both of these occurrences as a single one.


Solution

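  • My boss suggested Unicode normalization. It replaces equivalent sequences of characters so that any two equivalent texts reduce to the same sequence of code points, called the normalization form (or normal form) of the original text. For example, é can be stored either as the single code point U+00E9 or as a plain e followed by the combining acute accent U+0301; the two look identical but compare as different strings until they are normalized.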

    More details can be found at https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html and https://github.com/aws/aws-cli/issues/1639

    So far we have gotten positive results on our test cases, and hopefully this will work on our main data set too.
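For anyone landing here later, this is roughly how the normalization step can be wired into the loop from the question. It is a minimal sketch, not necessarily the exact code we ended up with: the helper names `unique_lines` and `dedupe_file` are made up for illustration, and it assumes the file is UTF-8 and that NFC is the form you want. It uses `unicodedata.normalize` from the standard library.

```python
import unicodedata


def unique_lines(lines, form="NFC"):
    """Yield each distinct line once, comparing lines by their normalized form.

    `unique_lines` is a hypothetical helper name; `form` defaults to NFC,
    which is an assumption about what the data needs.
    """
    seen = set()
    for line in lines:
        # Drop the trailing newline so a final line without one still matches.
        key = unicodedata.normalize(form, line.rstrip("\n"))
        if key not in seen:
            seen.add(key)
            yield key


def dedupe_file(src, dst, encoding="utf-8", form="NFC"):
    """Stream `src` line by line and write the unique normalized lines to `dst`.

    Hypothetical helper; the UTF-8 default is an assumption about the input.
    """
    with open(src, encoding=encoding) as infile, \
            open(dst, "w", encoding=encoding) as outfile:
        for name in unique_lines(infile, form):
            outfile.write(name + "\n")


# Two spellings of the same name: e + combining accent vs. precomposed é.
decomposed = "Pre\u0301cilia"
composed = "Pr\u00e9cilia"
sample = [decomposed + "\n", "Maggie\n", composed + "\n"]
result = list(unique_lines(sample))
```

With the paths from the question, the call would be `dedupe_file('./Desktop/input1.txt', './Desktop/output.txt')`. The file is read one line at a time, so only the set of unique names is held in memory, which should help with a very large input. If the data also contains compatibility characters (ligatures, full-width letters and the like), passing `form="NFKC"` may be worth trying, but that goes beyond what we have tested.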