python string utf-8 diacritics combining-marks

How do I compare characters with combining diacritic marks (ɔ̃, ɛ̃ and ɑ̃) to unaccented ones in Python (read from a UTF-8 encoded text file)?


Summary: I want to compare ɔ̃, ɛ̃ and ɑ̃ to ɔ, ɛ and a, which are all different, but my text file has ɔ̃, ɛ̃ and ɑ̃ written as ɔ~, ɛ~ and a~.


I wrote a script which moves along the characters in two words simultaneously, comparing them to find the pair of characters which is different. The words are of equal length (except for the diacritic issue, which introduces an extra character), and represent the IPA phonetic pronunciation of two French words only one phoneme apart.

The ultimate goal is to filter a list of Anki cards so that only certain pairs of phonemes are included, because other pairs are too easy to recognize. Each pair of words represents an Anki note.

For this I need to differentiate the nasal sounds ɔ̃, ɛ̃ and ɑ̃ from other sounds, as they are only really confusable with each other.

As written, the code treats an accented character as the character plus ~, and so as two characters. Thus, if the only difference between two words is a final accented versus non-accented character, the script finds no difference on the last letter, then finds one word shorter than the other (the other still has the ~ left) and throws an error trying to compare one more character. This is a whole 'problem' by itself, but if I can get the accented characters to read as single units, the words will have the same lengths and it will disappear.

I do not want to replace the accented characters with non-accented ones, as some people do for comparisons, because they are different sounds.

I have tried 'normalizing' the unicode to a 'combined' form, e.g. unicodedata.normalize('NFKC', line), but it didn't change anything.
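
The observation can be reproduced in isolation. The following is a minimal sketch, not part of the original script; the sample strings are illustrative and built from explicit code points. It compares a base letter that has a precomposed tilde form in Unicode (a) with one that does not (ɔ). Note that `unicodedata.normalize` returns a new string rather than modifying its argument, so the result has to be assigned.

```python
import unicodedata

# Illustrative sample strings (assumptions, not taken from the input file).
a_tilde = "a\u0303"            # a + COMBINING TILDE; precomposed ã (U+00E3) exists
open_o_tilde = "\u0254\u0303"  # ɔ + COMBINING TILDE; no precomposed form exists

# normalize() returns the normalized string; the original is left unchanged.
composed_a = unicodedata.normalize("NFKC", a_tilde)
composed_o = unicodedata.normalize("NFKC", open_o_tilde)

print(len(a_tilde), len(composed_a))       # 2 1: collapsed into one code point
print(len(open_o_tilde), len(composed_o))  # 2 2: still base letter + tilde
```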


Here is some output, including the line at which it throws the error. The printouts show the two words and the character of each word that the code is comparing; the number is the index of that character within the word. The final two letters on each line are therefore what the script 'thinks' the two characters are, and it sees the same thing for ɛ̃ and ɛ. It also chooses the wrong pair of letters when it reports the differences, and it's important that the pair is right because I compare it with a master list of allowable pairs.

0 alyʁ alɔʁ a a # this first word is done well
1 alyʁ alɔʁ l l
2 alyʁ alɔʁ y ɔ # it doesn't continue to compare the ʁ because it found the difference
...
0 ɑ̃bisjø ɑ̃bisjɔ̃ ɑ ɑ
1 ɑ̃bisjø ɑ̃bisjɔ̃ ̃ ̃  # the tildes are compared / treated  separately
2 ɑ̃bisjø ɑ̃bisjɔ̃ b b
3 ɑ̃bisjø ɑ̃bisjɔ̃ i i
4 ɑ̃bisjø ɑ̃bisjɔ̃ s s
5 ɑ̃bisjø ɑ̃bisjɔ̃ j j
6 ɑ̃bisjø ɑ̃bisjɔ̃ ø ɔ # luckily that wasn't where the difference was, this is
...
0 osi ɛ̃si o ɛ # here it should report (o, ɛ̃), not (o, ɛ)
...
0 bɛ̃ bɔ̃ b b
1 bɛ̃ bɔ̃ ɛ ɔ # an error of this type
...
0 bo ba b b
1 bo ba o a # this is working correctly 
...
0 bjɛ bjɛ̃ b b
1 bjɛ bjɛ̃ j j
2 bjɛ bjɛ̃ ɛ ɛ # AND here's the money, it thinks these are the same letter, but it has also run out of characters to compare from the first word, so it throws the error below
Traceback (most recent call last):

  File "C:\Users\tchak\OneDrive\Desktop\French.py", line 42, in <module>
    letter1 = line[0][index]

IndexError: string index out of range

Here is the code:

def lens(word):
    return len(word)

# open file, and new file to write to
input_file = "./phonetics_input.txt"
output_file = "./phonetics_output.txt"
set1 = ["e", "ɛ", "œ", "ø", "ə"]
set2 = ["ø", "o", "œ", "ɔ", "ə"]
set3 = ["ə", "i", "y"]
set4 = ["u", "y", "ə"]
set5 = ["ɑ̃", "ɔ̃", "ɛ̃", "ə"]
set6 = ["a", "ə"]
vowelsets = [set1, set2, set3, set4, set5, set6]
with open(input_file, encoding="utf8") as ipf, open(output_file, encoding="utf8") as opf:
    # for line in file; 
    vowelpairs= []
    acceptedvowelpairs = []
    input_lines = ipf.readlines()
    print(len(input_lines))
    for line in input_lines:
        # find word IPA transcripts
        unicodedata.normalize('NFKC', line)
        line = line.split("/")
        line.sort(key = lens)
        line = line[0:2] # the shortest two strings after splitting are the ipa words
        index = 0
        letter1 = line[0][index]
        letter2 = line[1][index]
        print(index, line[0], line[1], letter1, letter2)
            
        linelen = max(len(line[0]), len(line[1]))
        while letter1 == letter2:
            index += 1
            letter1 = line[0][index] # throws the error here, technically, after printing the last characters and incrementing the index one more
            letter2 = line[1][index]
            print(index, line[0], line[1], letter1, letter2)
            
        vowelpairs.append((letter1, letter2))   
        
    for i in vowelpairs:
        for vowelset in vowelsets:
            if set(i).issubset(vowelset):
                acceptedvowelpairs.append(i)
    print(len(vowelpairs))
    print(len(acceptedvowelpairs))

Solution

  • Unicode normalization does not help for these particular character combinations, because Unicode defines no precomposed characters for them: filtering the Unicode database file UnicodeData.txt with the simple regex "Latin.*Letter.*with tilde$" gives only ÃÑÕãñõĨĩŨũṼṽẼẽỸỹ (no Latin letters Open O, Open E or Alpha). So you need to step through the two compared strings separately, each with its own index, as follows (I have omitted most of your code to make a Minimal, Reproducible Example):

    import unicodedata
    
    def lens(word):
        return len(word)
    
    input_lines = ['alyʁ/alɔʁ', 'ɑ̃bisjø/ɑ̃bisjɔ̃ ', 'osi/ɛ̃si', 'bɛ̃ /bɔ̃ ', 'bo/ba', 'bjɛ/bjɛ̃ ']
    print(len(input_lines))
    for line in input_lines:
        print('')
        # find word IPA transcripts
        line = unicodedata.normalize('NFKC', line.rstrip('\n'))
        line = line.split("/")
        line.sort(key = lens)
        word1, word2 = line[0:2] # the shortest two strings after splitting are the ipa words
        index = i1 = i2 = 0
        while i1 < len(word1) and i2 < len(word2):
            letter1 = word1[i1]
            i1 += 1
            if i1 < len(word1) and unicodedata.category(word1[i1]) == 'Mn':
                letter1 += word1[i1]
                i1 += 1
            letter2 = word2[i2]
            i2 += 1
            if i2 < len(word2) and unicodedata.category(word2[i2]) == 'Mn':
                letter2 += word2[i2]
                i2 += 1
            same = chr(0xA0) if letter1 == letter2 else '#' 
            print(index, same, word1, word2, letter1, letter2)
            index += 1
            #if same != chr(0xA0):
            #    break
    

    Output: .\SO\67335977.py

    6
    
    0   alyʁ alɔʁ a a
    1   alyʁ alɔʁ l l
    2 # alyʁ alɔʁ y ɔ
    3   alyʁ alɔʁ ʁ ʁ
    
    0   ɑ̃bisjø ɑ̃bisjɔ̃  ɑ̃ ɑ̃
    1   ɑ̃bisjø ɑ̃bisjɔ̃  b b
    2   ɑ̃bisjø ɑ̃bisjɔ̃  i i
    3   ɑ̃bisjø ɑ̃bisjɔ̃  s s
    4   ɑ̃bisjø ɑ̃bisjɔ̃  j j
    5 # ɑ̃bisjø ɑ̃bisjɔ̃  ø ɔ̃
    
    0 # osi ɛ̃si o ɛ̃
    1   osi ɛ̃si s s
    2   osi ɛ̃si i i
    
    0   bɛ̃  bɔ̃  b b
    1 # bɛ̃  bɔ̃  ɛ̃ ɔ̃
    2   bɛ̃  bɔ̃
    
    0   bo ba b b
    1 # bo ba o a
    
    0   bjɛ bjɛ̃  b b
    1   bjɛ bjɛ̃  j j
    2 # bjɛ bjɛ̃  ɛ ɛ̃
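
The same idea can be packaged as a pair of helpers that first split each word into units (a base character plus any nonspacing marks that follow it) and then compare the units pairwise. This is only a sketch under stated assumptions: the names `split_clusters` and `first_difference` are hypothetical and do not appear in the code above, and unlike the loop above it attaches any number of consecutive Mn characters to the preceding base character, not just one.

```python
import unicodedata
from itertools import zip_longest

def split_clusters(word):
    """Split a string into units: a base character plus following Mn marks."""
    clusters = []
    for ch in word:
        # Category 'Mn' (nonspacing mark) is the same test the loop above uses.
        if clusters and unicodedata.category(ch) == "Mn":
            clusters[-1] += ch   # attach the mark to the preceding unit
        else:
            clusters.append(ch)  # start a new unit
    return clusters

def first_difference(word1, word2):
    """Return the first pair of differing units, or None if there is none."""
    units1 = split_clusters(word1)
    units2 = split_clusters(word2)
    # fillvalue="" means a missing unit in the shorter word counts as a difference.
    for unit1, unit2 in zip_longest(units1, units2, fillvalue=""):
        if unit1 != unit2:
            return unit1, unit2
    return None

# Words written with explicit code points so the combining tilde is visible.
print(split_clusters("\u0251\u0303bisj\u00f8"))          # 6 units, not 7 characters
print(first_difference("bj\u025b", "bj\u025b\u0303"))    # ɛ versus ɛ + tilde
print(first_difference("osi", "\u025b\u0303si"))         # o versus ɛ + tilde
```

Because each returned unit is a whole string such as ɛ followed by the combining tilde, the pair can be checked directly against lists like set5 in the question, provided both sides use the same (decomposed) representation.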
    

    Note: a diacritic is detected here by testing for Unicode category Mn (nonspacing mark); you can test against another condition instead (e.g. one from the following list):