Sequence of combining characters with a circumflex


I have a document containing the following sentence: "Mon frère aîné". I get each character with QTextCursor.

from PySide6 import QtWidgets, QtGui
import os, sys, PySide6
dirname = os.path.dirname(PySide6.__file__)
plugin_path = os.path.join(dirname, 'plugins', 'platforms')
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = plugin_path

doc = QtGui.QTextDocument()
step = 0
doc.setPlainText("Mon frère aîné")
for num, sen in enumerate("Mon frère aîné"):
    tc = QtGui.QTextCursor(doc)
    can_move = tc.movePosition(tc.NextCharacter, tc.MoveAnchor, step+1)
    if can_move:
        tc.movePosition(tc.PreviousCharacter, tc.KeepAnchor, 1)
        print(tc.selectedText(), num, sen)

    step += 1

Result:

M 0 M
o 1 o
n 2 n
3
f 4 f
r 5 r
è 6 è
r 7 r
e 8 e
9
a 10 a
î 11 i (here)
n 12 ̂ (here)
é 13 n (here)

QTextCursor treats a combining sequence such as "î" as a single character, whereas iterating over the Python string yields the "i" and the combining circumflex "̂" separately.

How can I make the two agree?


Solution

  • The glyph î can be represented two ways in Unicode:

    U+00EE - LATIN SMALL LETTER I WITH CIRCUMFLEX
    

    or:

    U+0069 - LATIN SMALL LETTER I
    U+0302 - COMBINING CIRCUMFLEX ACCENT
    

    QTextCursor seems to be Unicode grapheme-aware and advances a "perceived character" at a time. See Unicode Text Segmentation for more details.
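
    As a rough illustration of what advancing by "perceived character" means, the sketch below groups each base code point with the combining marks that follow it. The helper name `clusters` is hypothetical, and testing `unicodedata.combining` is only an approximation of the full grapheme cluster rules in Unicode Text Segmentation (it does not cover emoji sequences or Hangul jamo, for example).

```python
import unicodedata as ud

def clusters(text):
    """Group each base code point with the combining marks that follow it.

    Hypothetical helper: an approximation, not the full segmentation rules.
    """
    groups = []
    for ch in text:
        # combining() returns a nonzero class for marks such as U+0302
        if groups and ud.combining(ch):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups

mixed = 'Mon fr\xe8re ai\u0302n\xe9'   # same mix of forms as the question's text
parts = clusters(mixed)
print(len(mixed), len(parts))          # code points vs. grouped characters
for num, part in enumerate(parts):
    print(num, part, ascii(part))
```

    Iterating over `parts` instead of the raw string gives one item per perceived character for this sentence, which is the unit the cursor appears to move by.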

    Unicode normalization can convert between the composed and decomposed forms in this case and may be all you need:

    import unicodedata as ud
    
    s1 = '\u00ee'
    s2 = '\u0069\u0302'
    
    print(s1,s2)           # They look the same
    print(len(s1),len(s2))
    
    print(s1 == s2)
    print(s1 == ud.normalize('NFC',s2))  # combined format
    print(ud.normalize('NFD',s1) == s2)  # decomposed format
    

    Output:

    î î
    1 2
    False
    True
    True
    

    In your example, some accented characters are composed and one is decomposed:

    text = "Mon frère aîné"
    print(len(text),text,ascii(text))
    text = ud.normalize('NFC',text)
    print(len(text),text,ascii(text))
    text = ud.normalize('NFD',text)
    print(len(text),text,ascii(text))
    

    Output:

    15 Mon frère aîné 'Mon fr\xe8re ai\u0302n\xe9'       # mix
    14 Mon frère aîné 'Mon fr\xe8re a\xeen\xe9'          # shorter, all combined
    17 Mon frère aîné 'Mon fre\u0300re ai\u0302ne\u0301' # longer, all decomposed
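
    One way to make the Python loop agree with the cursor is to normalize the text to NFC once and use the normalized string everywhere. The sketch below shows only the Python side, and it assumes every accented character in the text has a precomposed form, which holds for this sentence but not for all text.

```python
import unicodedata as ud

text = 'Mon fr\xe8re ai\u0302n\xe9'   # mixed forms, as in the question
nfc = ud.normalize('NFC', text)       # i + U+0302 becomes the single U+00EE

print(len(text), len(nfc))
for num, ch in enumerate(nfc):
    # each accented letter is now a single code point
    print(num, ch, ascii(ch))
```

    Passing the same `nfc` string to `doc.setPlainText` should keep the document and the Python sequence in the same form, so the indices from `enumerate` line up with the cursor steps.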
    

    QTextCursor