python unicode combining-marks

Python isalpha() doesn't handle Unicode combining marks properly?


I came across an unusual Ukrainian word, Кири́лл. I decoded it to a unicode string and tested it with isalpha(), which returned False. After looking around I found that the word contains a character named 'combining acute accent' (U+0301). So the letter и́ is actually a sequence of two characters: и followed by ́. If I understand correctly, combining marks (like this acute accent) exist only to modify other characters, so I would expect isalpha() to recognize this string as a word. Am I wrong? Is there any way to get the correct result? Here is the word in question, encoded as UTF-8:

word = '\xd0\x9a\xd0\xb8\xd1\x80\xd0\xb8\xcc\x81\xd0\xbb\xd0\xbb'
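
The claim that the accented letter is two separate code points can be checked with the standard unicodedata module. This is only a small sketch, assuming the byte string above is valid UTF-8; the name rows is chosen for illustration:

```python
import unicodedata

word = b'\xd0\x9a\xd0\xb8\xd1\x80\xd0\xb8\xcc\x81\xd0\xbb\xd0\xbb'.decode("utf-8")

# One (code point, general category, official name) tuple per character.
rows = [(ord(ch), unicodedata.category(ch), unicodedata.name(ch)) for ch in word]
for code_point, category, name in rows:
    print("U+%04X  %s  %s" % (code_point, category, name))
```

The accent shows up as its own entry with category Mn (nonspacing mark), while the letters are in the L (letter) categories that isalpha() accepts.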


Solution

  • I think you will need to strip out any combining (modifier) characters before calling isalpha(), since a combining mark is not itself considered alphabetic:

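    import re

    # UTF-8 byte sequences of the marks to strip, joined with "|"; extend the placeholders as needed.
    modifiers = "\xcc\x81|<OTHER>|<MODIFIERS>"

    # my_text is the UTF-8 encoded byte string (e.g. `word` from the question).
    text_to_analyze = re.sub(modifiers, "", my_text)
    print unicode(text_to_analyze, "utf8").isalpha()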
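
Another option, instead of listing the byte sequences by hand, might be to let the standard unicodedata module identify the marks. The following is a minimal sketch and not part of the answer above. The helper name is_alpha_ignoring_marks is made up for illustration, and it assumes that dropping every character in the Unicode Mark categories (Mn, Mc, Me) is acceptable for your data:

```python
import unicodedata


def is_alpha_ignoring_marks(text):
    """Return True if text is alphabetic once combining marks are dropped."""
    # NFD splits precomposed letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters whose general category is not a Mark (Mn, Mc, Me).
    stripped = u"".join(ch for ch in decomposed
                        if not unicodedata.category(ch).startswith("M"))
    return stripped.isalpha()


word = b'\xd0\x9a\xd0\xb8\xd1\x80\xd0\xb8\xcc\x81\xd0\xbb\xd0\xbb'.decode("utf-8")
print(word.isalpha())                 # False: U+0301 is not a letter
print(is_alpha_ignoring_marks(word))  # True: only Cyrillic letters remain
```

Note that because of the NFD step, letters such as й are also reduced to their base letter before the check. That does not change the isalpha() result, but the stripped string should not be used in place of the original text.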