I want to find the frequency of different letters in a text, where some of them use diacritics. For example, the text uses both 'å' and 'ą̊' (U+00E5 U+0328), and their frequencies need to be counted separately.
How do I do that?
I've tried using the Counter collection, opening the file with utf-8 encoding, and splitting the text string with both text.split() and list(text), but Python still counts 'å' and 'ą̊' as the same letter!
The problem here is that Unicode text (forget about utf-8; I am talking about what you have after decoding your data to proper Python 3 strings) uses more than one code point for some characters: 'ą̊', for example, carries two marks. While both 'ą' and 'å' exist as single code points after proper normalization, a letter that takes both marks has no precomposed form and has to use at least one of the "combining mark" characters in Unicode.
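A quick check illustrates this (a minimal sketch; the code points are the ones from your example):

import unicodedata

single = "\u00e5"          # 'å': one precomposed code point
combined = "\u00e5\u0328"  # 'ą̊': 'å' followed by a combining ogonek

print(len(single), len(combined))  # 1 2
# Even fully composed (NFC), there is no single code point for both marks:
print(len(unicodedata.normalize("NFC", combined)))  # 2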
That means that Python's Counter alone won't be able to handle it without at least an extra pre-processing step. In Python code, the way to find out about these combining characters is unicodedata.category - and it is not that friendly: it just returns a two-character identifier for the category (combining marks are the categories starting with "M").
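For instance (a small sketch printing the raw category codes):

import unicodedata

print(unicodedata.category("a"))       # 'Ll' - lowercase letter
print(unicodedata.category("\u030a"))  # 'Mn' - nonspacing (combining) mark
print(unicodedata.category("\u0328"))  # 'Mn' - the combining ogonek, also a mark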
So, one thing that can be done is to pre-process your text into a list where each base letter is grouped with its combining marks, using some "pure Python" code; then Counter can do its job. It could be something along these lines:
import unicodedata
from collections import Counter

text = ...  # your decoded text (a Python 3 str) goes here

# Decompose all characters into base letters + combining diacritics:
text = unicodedata.normalize("NFD", text)

characters = []
for character in text:
    if characters and unicodedata.category(character)[0] == "M":
        # character is a combining mark, so aggregate it with
        # the previous character
        characters[-1] += character
    else:
        characters.append(character)

counting = Counter(characters)
(Note that checking that characters is non-empty before aggregating guards against a malformed text that starts with a combining mark at position 0; such a stray mark is simply counted as a character of its own.)
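One detail to keep in mind: the keys of the resulting Counter are in decomposed (NFD) form. If you want the usual composed form for display, you can normalize each key back (a minimal sketch, reusing the counting variable built above):

composed_counts = {unicodedata.normalize("NFC", letter): amount
                   for letter, amount in counting.items()}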