[SOLVED] Splitting Thai text by characters

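I want to split Thai text into characters as a reader perceives them. Splitting by word boundaries is a separate problem, and that one is solvable.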

Example:

#!/usr/bin/env python3
text = 'เมื่อแรกเริ่ม'
for char in text:
    print(char)

This produces:
เ
ม
ื
่
อ
แ
ร
ก
เ
ร
ิ
่
ม

This is obviously not the desired output: the vowel and tone marks are split from the consonants they belong to. Any ideas?

A portable representation of the text is:

text = u'\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'

Solution

tl;dr: Use the \X regular expression to extract user-perceived characters (extended grapheme clusters):

>>> import regex # $ pip install regex
>>> regex.findall(u'\\X', u'เมื่อแรกเริ่ม')
['เ', 'มื่', 'อ', 'แ', 'ร', 'ก', 'เ', 'ริ่', 'ม']

While I do not know Thai, I know a little French.

Consider the letter è. Let s and s2 equal è in the Python shell:

>>> s
'è'
>>> s2
'è'

Same letter? To a French speaker visually, oui. To a computer, no:

>>> s==s2
False

You can create the same letter either with the single code point for è or by following the letter e with a combining code point that adds the accent. The two forms have different encodings:

>>> s.encode('utf-8')
b'\xc3\xa8'
>>> s2.encode('utf-8')
b'e\xcc\x80'

And different lengths:

>>> len(s)
1
>>> len(s2)
2

But visually both sequences render as the 'letter' è. This is called a grapheme: what the end user considers one character.
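The session above does not show how s and s2 were created. One way to reproduce them, assuming s is the precomposed code point U+00E8 and s2 is e followed by U+0300 (which is consistent with the UTF-8 bytes shown), might look like this:

```python
# Assumption: s is the precomposed form, s2 is "e" plus a combining grave accent.
s = '\u00e8'     # LATIN SMALL LETTER E WITH GRAVE
s2 = 'e\u0300'   # LATIN SMALL LETTER E followed by COMBINING GRAVE ACCENT

print(s == s2)             # False
print(len(s), len(s2))     # 1 2
print(s.encode('utf-8'))   # b'\xc3\xa8'
print(s2.encode('utf-8'))  # b'e\xcc\x80'
```

Writing the strings with escapes avoids depending on how an editor or terminal happens to encode a pasted è.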

You can demonstrate the same looping behavior you are seeing:

>>> [c for c in s]
['è']
>>> [c for c in s2]
['e', '̀']

Your string has several combining characters in it. Hence a Thai string that looks like 9 graphemes to your eyes is a string of 13 code points to Python.
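To see which of the 13 code points are combining marks, a quick inspection with the standard unicodedata module can help. This is only a sketch; it uses the escaped form of the string from the question:

```python
import unicodedata

text = '\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'

# One row per code point: its number, general category, and Unicode name.
rows = [(f'U+{ord(c):04X}', unicodedata.category(c), unicodedata.name(c)) for c in text]
for row in rows:
    print(*row)

# Categories starting with "M" are marks that attach to the preceding character.
marks = [c for c in text if unicodedata.category(c).startswith('M')]
print(len(text), len(marks))  # 13 4
```

The four marks fold into two of the consonants, which is how 13 code points become 9 user-perceived characters.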

For French, the solution is to normalize the string to a canonical form based on Unicode equivalence:

>>> from unicodedata import normalize
>>> normalize('NFC', s2) == s
True
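For comparison, here is the same normalization applied to the Thai string from the question. Unicode has no precomposed code points for these consonant-plus-mark combinations, so NFC has nothing to compose and the string stays as it was:

```python
from unicodedata import normalize

text = '\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'

composed = normalize('NFC', text)
print(composed == text)  # True: normalization changed nothing
print(len(composed))     # 13: still 13 code points, not 9
```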

Normalization does not help for many non-Latin scripts, though, because they have no precomposed forms to normalize to. An easy way to handle Unicode strings in which several code points may compose a single grapheme is a regex engine that supports \X. Unfortunately, Python's built-in re module doesn't yet.

The third-party regex module, proposed as a replacement for re, does support \X:

>>> import regex
>>> text = 'เมื่อแรกเริ่ม'
>>> regex.findall(r'\X', text)
['เ', 'มื่', 'อ', 'แ', 'ร', 'ก', 'เ', 'ริ่', 'ม']
>>> len(_)
9
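If installing regex is not an option, a rough standard-library approximation is to attach every mark (general category starting with "M") to the character before it. The function name split_clusters is hypothetical, and this is not the full grapheme cluster algorithm that \X implements: it ignores cases such as emoji sequences, Hangul jamo, and spacing vowels that are not categorized as marks. Treat it as a sketch that happens to work for the string in the question:

```python
import unicodedata

def split_clusters(text):
    """Approximate grapheme split: glue each mark (category M*) onto the previous character."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith('M'):
            clusters[-1] += ch   # combining mark: extend the current cluster
        else:
            clusters.append(ch)  # base character: start a new cluster
    return clusters

text = '\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'
parts = split_clusters(text)
print(parts)
print(len(parts))  # 9
```

For anything beyond simple base-plus-mark text, regex with \X remains the safer choice.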