
Splitting Thai text by characters


I mean splitting into characters as a reader sees them, not at word boundaries; word segmentation is solvable.

Example:

#!/usr/bin/env python3  
text = 'เมื่อแรกเริ่ม'  
for char in text:  
    print(char)  

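This produces one code point per line, so the vowel and tone marks are separated from their consonants:

เ  
ม  
ื  
่  
อ  
แ  
ร  
ก  
เ  
ร  
ิ  
่  
ม  
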
Which obviously is not the desired output. Any ideas?

A portable representation of text is:

text = u'\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'

Solution

  • tl;dr: Use the \X regular expression to extract user-perceived characters:

    >>> import regex # $ pip install regex
    >>> regex.findall(u'\\X', u'เมื่อแรกเริ่ม')
    ['เ', 'มื่', 'อ', 'แ', 'ร', 'ก', 'เ', 'ริ่', 'ม']
    

    While I do not know Thai, I know a little French.

    Consider the letter è. Suppose s and s2 both display as è in the Python shell:

    >>> s
    'è'
    >>> s2
    'è'
    

    Same letter? To a French speaker visually, oui. To a computer, no:

    >>> s==s2
    False
    

    You can create the same letter either using the actual code point for è or by taking the letter e and adding a combining code point that adds that accent character. They have different encodings:

    >>> s.encode('utf-8')
    b'\xc3\xa8'
    >>> s2.encode('utf-8')
    b'e\xcc\x80'
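
The session above never shows how s and s2 were built. One plausible construction, assuming s is the precomposed code point U+00E8 and s2 is a plain e followed by the combining grave accent U+0300 (which matches the UTF-8 bytes shown), is sketched here:

```python
import unicodedata

# Assumption: s is the single precomposed code point,
# s2 is the base letter followed by a combining mark.
s = '\u00e8'      # LATIN SMALL LETTER E WITH GRAVE
s2 = 'e\u0300'    # 'e' + COMBINING GRAVE ACCENT

for label, value in (('s', s), ('s2', s2)):
    # Show the raw bytes and the official name of every code point.
    names = [unicodedata.name(ch) for ch in value]
    print(label, value.encode('utf-8'), names)
```

Printing the Unicode names makes the difference visible even when both strings render identically on screen.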
    

    And different lengths:

    >>> len(s)
    1
    >>> len(s2)
    2
    

    But visually both encodings result in the 'letter' è. This is called a grapheme, or what the end user considers one character.

    You can demonstrate the same looping behavior you are seeing:

    >>> [c for c in s]
    ['è']
    >>> [c for c in s2]
    ['e', '̀']
    

    Your string has several combining characters in it. Hence a Thai string that looks like 9 characters (graphemes) to your eyes is a 13-code-point string to Python.
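
To see where the extra code points come from, the string can be inspected with the standard unicodedata module. This is only an illustrative sketch: it treats every code point whose general category starts with 'M' as a combining mark, which is a simplification that happens to fit this string.

```python
import unicodedata

# The string from the question, written with escapes so it survives any editor.
text = '\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'

# Categories beginning with 'M' (Mn, Mc, Me) are marks that attach to a base.
marks = [ch for ch in text if unicodedata.category(ch).startswith('M')]

for ch in text:
    print(f'U+{ord(ch):04X}', unicodedata.category(ch), unicodedata.name(ch))

# Total code points, marks, and what is left once marks are folded into bases.
print(len(text), len(marks), len(text) - len(marks))
```

For this string the last line should read 13 4 9: thirteen code points, four of them marks, leaving nine visible units.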

    The solution in French is to normalize the string based on Unicode equivalence:

    >>> from unicodedata import normalize
    >>> normalize('NFC', s2) == s
    True
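
The same call can be tried on the Thai string from the question. As far as I know, Unicode defines no precomposed code points for these Thai consonant-plus-mark combinations, so NFC has nothing to merge. A minimal sketch:

```python
from unicodedata import normalize

# The string from the question, written with escapes.
text = '\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'

composed = normalize('NFC', text)

# Length before, length after, and whether normalization changed anything.
print(len(text), len(composed), composed == text)
```

The string stays at 13 code points, so normalization alone cannot give one element per visible character here.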
    

    Normalization does not help for many non-Latin scripts, though, because they have no precomposed code points to normalize to. An easy way to deal with Unicode strings in which several code points compose a single grapheme is a regex engine that supports \X. Unfortunately, Python's built-in re module doesn't yet.

    The proposed replacement, regex, does support \X though:

    >>> import regex
    >>> text = 'เมื่อแรกเริ่ม'
    >>> regex.findall(r'\X', text)
    ['เ', 'มื่', 'อ', 'แ', 'ร', 'ก', 'เ', 'ริ่', 'ม']
    >>> len(_)
    9
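
If installing regex is not an option, a rough standard-library fallback is to attach every combining mark to the code point before it. The helper name split_clusters below is hypothetical, and this is only an approximation of what \X does: it ignores the full Unicode segmentation rules (for example emoji joined with zero-width joiners, or Hangul jamo sequences), so treat it as a sketch rather than a replacement.

```python
import unicodedata

def split_clusters(text):
    """Group each base code point with the marks that follow it.

    Hypothetical helper: approximates grapheme clusters using only
    the general category; it is not the full Unicode algorithm.
    """
    clusters = []
    for ch in text:
        # Categories Mn, Mc and Me are marks that belong to the previous base.
        if clusters and unicodedata.category(ch).startswith('M'):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

text = '\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'
print(split_clusters(text))
print(len(split_clusters(text)))
```

For the string in the question this yields the same nine pieces as the regex call above, and it also keeps 'e' plus a combining accent together as one element.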