[SOLVED] Find word boundaries in Southeast Asian Languages (Thai, Khmer, Lao, Myanmar)

Many languages put spaces between words, so it is easy to find places to wrap to a new line. In English, for example, any space is an opportunity to wrap.

However, languages like Thai still wrap lines between words, but have no spaces to mark where one word ends and the next begins.

I am using a speech-to-text algorithm to transcribe audio into strings, then splitting the result into line-by-line captions based on timestamps. For space-delimited languages this is pretty easy: I split on spaces to get tokens. But I am worried that for Southeast Asian users, it will produce unintelligible splits between lines.

For these languages, is it possible to find the points in a string at which a line break would be allowed? I have to assume UIKit does this internally somehow; otherwise, Thai text in a UILabel would get incorrect line breaks.

See below for an explanation and visual example.

Solution

NLTokenizer, from the NaturalLanguage framework, can split text into words.

import NaturalLanguage

let tokenizer = NLTokenizer(unit: .word)
tokenizer.setLanguage(.thai)

// or:
//tokenizer.setLanguage(.khmer)
//tokenizer.setLanguage(.lao)
//tokenizer.setLanguage(.burmese)

let text = "ทำอะไรอยู่ล่ะคุณนาย"
tokenizer.string = text

// this returns the ranges of all the words
let tokenRanges = tokenizer.tokens(for: text.startIndex..<text.endIndex)
for tokenRange in tokenRanges {
    print(text[tokenRange])
}

Output:

ทำ
อะไร
อยู่
ล่ะ
คุณนาย

In your real code, you could use the upper bound of each word range as the "word boundary". Alternatively, since you are doing captions, it might be more suitable to use:

let lastWordRange = tokenizer.tokenRange(at: index)

This gets the range of the word at a particular index (a String.Index into the text). Pick an index somewhere near your desired cut-off point, then cut the string off at lastWordRange.upperBound.
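The first suggestion above, treating the upper bound of each word range as a possible break point, could be sketched roughly as follows. This is only a minimal sketch: captionLines and its maxLength parameter are hypothetical names that are not part of NaturalLanguage, and length is counted in Characters rather than rendered width, which is an assumption you may want to replace with real text measurement.

```swift
import Foundation
import NaturalLanguage

/// Hypothetical helper (not a framework API): greedily packs words into
/// lines of at most `maxLength` Characters, breaking only at word
/// boundaries reported by NLTokenizer.
func captionLines(for text: String, language: NLLanguage, maxLength: Int) -> [String] {
    let tokenizer = NLTokenizer(unit: .word)
    tokenizer.setLanguage(language)
    tokenizer.string = text

    var lines: [String] = []
    var lineStart = text.startIndex   // where the current line begins
    var lastBreak = text.startIndex   // upper bound of the last word that fit

    for wordRange in tokenizer.tokens(for: text.startIndex..<text.endIndex) {
        // Would this word push the line past the limit? If so, close the
        // line at the previous word boundary (if there is one on this line).
        let lengthWithWord = text.distance(from: lineStart, to: wordRange.upperBound)
        if lengthWithWord > maxLength, lastBreak > lineStart {
            lines.append(text[lineStart..<lastBreak].trimmingCharacters(in: .whitespaces))
            lineStart = lastBreak
        }
        lastBreak = wordRange.upperBound
    }

    // Whatever remains after the last break becomes the final line.
    if lineStart < text.endIndex {
        lines.append(text[lineStart...].trimmingCharacters(in: .whitespaces))
    }
    return lines
}

let lines = captionLines(for: "ทำอะไรอยู่ล่ะคุณนาย", language: .thai, maxLength: 10)
lines.forEach { print($0) }
```

A single word longer than maxLength is left whole on its own line rather than split. Any punctuation that sits between two words ends up at the start of the following line, so you may want to adjust the break point if your transcripts contain punctuation.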