I am attempting to iterate over the following string:
mɔ̃tr
But no matter what I do, it always ends up getting processed as:
m ɔ ̃ t r
The tilde seems to detach from the reversed c.
One of my first attempts was to do the following:
"mɔ̃tr".map {
    print(it)
}
The tilde would not stay with the reversed c.
I saw suggestions for the following iterator:
fun codePoints(string: String): Iterable<String> {
    return object : Iterable<String> {
        override fun iterator(): MutableIterator<String> {
            return object : MutableIterator<String> {
                var nextIndex = 0
                override fun hasNext(): Boolean {
                    return nextIndex < string.length
                }
                override fun next(): String {
                    val result = string.codePointAt(nextIndex)
                    nextIndex += Character.charCount(result)
                    return String(Character.toChars(result))
                }
                override fun remove() {
                    throw UnsupportedOperationException()
                }
            }
        }
    }
}
But this gave the same output as the previous example.
I have been stuck on this seemingly simple problem for a day now. I just want to process this string as though it had 4 characters, not 5.
Any tips?
"ɔ̃" consists of two Unicode code points: U+0254 (LATIN SMALL LETTER OPEN O) followed by U+0303 (COMBINING TILDE). This is why the code point iterator you showed still splits ɔ̃ in two.
However, "ɔ̃" is a single grapheme cluster. To iterate over grapheme clusters, you need a java.text.BreakIterator. The documentation includes an example that shows how:
public static void printEachForward(BreakIterator boundary, String source) {
    int start = boundary.first();
    for (int end = boundary.next();
         end != BreakIterator.DONE;
         start = end, end = boundary.next()) {
        System.out.println(source.substring(start, end));
    }
}
In Kotlin, you can write an extension function on String that returns a Sequence of the grapheme clusters:
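import java.text.BreakIterator

fun String.graphemeClusterSequence(): Sequence<String> = sequence {
    // A character-instance BreakIterator finds boundaries between user-perceived characters.
    val iterator = BreakIterator.getCharacterInstance()
    iterator.setText(this@graphemeClusterSequence)
    var start = iterator.first()
    var end = iterator.next()
    while (end != BreakIterator.DONE) {
        // Each [start, end) range is one grapheme cluster.
        yield(this@graphemeClusterSequence.substring(start, end))
        start = end
        end = iterator.next()
    }
}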
Now "mɔ̃tr".graphemeClusterSequence().forEach { println(it) }
prints:
m
ɔ̃
t
r
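To check the difference between UTF-16 code units, code points, and grapheme clusters on this particular string, here is a minimal sketch. It assumes the JVM's java.text.BreakIterator as above; graphemeClusters is a hypothetical helper name that collects the clusters into a List rather than a Sequence, and the string is spelled with escapes so the combining tilde is explicit.

```kotlin
import java.text.BreakIterator

// Hypothetical helper (not part of the standard library): collects the
// grapheme clusters of a string into a List using java.text.BreakIterator.
fun String.graphemeClusters(): List<String> {
    val boundary = BreakIterator.getCharacterInstance()
    boundary.setText(this)
    val clusters = mutableListOf<String>()
    var start = boundary.first()
    var end = boundary.next()
    while (end != BreakIterator.DONE) {
        clusters += substring(start, end)
        start = end
        end = boundary.next()
    }
    return clusters
}

fun main() {
    // "mɔ̃tr" written with escapes: U+0254 (open o) followed by U+0303 (combining tilde).
    val word = "m\u0254\u0303tr"

    println(word.length)                          // number of UTF-16 code units
    println(word.codePointCount(0, word.length))  // number of Unicode code points
    val clusters = word.graphemeClusters()
    println(clusters.size)                        // number of grapheme clusters

    // Reversing cluster by cluster keeps the tilde attached to its base letter.
    println(clusters.asReversed().joinToString(""))
}
```

For this string the first two counts are 5 and the cluster count is 4, which matches the "4 characters, not 5" you are after. Note that java.text.BreakIterator on older JDKs may not follow the latest Unicode segmentation rules for more complex sequences (emoji, for instance), so treat this as a sketch for combining marks like the one here.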