
Kotlin: split a UTF string into single-length substrings using code points


I'm just starting with Kotlin, so I'm sure there is an easy way to do this, but I don't see it. I want to split a string into single-length substrings using code points. In Java 8, this works:

public class UtfSplit {
    static String [] utf8Split (String str) {
        int [] codepoints = str.codePoints().toArray();
        String [] rv = new String[codepoints.length];
        for (int i = 0; i < codepoints.length; i++)
            rv[i] = new String(codepoints, i, 1);
        return rv;
    }
    public static void main(String [] args) {
        String test = "こんにちは皆さん";
        System.out.println("Test string:" + test);
        StringBuilder sb = new StringBuilder("Result:");
        for(String s : utf8Split(test))
            sb.append(s).append(", ");
        System.out.println(sb.toString());
    }
}

Output is:

Test string:こんにちは皆さん
Result:こ, ん, に, ち, は, 皆, さ, ん, 

How would I do this in Kotlin? I can get to the code points, although it's clumsy and I'm sure I'm doing it wrong, but I can't get from the code points back to strings. The whole string/character interface seems different to me and I'm just not getting it.

Thanks Steve S.


Solution

  • You are using the same runtime as Java so the code is basically doing the same thing. However, the Kotlin version is shorter, and also has no need for a class, although you could group utility methods into an object. Here is the version using top-level functions:

    fun splitByCodePoint(str: String): Array<String> {
        val codepoints = str.codePoints().toArray()
        return Array(codepoints.size) { index ->
            String(codepoints, index, 1)
        }
    }
    
    fun main(args: Array<String>) {
        val input = "こんにちは皆さん"
        val result = splitByCodePoint(input)
    
        println("Test string: ${input}")
        println("Result:      ${result.joinToString(", ")}")
    }
    

    Output:

    Test string: こんにちは皆さん
    Result:      こ, ん, に, ち, は, 皆, さ, ん

    Note: I renamed the function because the encoding doesn't matter here; you are splitting by code points, not by UTF-8 bytes.
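
    To illustrate why the split is about code points rather than any particular encoding, here is a minimal sketch using a string that contains a character outside the Basic Multilingual Plane. The sample string and the helper name `codePointStrings` are made up for this illustration, and `codePoints()` assumes Java 8 or later on the JVM:

    ```kotlin
    // Hypothetical helper for illustration: one String per Unicode code point.
    fun codePointStrings(str: String): List<String> =
        str.codePoints().toArray().map { cp -> String(Character.toChars(cp)) }

    fun main() {
        // U+1F600 is stored as a surrogate pair, i.e. two Chars in a Kotlin/Java String.
        val sample = "a\uD83D\uDE00b"

        println(sample.length)                            // 4: number of UTF-16 Chars
        println(sample.codePointCount(0, sample.length))  // 3: number of code points
        println(codePointStrings(sample).size)            // 3: the pair stays together
        println(sample.map { it.toString() }.size)        // 4: splitting by Char breaks the pair
    }
    ```

    For the Japanese test string above, every character fits in a single Char, so splitting by Char and splitting by code point would give the same result; the difference only shows up with supplementary characters such as emoji.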

    Some might write this without the local variable:

    fun splitByCodePoint(str: String): Array<String> {
        return str.codePoints().toArray().let { codepoints ->
            Array(codepoints.size) { index -> String(codepoints, index, 1) }
        }
    }
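
    If `codePoints()` is not available (it was added in Java 8), the same split can probably be done with a plain index loop. This is only an illustrative alternative and not part of the code above; the name `splitByCodePointLoop` is made up here, and it returns a `List<String>` rather than an array:

    ```kotlin
    // Hypothetical alternative: walk the string by Char index instead of calling codePoints().
    fun splitByCodePointLoop(str: String): List<String> {
        val parts = ArrayList<String>()
        var i = 0
        while (i < str.length) {
            val cp = str.codePointAt(i)          // full code point starting at Char index i
            val width = Character.charCount(cp)  // 1 for BMP characters, 2 for surrogate pairs
            parts.add(str.substring(i, i + width))
            i += width
        }
        return parts
    }

    fun main() {
        println(splitByCodePointLoop("こんにちは皆さん").joinToString(", "))
        // prints: こ, ん, に, ち, は, 皆, さ, ん
    }
    ```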
    
