
How to convert a single-byte charset (non-ASCII) ByteArray into a Kotlin UTF-8 String (how to avoid ��?)


I have an API that returns results in a specific single-byte charset (Windows-1257), and I read the result in Kotlin like this:

val connection = URL("http://192.168.1.21:92/someAPI").openConnection() as HttpURLConnection
var byteArray: ByteArray = ByteArray(10000000)
connection.inputStream.read(byteArray)
val tmp = String(byteArray, Charsets.UTF_8).trim()

Of course, this code is incorrect, because it presumes that byteArray holds a UTF-8 encoded string. The obvious fix would be to use Charsets.WIN_1257, but Kotlin's Charsets object has no such constant. My byte array holds a Windows-1257 encoded string; how can I decode it into a correct (UTF-8) Kotlin String?

Here is a simple test that isolates my problem and can be run at https://play.kotlinlang.org:

/**
 * You can edit, run, and share this code.
 * play.kotlinlang.org
 */
fun main() {
    var byteArray: ByteArray = listOf(0xe2, 0x72).map { it.toByte() }.toByteArray()
    println(String(byteArray, Charsets.UTF_8))
}

One can see that decoding with UTF_8 produces:

�r

But I expect:

ār

Solution

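  • Kotlin's Charsets object only exposes a handful of standard charsets, but any charset the JVM supports can be looked up by name. Look into Charset.availableCharsets() to see what is installed; Charset.forName("windows-1257") might just work.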
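
The suggestion above can be sketched as follows. This is a minimal example, not a definitive implementation: it assumes the JVM in use ships the windows-1257 charset (Charset.forName throws UnsupportedCharsetException otherwise), and the helper names decodeWin1257 and readWin1257 are made up for illustration. It reads the stream with readBytes() instead of a single read() into a fixed buffer, because one read() call is not guaranteed to return the whole response and the unused part of the buffer would be decoded as NUL characters.

```kotlin
import java.io.ByteArrayInputStream
import java.io.InputStream
import java.nio.charset.Charset

// Assumption: "windows-1257" is available on this JVM.
// Charset.forName throws UnsupportedCharsetException if it is not.
val WIN_1257: Charset = Charset.forName("windows-1257")

// Hypothetical helper: decode a byte array as Windows-1257 text.
fun decodeWin1257(bytes: ByteArray): String = String(bytes, WIN_1257)

// Hypothetical helper: read the whole stream, then decode it.
fun readWin1257(input: InputStream): String =
    input.use { decodeWin1257(it.readBytes()) }

fun main() {
    // The same two bytes as in the question's test.
    val byteArray = byteArrayOf(0xe2.toByte(), 0x72)
    println(decodeWin1257(byteArray)) // prints "ār"

    // The stream variant, fed from memory here instead of an HTTP connection.
    println(readWin1257(ByteArrayInputStream(byteArray)))
}
```

With the connection from the question, readWin1257(connection.inputStream) would take the place of the read/String(...) pair. The resulting String is ordinary Kotlin text; an encoding only comes into play again when it is written out, for example with toByteArray(Charsets.UTF_8).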