I'm porting my JNA-based library to "pure" Java using the Foreign Function and Memory API ([JEP 424][1]) in JDK 19.
One frequent use case my library handles is reading (null-terminated) Strings from native memory. For most *nix applications, these are "C Strings" and the `MemorySegment.getUtf8String()` method is sufficient to the task.
Native Windows Strings, however, are stored in UTF-16 (LE). Referenced as arrays of `TCHAR` or as "Wide Strings", they are treated similarly to "C Strings" except that each code unit consumes 2 bytes. JNA provides a `Native.getWideString()` method for this purpose, which invokes native code to efficiently iterate over the appropriate character set.
I don't see a UTF-16 equivalent to `getUtf8String()` (and the corresponding `set...()`) optimized for these Windows-based applications.
I can work around the problem with a few approaches:
- Create a `new String(bytes, StandardCharsets.UTF_16LE)` and:
  - use `trim()`, or
  - `split()` on the null delimiter and extract the first element.
- Or (rather than instantiating a `byte[]`) I can iterate character-by-character looking for the null.

While I certainly wouldn't expect the JDK to provide native implementations for every character set, I would think that Windows represents a significant enough usage share to support its primary native encoding alongside the UTF-8 convenience methods. Is there a method to do this that I haven't discovered yet? Or are there any better alternatives than the `new String()` or character-based iteration approaches I've described?
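
For reference, the fixed-buffer workaround from the list above can be sketched as follows. The class and method names are hypothetical, and `indexOf('\0')` stands in for `split()` because it avoids building an array of substrings.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK or JNA.
final class FixedBufferWideString {
    /** Decodes the whole UTF-16LE buffer, then keeps only the text before the first NUL. */
    static String fromFixedBuffer(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_16LE);
        int nul = decoded.indexOf('\0');
        // No terminator found: return everything that was decoded.
        return nul < 0 ? decoded : decoded.substring(0, nul);
    }
}
```

Note that `trim()` is only equivalent when everything after the terminator is zeroed, and it also strips leading and trailing whitespace that may be part of the string.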
A charset decoder provides a way to convert a null-terminated wide (UTF-16LE) `MemorySegment` to a `String` on Windows using the Foreign Memory API. This may not be any different from, or an improvement on, your workaround suggestions, as it involves scanning the resulting character buffer for the null position.
    import java.lang.foreign.MemorySegment;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public static String toJavaString(MemorySegment wide) {
        return toJavaString(wide, StandardCharsets.UTF_16LE);
    }

    public static String toJavaString(MemorySegment segment, Charset charset) {
        // JDK 19 Panama only handles UTF-8: it does a strlen()-style scan for a 0 byte,
        // which is valid because every byte of a multi-byte UTF-8 sequence has its high bit set.
        if (StandardCharsets.UTF_8 == charset)
            return segment.getUtf8String(0);
        // if (StandardCharsets.UTF_16LE == charset) {
        //     return Holger answer
        // }
        // This conversion is convoluted: MemorySegment->ByteBuffer->CharBuffer->String
        CharBuffer cb = charset.decode(segment.asByteBuffer());
        // cb.array() isn't valid unless cb.hasArray() is true, so use cb.get() to
        // find the null terminator character, ignoring it and the remaining characters
        final int max = cb.limit();
        int len = 0;
        while (len < max && cb.get(len) != '\0')
            len++;
        return cb.limit(len).toString();
    }
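
A possible variation, sketched here under the assumption that the data is UTF-16LE, scans the bytes for the 16-bit terminator first and decodes only the prefix, so the bytes after the terminator are never decoded. The names `WideStringScan` and `readWide` are hypothetical; the `ByteBuffer` could come from `segment.asByteBuffer()` as in the code above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: scan first, then decode only what precedes the terminator.
final class WideStringScan {
    /**
     * Reads a null-terminated UTF-16LE string starting at the buffer's position.
     * If no terminator is found, the whole (even-length) remainder is decoded.
     */
    static String readWide(ByteBuffer source) {
        // Work on a duplicate so the caller's position and limit are untouched.
        ByteBuffer buf = source.duplicate();
        int end = buf.position();
        // Step two bytes at a time; a zero 16-bit code unit is the terminator
        // (zero is zero in either byte order, so the buffer's order is irrelevant here).
        while (end + 1 < buf.limit() && buf.getShort(end) != 0) {
            end += 2;
        }
        buf.limit(end);
        return StandardCharsets.UTF_16LE.decode(buf).toString();
    }
}
```

This still walks the memory from Java rather than native code, so it is an alternative shape for the same workaround rather than a faster primitive.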
Going the other way, `String` -> null-terminated Windows wide `MemorySegment`:
    import java.lang.foreign.SegmentAllocator;
    import static java.lang.foreign.ValueLayout.JAVA_BYTE;

    public static MemorySegment toCString(SegmentAllocator allocator, String s, Charset charset) {
        // "==" is OK here as StandardCharsets.UTF_8 == Charset.forName("UTF8")
        if (StandardCharsets.UTF_8 == charset)
            return allocator.allocateUtf8String(s);
        // else if (StandardCharsets.UTF_16LE == charset) {
        //     return Holger answer
        // }
        // For multi-byte charsets it is safer to append the terminator '\0' to the String and
        // let the charset encoder emit the appropriate null bytes (typically 1, 2 or 4 bytes)
        return allocator.allocateArray(JAVA_BYTE, (s + "\0").getBytes(charset));
    }

    /** Convert Java String to Windows Wide String format */
    public static MemorySegment toWideString(String s, SegmentAllocator allocator) {
        return toCString(allocator, s, StandardCharsets.UTF_16LE);
    }
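
To illustrate why appending `"\0"` before encoding is convenient, the small sketch below isolates that step so the terminator width can be inspected per charset. `TerminatedBytes` and `encodeTerminated` are hypothetical names used only for this illustration.

```java
import java.nio.charset.Charset;

// Hypothetical helper isolating the (s + "\0").getBytes(charset) step.
final class TerminatedBytes {
    /** Encodes s plus a trailing NUL so the charset decides how wide the terminator is. */
    static byte[] encodeTerminated(String s, Charset charset) {
        return (s + "\0").getBytes(charset);
    }
}
```

For the two-character string `"Hi"` this yields 3 bytes in UTF-8, 6 bytes in UTF-16LE and 12 bytes in UTF-32LE. One caveat: `StandardCharsets.UTF_16` (without the LE/BE suffix) writes a byte-order mark when encoding, so the explicit `UTF_16LE` is the one to use for Windows wide strings.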
Like you, I'd also like to know if there are better approaches than the above.
JDK22 Update
JDK 22 supports conversion for the `StandardCharsets.XXX` charsets, so conversion from a Java `String` to a `MemorySegment` is simply:

    var seg = arena.allocateFrom(str, charset);

A fallback for other character sets uses the approach of appending `\0`:

    var seg = arena.allocateFrom(JAVA_BYTE, (s+"\0").getBytes(charset));
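
As a rough sketch of the full round trip on JDK 22 or later: as far as I know, `MemorySegment.getString(long, Charset)` is the matching read method for the standard charsets. The class and method names below are hypothetical, and the example needs JDK 22+ to compile.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; requires JDK 22+, where the FFM API is final.
final class Jdk22WideStrings {
    /** Writes s as a null-terminated UTF-16LE string in native memory and reads it back. */
    static String roundTrip(String s) {
        try (Arena arena = Arena.ofConfined()) {
            // allocateFrom appends the charset-appropriate terminator.
            MemorySegment seg = arena.allocateFrom(s, StandardCharsets.UTF_16LE);
            // getString stops at the terminator and returns a heap String,
            // so the result stays valid after the arena is closed.
            return seg.getString(0, StandardCharsets.UTF_16LE);
        }
    }
}
```

For a segment obtained from a native pointer, the segment would need a suitable size before calling `getString`; that part is left out of this sketch.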