I'm currently investigating how SBCS/DBCS decoding works in the JDK, and I've stumbled upon a weird piece of code in the IBM930
charset implementation (although it's not the only one).
First of all - per my understanding - JDK implementers use mapping files to generate most charset classes. For example:
IBM930.map
IBM930.nr non-roundtrip bytes override
IBM930.c2b non-roundtrip codepoints override
are the files the DBCS utility interprets to generate IBM930.java.
If we look into IBM930.nr, we see:
25 000a
Which means byte 0x25 must map to \u000a.
If we now look into IBM930.map, we see:
...
24 0084
25 000A <---
26 0017
...
So, the non-roundtrip override has already been specified in the main .map file.
If we open IBM930.java, we can see at the very bottom:
    static class EncodeHolder {
        static final char[] c2b = new char[0x7400];
        static final char[] c2bIndex = new char[0x100];
        static {
            String b2cNR = "\u0025\n";
            String c2bNR = ...
            DoubleByte.Encoder.initC2B(DecodeHolder.b2cStr, DecodeHolder.b2cSBStr,
                                       b2cNR, c2bNR,
                                       0x40, 0xfe,
                                       c2b, c2bIndex);
        }
    }
Specifically, I'm pointing to String b2cNR = "\u0025\n".
Given the main .map file already contains NR overrides, why does the generation process generate a non-null b2cNR anyway?
Is it because not all .map files are adjusted to include .nr entries?
Or am I missing a very specific behavior of the initC2B method?
If we look into IBM930.nr, we see: 25 000a
Yes.
Which means byte 0x25 must map to \u000a.
No, not really. Certainly it's not how that data is used.
The source for the relevant sun.nio.cs.DoubleByte class can be found here: https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/sun/nio/cs/DoubleByte.java. If you trace what DoubleByte.Encoder.initC2B() does with the data coming from that file via b2cNR, you will see that it is not used to define a decoding of 0x25 to \u000a. Rather, it is used to ensure that the mapped decoding of 0x25 to \u000a (from the .map file) is not used to also define an encoding from \u000a back to 0x25. And if there were no such decoding mapped, then the b2cNR item and the corresponding .nr entry would have no effect.
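To make that concrete, here is a minimal, single-byte-only sketch of the idea. It is not the JDK's code: the class name, the UNMAPPABLE marker, the HashMap, and the toy table are all made up for illustration, and the 0x15 entry is an assumption added so that U+000A still has some byte to encode to.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class NonRoundTripSketch {
    // Hypothetical marker for "this byte decodes to nothing".
    static final char UNMAPPABLE = '\uFFFD';

    // Toy byte -> char table. Entries 0x24..0x26 come from the .map excerpt
    // in the question; the 0x15 entry is an assumption for illustration.
    static char[] toyTable() {
        char[] b2c = new char[0x100];
        Arrays.fill(b2c, UNMAPPABLE);
        b2c[0x15] = '\n';      // assumed second byte that decodes to U+000A
        b2c[0x24] = '\u0084';
        b2c[0x25] = '\n';      // the "25 000A" line
        b2c[0x26] = '\u0017';
        return b2c;
    }

    // Builds a char -> byte table by inverting b2c, skipping every mapping
    // flagged in nr. nr is a flat string of (byte, char) pairs.
    static Map<Character, Integer> buildC2B(char[] b2c, String nr) {
        char[] work = b2c.clone();          // the decode table stays untouched
        for (int i = 0; i + 1 < nr.length(); i += 2) {
            char b = nr.charAt(i);
            char c = nr.charAt(i + 1);
            // Blank the entry only if it matches the flagged pair; an NR
            // entry with no matching decode mapping has no effect.
            if (b < work.length && work[b] == c) {
                work[b] = UNMAPPABLE;
            }
        }
        Map<Character, Integer> c2b = new HashMap<>();
        for (int b = 0; b < work.length; b++) {
            if (work[b] != UNMAPPABLE) {
                c2b.put(work[b], b);        // later bytes overwrite earlier ones
            }
        }
        return c2b;
    }

    public static void main(String[] args) {
        char[] b2c = toyTable();
        String nr = "\u0025\n";             // same shape as b2cNR in IBM930.java
        System.out.printf("without NR: U+000A -> 0x%02x%n", buildC2B(b2c, "").get('\n'));
        System.out.printf("with NR:    U+000A -> 0x%02x%n", buildC2B(b2c, nr).get('\n'));
        System.out.printf("decoding:   0x25 -> U+%04X%n", (int) b2c[0x25]);
    }
}
```

In this toy setup, inverting the table without the NR pair leaves U+000A encoding to 0x25, because the later byte wins. With the NR pair it encodes to 0x15 instead, while 0x25 still decodes to U+000A in both cases.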
Given the main .map file already contains NR overrides, why does the generation process generate a non-null b2cNR anyway?
The NR data are not well characterized as overrides, or at least not as mapping overrides. Rather, they flag mappings that are one-way, in the bytes-to-character (decoding) direction. You might think, then, that repeating the full mapping instead of just giving the bytes would be redundant, and perhaps you would be right, but giving the whole mapping provides a consistency check, and perhaps also is convenient in the event that there is a different character that is encoded to those bytes.
Is it because not all .map files are adjusted to include .nr entries?
It is because you misunderstand the significance of the .nr entries. The .map file is expected to provide all the bytes <--> character correspondences. The .nr entries flag some of those mappings as unidirectional.