[SOLVED] EBCDIC (Cp1047) junit 0D25

I'm working with some data encoded in EBCDIC and I'm required to substitute some of the values.

One of the values is the hex byte pair 0D 25, which has to become 40 40.

I understand this to mean that the EBCDIC characters are as follows:

0D: carriage return
25: line feed

When this pair is found in a String with this encoding, it must be replaced with two spaces (hex 40, twice). See also the Wikipedia EBCDIC character table.

I understand 0D (carriage return) to be equivalent to \r and 25 (line feed) is (should be?) \n in UTF-16. Therefore:

assertThat(minimalExample("\r\n")).isEqualTo("  ");

should pass, given a minimal method defined as follows

String minimalExample(String raw) throws Exception {
    byte[] bytes = raw.getBytes("Cp1047");
    if (bytes[0] == 0x0d && bytes[1] == 0x25) {
        bytes[0] = 0x40;
        bytes[1] = 0x40;
    }
    return new String(bytes, "Cp1047");
}

What I end up getting though is

Expecting:
 <"
">
to be equal to:
 <" ">
but was not.

This happens because the second byte of the encoded string is 15 (hex), not 25 as I expected.

It seems that 25 is represented in UTF-16 as \u0085.

Is this correct?

Solution

The EBCDIC newline character got a separate Unicode code point, NEL = U+0085. Some Java APIs also treat it as a line separator, for instance java.util.Scanner and the regex \R. BufferedReader.readLine does not; it only recognizes \n, \r and \r\n.

With Java's Cp1047, EBCDIC LF (0x25) is not \n but \u0085. In UTF-16, \n is always \u000A, and that is what encodes to the 0x15 you are seeing.
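
The mapping can be checked directly. The following is a minimal sketch, assuming the JVM provides the Cp1047 charset (it is part of the extended charsets in standard JDK builds). The class name Cp1047Newlines and the helper toHex are made up for illustration, and the byte values in the comments are the ones reported in this thread.

```java
import java.nio.charset.Charset;

public class Cp1047Newlines {
    // Assumption: Cp1047 is available in this JVM.
    static final Charset CP1047 = Charset.forName("Cp1047");

    // Encode a String to Cp1047 and render the bytes as space-separated hex.
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(CP1047)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\r\n" encodes to 0D 15, as observed in the question.
        System.out.println(toHex("\r\n"));
        // "\r\u0085" is the String that encodes to 0D 25.
        System.out.println(toHex("\r\u0085"));
        // Decoding the raw bytes 0D 25 yields CR followed by NEL (U+0085).
        String decoded = new String(new byte[] {0x0D, 0x25}, CP1047);
        System.out.println(decoded.equals("\r\u0085"));
    }
}
```

So a String-based test for the 0D 25 pair would have to use "\r\u0085" as input rather than "\r\n".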

The unit test should be done on byte[] rather than on String literals. It should also check that the code reading lines does not read an extra empty line for the EBCDIC bytes 0D 25.
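
A byte-level version of the substitution might look like the sketch below. The names CrLfToSpaces and replaceCrLf and the constants are hypothetical. It assumes the whole buffer is single-byte EBCDIC text, so that every 0D 25 pair really is a CR/LF and not part of a binary or packed field.

```java
import java.util.Arrays;

public class CrLfToSpaces {
    static final byte CR = 0x0D;     // EBCDIC carriage return
    static final byte LF = 0x25;     // EBCDIC line feed
    static final byte SPACE = 0x40;  // EBCDIC space

    // Returns a copy of raw in which every 0D 25 pair is replaced by 40 40.
    static byte[] replaceCrLf(byte[] raw) {
        byte[] out = Arrays.copyOf(raw, raw.length);
        for (int i = 0; i + 1 < out.length; i++) {
            if (out[i] == CR && out[i + 1] == LF) {
                out[i] = SPACE;
                out[i + 1] = SPACE;
                i++; // skip the second byte of the pair just replaced
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] in = {(byte) 0xC1, 0x0D, 0x25, (byte) 0xC2}; // A, CR, LF, B
        byte[] expected = {(byte) 0xC1, 0x40, 0x40, (byte) 0xC2};
        System.out.println(Arrays.equals(replaceCrLf(in), expected));
    }
}
```

Working on the bytes sidesteps the question of which Unicode character each EBCDIC control code decodes to, and the test input states the bytes exactly as they appear in the data.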

In a regular expression you can use "\\R" (Java 8 and later) to match any of the various line separators, including \u0085:

s = s.replaceAll("\\R", ".");
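
One detail worth knowing is that \R treats \r\n as a single line break, but \r followed by \u0085 as two separate ones. The sketch below illustrates this; the class name LinebreakRegex and the method markBreaks are made up for the example.

```java
public class LinebreakRegex {
    // Replace every line break matched by \R with a visible dot.
    static String markBreaks(String s) {
        return s.replaceAll("\\R", ".");
    }

    public static void main(String[] args) {
        System.out.println(markBreaks("a\r\nb"));     // a.b  (CR LF is one match)
        System.out.println(markBreaks("a\u0085b"));   // a.b  (NEL alone is one match)
        System.out.println(markBreaks("a\r\u0085b")); // a..b (CR and NEL match separately)
    }
}
```

If the goal is only the two-space substitution on an already decoded String, a literal s.replace("\r\u0085", "  ") targets exactly the decoded 0D 25 pair without touching other line separators.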