While exploring an experimental feature, I found some astounding behaviour when trying to load a UTF-8 Unix file on a mainframe system using COBOL, with the FD record declared as a UTF-8 (Unicode) data item.
If my record length is 10 characters (i.e. 01 chunk PIC U(10)), the first 10 characters are loaded correctly. Then 30 characters (apparently three times the record length, judging from my experiments) are skipped, and the next 10 characters are read into my next record, and so on.
Source code of my program:
IDENTIFICATION DIVISION.
PROGRAM-ID. loadutf8.
ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
FILE-CONTROL.
    SELECT XMLFILE ASSIGN TO "XML".
DATA DIVISION.
FILE SECTION.
FD  XMLFILE RECORDING MODE F.
01  chunk PIC U(10).
WORKING-STORAGE SECTION.
01  EOF PIC X.
PROCEDURE DIVISION.
START-PROGRAM.
    OPEN INPUT XMLFILE.
    PERFORM WITH TEST AFTER UNTIL EOF = 'T'
        READ XMLFILE
            AT END
                MOVE 'T' TO EOF
            NOT AT END
                DISPLAY FUNCTION DISPLAY-OF (chunk)
        END-READ
    END-PERFORM.
    CLOSE XMLFILE.
    GOBACK.
END PROGRAM loadutf8.
JCL:
//COBOL EXEC IGYWCLG,LNGPRFX=IGY630
//SYSIN DD DISP=SHR,DSN=COB.SRC(loadutf8)
//GO.XML DD PATH='/u/utf8.xml'
My UTF-8 file:
<?xml ?>
<!-- 0 --><!-- 1 --><!-- 2 --><!-- 3 --><!-- 4 --><!-- 5 --><x>???</x>
Output observed:
<?xml ?>
<!-- 3 -->
To me, it looks like the program consistently reads one chunk of the defined size, skips three chunks' worth of data, reads the next chunk, and so on.
What could be causing this?
Is there a best practice for this, i.e. how to load a Unix XML file into a variable declared as UTF-8? Preferably without any hacks, just using 'standard' language features.
I'm asking this mostly out of curiosity; any explanation of the observed outcome is appreciated.
Native UTF-8 support seems to have been introduced with IBM Enterprise COBOL V6.3. I don't have experience with it, but from reading the manual I can explain what happens. I cannot, however, say whether this is desired behaviour or a bug.
Anyway, in the Programming Guide (V6.4), topic "Defining UTF-8 data items", one can read:
Fixed character-length UTF-8 data items.
This type of UTF-8 data item is defined when the PICTURE clause contains one or more 'U' characters, or a single 'U' character followed by a repetition factor, and neither the BYTE-LENGTH phrase of the PICTURE clause nor the DYNAMIC LENGTH clause is specified.
and further:
For fixed character-length UTF-8 data items, the number of bytes reserved for the data item in memory is 4 × n, where n is the number of characters specified in the definition of the item. Note that, due to the varying length nature of the UTF-8 encoding, even after moving n characters to a UTF-8 data item of length n, it is not necessarily the case that all 4 × n reserved bytes are needed to hold the data. It depends on the size of each character in the data.
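To make that concrete, here is a minimal sketch (the data names are mine, and I am assuming that the LENGTH OF special register reports the reserved size in bytes; the BYTE-LENGTH variant is the other form of UTF-8 item the quoted topic mentions):

    01  chunk-chars  PIC U(10).             *> reserves 4 x 10 = 40 bytes
    01  chunk-bytes  PIC U BYTE-LENGTH 10.  *> fixed byte-length item: exactly 10 bytes
    ...
        DISPLAY LENGTH OF chunk-chars       *> expected: 40
        DISPLAY LENGTH OF chunk-bytes       *> expected: 10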
In the chapter on Processing QSAM files, one can read:
You can also access byte-stream files in the z/OS UNIX file system using QSAM. These files are binary byte-oriented sequential files with no record structure. The record definitions that you code in your COBOL program and the length of the variables that you read into and write from determine the amount of data transferred.
I conclude from this that COBOL is simply telling the underlying I/O routine (QSAM) to read as many bytes as are reserved for the receiving variable, which in your example is 40 bytes at a time. After all, QSAM does not support interpreting data as it is being read; it simply reads a given number of bytes, not characters, and places them in the input buffer.
The bytes are interpreted as UTF-8 characters only when the variable is used later, such as in the DISPLAY statement, and only then is the defined length of the variable, counted in UTF-8 characters, respected. That explains the observed pattern: each READ transfers the next 40 bytes of the file into the record, but DISPLAY shows only the first 10 characters of them, so about 30 bytes per record appear to be skipped.
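If you want to see what actually arrives in the record area, one sketch (untested, names are mine) would be to add a second record description to the FD, since multiple 01 entries under an FD implicitly share the same storage; FUNCTION NATIONAL-OF with CCSID 1208 then interprets the raw bytes as UTF-8 so DISPLAY-OF can convert them for display:

    FD  XMLFILE RECORDING MODE F.
    01  chunk      PIC U(10).
    01  chunk-raw  PIC X(40).  *> implicitly shares the same 40-byte record area
    ...
        READ XMLFILE
        *> treat all 40 raw bytes as UTF-8 (CCSID 1208) and convert for DISPLAY
        DISPLAY FUNCTION DISPLAY-OF (FUNCTION NATIONAL-OF (chunk-raw, 1208))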
I did a quick test, reading a file that contains characters needing more than one byte in UTF-8, and the data displayed was shifted accordingly.
I am not sure yet how to successfully process UTF-8 UNIX files with COBOL.
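One possible workaround, going purely by the BYTE-LENGTH phrase quoted above and therefore untested, might be to give the record a fixed byte length, so that each READ transfers exactly as many bytes as one record should hold. Be aware that this still splits the file blindly at byte boundaries, so a multi-byte UTF-8 character could end up divided between two records:

    FD  XMLFILE RECORDING MODE F.
    01  chunk PIC U BYTE-LENGTH 10.  *> reserves, and should read, exactly 10 bytes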