character-encodingdb2-400

How to recover original characters from data with broken characters?


We have a bunch of translations loaded in our Database which are showing broken characters where there should be accented latin characters like 'É'. I'm not sure where the encoding went wrong but we are now seeing data like 'CASQUE INTÃGRAL 100-SERIES 3.0' which obviously is not what we want.

I'm fairly certain the field is UTF-8 and I'm sure to keep it that way when reading the data. I've tried a few other combinations of encoding but all show the broken characters.

My question is whether it is possible to get this string back to a state before the broken characters occurred. Is the data still there, just in the incorrect encoding or did saving it in the incorrect encoding destroy the original character data?

To be clear, is it possible to convert 'CASQUE INTÃGRAL 100-SERIES 3.0' back to 'CASQUE INTÉGRAL 100-SERIES 3.0', or will this data need to be reloaded with the correct encoding?

The database in question is Db2 on AS400.


Solution

  • Short answer...yes you'll going to have to reload the data.

    Long answer, it'd be unusual in my experience to have a Db2 for IBM i table with CCSID(1208) (aka UTF-8) columns. Unless it's a relatively recently added table. Even now, my experience (granted in the US) has been than many continue to use EBCDIC CCSIDs by default.

    So it's likely however the table was loaded, conversion from UTF-8 to the assigned EBCDIC CCSID wasn't handled properly. Judging by your comments, you're going to need some assistance from the system admins and/or RPG/COBOL developers.

    UPDATE: CCSID(37) is US English
    I believe that should be able to handle INTÉGRAL so you'll need to look at how the data was loaded.

    This may be a useful presentation for you...
    What's With These ASCII, EBCDIC, Unicode CCSIDs?