How can I convert characters in Java from Extended ASCII or Unicode to their 7-bit ASCII equivalents, including special characters like the open (“ 0x93) and close (” 0x94) quotes to a simple double quote (" 0x22), or similarly a dash (– 0x96) to a hyphen-minus (- 0x2D)? I have found Stack Overflow questions similar to this, but the answers only seem to deal with accents and ignore special characters.
For example, I would like “Caffè – Peña” to be transformed to "Caffe - Pena".
However when I use java.text.Normalizer:
String sample = "“Caffè – Peña”";
System.out.println(Normalizer.normalize(sample, Normalizer.Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}", ""));
The output is “Caffe – Pena”: the accents are stripped, but the quotes and the dash survive, because NFD only separates out combining marks, and characters like “ and – have no decomposition for the regex to remove.
To clarify my need, I am interacting with an IBM i Db2 database that uses EBCDIC encoding. If a user pastes a string copied from Word or Outlook for example, characters like the ones I specified are translated to SUB (0x3F in EBCDIC, 0x1A in ASCII). This causes a lot of unnecessary headache. I am looking for a way to sanitize the string so as little information as possible is lost.
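As a quick way to see which characters will be degraded, here is a minimal sketch (an assumption on my part: it uses the JRE's IBM037 charset, which most full JDKs ship, as a stand-in for jt400's CCSID 37 conversion; the class name is mine):

import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class Ccsid37Check {
    public static void main(String[] args) {
        // IBM037 is the JDK's name for EBCDIC CCSID 37; this only
        // approximates what jt400 does, but it flags the same losses.
        CharsetEncoder encoder = Charset.forName("IBM037").newEncoder();
        String sample = "“Caffè – Peña”";
        for (char c : sample.toCharArray()) {
            if (!encoder.canEncode(c)) {
                // Unmappable characters are the ones that come back as SUB
                System.out.printf("U+%04X %c is not representable in CCSID 37%n", (int) c, c);
            }
        }
    }
}

With the sample string above, this flags “ (U+201C), – (U+2013), and ” (U+201D), while è and ñ pass because CCSID 37 covers the Latin-1 repertoire.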
After some digging, I was able to find a solution based on this answer using org.apache.lucene.analysis.ASCIIFoldingFilter.
All the examples I was able to find used the static version of the foldToASCII method, as in this project:
private static String getFoldedString(String text) {
    char[] textChar = text.toCharArray();
    char[] output = new char[textChar.length * 4];
    int outputPos = ASCIIFoldingFilter.foldToASCII(textChar, 0, output, 0, textChar.length);
    text = new String(output, 0, outputPos);
    return text;
}
However, that static method has a note on it saying:
This API is for internal purposes only and might change in incompatible ways in the next release.
So after some trial and error I came up with this version that avoids using the static method:
public static String getFoldedString(String text) throws IOException {
    String output = "";
    try (Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer(KeywordTokenizerFactory.class)
            .addTokenFilter(ASCIIFoldingFilterFactory.class)
            .build()) {
        try (TokenStream ts = analyzer.tokenStream(null, new StringReader(text))) {
            CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            if (ts.incrementToken()) output = charTermAtt.toString();
            ts.end();
        }
    }
    return output;
}
Similar to an answer I provided here.
This does exactly what I was looking for and translates characters to their 7-bit ASCII equivalents.
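For example, assuming the lucene-analyzers-common dependency is on the classpath:

// Sample usage; getFoldedString declares IOException, so handle or rethrow it.
String folded = getFoldedString("“Caffè – Peña”");
System.out.println(folded); // prints: "Caffe - Pena"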
However, through further research I found that because I am mostly dealing with Windows-1252 encoding, and because of the way jt400 handles the ASCII <-> EBCDIC (CCSID 37) translation, if an ASCII string is translated to EBCDIC and back to ASCII, the only characters that are lost are 0x80 through 0x9f. So, inspired by the way Lucene's foldToASCII handles them, I put together the following method that handles these cases only:
public static String replaceInvalidChars(String text) {
    char[] input = text.toCharArray();
    int length = input.length;
    char[] output = new char[length * 6];
    int outputPos = 0;
    for (int pos = 0; pos < length; pos++) {
        final char c = input[pos];
        if (c < '\u0080') {
            output[outputPos++] = c;
        } else {
            switch (c) {
                case '\u20ac': //€ 0x80
                    output[outputPos++] = 'E';
                    output[outputPos++] = 'U';
                    output[outputPos++] = 'R';
                    break;
                case '\u201a': //‚ 0x82
                    output[outputPos++] = '\'';
                    break;
                case '\u0192': //ƒ 0x83
                    output[outputPos++] = 'f';
                    break;
                case '\u201e': //„ 0x84
                    output[outputPos++] = '"';
                    break;
                case '\u2026': //… 0x85
                    output[outputPos++] = '.';
                    output[outputPos++] = '.';
                    output[outputPos++] = '.';
                    break;
                case '\u2020': //† 0x86
                    output[outputPos++] = '?';
                    break;
                case '\u2021': //‡ 0x87
                    output[outputPos++] = '?';
                    break;
                case '\u02c6': //ˆ 0x88
                    output[outputPos++] = '^';
                    break;
                case '\u2030': //‰ 0x89
                    output[outputPos++] = 'p';
                    output[outputPos++] = 'e';
                    output[outputPos++] = 'r';
                    output[outputPos++] = 'm';
                    output[outputPos++] = 'i';
                    output[outputPos++] = 'l';
                    break;
                case '\u0160': //Š 0x8a
                    output[outputPos++] = 'S';
                    break;
                case '\u2039': //‹ 0x8b
                    output[outputPos++] = '\'';
                    break;
                case '\u0152': //Œ 0x8c
                    output[outputPos++] = 'O';
                    output[outputPos++] = 'E';
                    break;
                case '\u017d': //Ž 0x8e
                    output[outputPos++] = 'Z';
                    break;
                case '\u2018': //‘ 0x91
                    output[outputPos++] = '\'';
                    break;
                case '\u2019': //’ 0x92
                    output[outputPos++] = '\'';
                    break;
                case '\u201c': //“ 0x93
                    output[outputPos++] = '"';
                    break;
                case '\u201d': //” 0x94
                    output[outputPos++] = '"';
                    break;
                case '\u2022': //• 0x95
                    output[outputPos++] = '-';
                    break;
                case '\u2013': //– 0x96
                    output[outputPos++] = '-';
                    break;
                case '\u2014': //— 0x97
                    output[outputPos++] = '-';
                    break;
                case '\u02dc': //˜ 0x98
                    output[outputPos++] = '~';
                    break;
                case '\u2122': //™ 0x99
                    output[outputPos++] = '(';
                    output[outputPos++] = 'T';
                    output[outputPos++] = 'M';
                    output[outputPos++] = ')';
                    break;
                case '\u0161': //š 0x9a
                    output[outputPos++] = 's';
                    break;
                case '\u203a': //› 0x9b
                    output[outputPos++] = '\'';
                    break;
                case '\u0153': //œ 0x9c
                    output[outputPos++] = 'o';
                    output[outputPos++] = 'e';
                    break;
                case '\u017e': //ž 0x9e
                    output[outputPos++] = 'z';
                    break;
                case '\u0178': //Ÿ 0x9f
                    output[outputPos++] = 'Y';
                    break;
                default:
                    output[outputPos++] = c;
                    break;
            }
        }
    }
    return new String(output, 0, outputPos);
}
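As a quick check with the question's sample string (note how, unlike the full Lucene folding, the accents survive, since only the Windows-1252 0x80 through 0x9f range is rewritten):

String sanitized = replaceInvalidChars("“Caffè – Peña”");
System.out.println(sanitized); // prints: "Caffè - Peña"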
Since it turns out that my real problem was Windows-1252 to Latin-1 (ISO-8859-1) translation, here is supporting material that shows the Windows-1252 to Unicode mapping used in the method above to ultimately arrive at Latin-1 encoding.