I have a String being sent in the request payload by a client as:
"[0xc3][0xa1][0xc3][0xa9][0xc3][0xad][0xc3][0xb3][0xc3][0xba][0xc3][0x81][0xc3][0x89][0xc3][0x8d][0xc3][0x93][0xc3][0x9a]Departms"
I want to get a String which is "áéíóúÁÉÍÓÚDepartms". How can I do this in Java?
The problem is that I have no control over the way client encodes this string. Seems like the client is just encoding the non-ascii characters in this format and sends the ascii chars as it is(see 'Departms' at the end).
The stuff within the square brackets, seems to be characters encoded in UTF-8 but converted into a hexadecimal string in a weird way. What you can do is find each instance that looks like [0xc3]
and convert it into the corresponding byte, and then create a new string from the bytes.
Unfortunately there are no good tools for working with byte arrays. Here's a quick and dirty solution that uses regex to find and replace these hex codes with the corresponding character in latin-1, and then fixes that by re-interpreting the bytes.
String bracketDecode(String str) {
Pattern p = Pattern.compile("\\[(0x[0-9a-f]{2})\\]");
Matcher m = p.matcher(str);
StringBuilder sb = new StringBuilder();
while (m.find()) {
String group = m.group(1);
Integer decode = Integer.decode(group);
// assume latin-1 encoding
m.appendReplacement(sb, Character.toString(decode));
}
m.appendTail(sb);
// oh no, latin1 is not correct! re-interpret bytes in utf-8
byte[] bytes = sb.toString().getBytes(StandardCharsets.ISO_8859_1);
return new String(bytes, StandardCharsets.UTF_8);
}