Convert a string representation of a hexadecimal byte array to a string with non-ASCII characters in Java


I have a String being sent in the request payload by a client as:

"[0xc3][0xa1][0xc3][0xa9][0xc3][0xad][0xc3][0xb3][0xc3][0xba][0xc3][0x81][0xc3][0x89][0xc3][0x8d][0xc3][0x93][0xc3][0x9a]Departms"

I want to get a String which is "áéíóúÁÉÍÓÚDepartms". How can I do this in Java?

The problem is that I have no control over how the client encodes this string. It seems the client encodes only the non-ASCII characters in this format and sends the ASCII characters as they are (see 'Departms' at the end).


Solution

  • The content within the square brackets appears to be UTF-8 encoded characters whose bytes were each written out as a hexadecimal literal. For example, [0xc3][0xa1] is the two-byte UTF-8 sequence for 'á'. You can find each token that looks like [0xc3], convert it to the corresponding byte, and then create a new string from the resulting bytes.

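    Unfortunately, Java's standard library has no regex-style find-and-replace for byte arrays. Here's a quick-and-dirty solution that uses a regex to replace each hex code with the Latin-1 character that has the same code point, and then fixes the result by encoding it back to bytes as Latin-1 and re-interpreting those bytes as UTF-8. Note that any literal character outside Latin-1 in the input would be turned into '?' by this round trip, so it only works if the unbracketed text is plain ASCII (or Latin-1).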

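    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Requires Java 11+ (Character.toString(int), appendReplacement with StringBuilder)
    String bracketDecode(String str) {
        // Match tokens such as [0xc3]; accept upper- and lower-case hex digits
        Pattern p = Pattern.compile("\\[(0x[0-9a-fA-F]{2})\\]");
        Matcher m = p.matcher(str);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            int value = Integer.decode(m.group(1));
            // Temporarily treat the byte as the Latin-1 character with the same code point.
            // quoteReplacement stops '$' (0x24) and '\' (0x5c) from being read as group syntax.
            m.appendReplacement(sb, Matcher.quoteReplacement(Character.toString(value)));
        }
        m.appendTail(sb);
        // Latin-1 maps chars 0-255 back to identical byte values, recovering the raw bytes
        byte[] bytes = sb.toString().getBytes(StandardCharsets.ISO_8859_1);
        // Decode those raw bytes as what they really are: UTF-8
        return new String(bytes, StandardCharsets.UTF_8);
    }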
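
If the unbracketed text might contain characters outside Latin-1, one possible variation is to collect the bytes directly instead of going through an intermediate Latin-1 string. The following is only a sketch under that assumption: the class name `BracketHexDecoder` and the method name `decode` are made up for illustration, and it assumes the literal text between tokens should be treated as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names here are illustrative, not part of any library.
final class BracketHexDecoder {
    // Matches tokens such as [0xc3] and captures the two hex digits
    private static final Pattern HEX_TOKEN = Pattern.compile("\\[0x([0-9a-fA-F]{2})\\]");

    static String decode(String input) {
        Matcher m = HEX_TOKEN.matcher(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        while (m.find()) {
            // Copy the literal text before this token as UTF-8 bytes (assumption)
            byte[] literal = input.substring(last, m.start()).getBytes(StandardCharsets.UTF_8);
            out.write(literal, 0, literal.length);
            // Append the single byte that the token stands for
            out.write(Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        // Copy whatever follows the final token
        byte[] tail = input.substring(last).getBytes(StandardCharsets.UTF_8);
        out.write(tail, 0, tail.length);
        // Decode the whole byte sequence as UTF-8 in one pass
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

With this sketch, `BracketHexDecoder.decode("[0xc3][0xa1]Departms")` should give "áDepartms". Malformed byte sequences are not rejected; `new String(bytes, UTF_8)` replaces them with U+FFFD, so stricter validation would need a `CharsetDecoder`.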