Tags: java, string, surrogate-pairs

How to remove surrogate characters in Java?


I am getting surrogate characters in text that I save to MySQL 5.1. Since that version does not support them, I want to strip these surrogate pairs with a Java method before saving the text to the database.

I have written the following method for now and I am curious to know if there is a direct and optimal way to handle this.

Thanks in advance for your help.

public static String removeSurrogates(String query) {
    StringBuffer sb = new StringBuffer();
    for (int i = 0; i < query.length() - 1; i++) {
        char firstChar = query.charAt(i);
        char nextChar = query.charAt(i+1);
        if (Character.isSurrogatePair(firstChar, nextChar) == false) {
            sb.append(firstChar);
        } else {
            i++;
        }
    }
    if (Character.isHighSurrogate(query.charAt(query.length() - 1)) == false
            && Character.isLowSurrogate(query.charAt(query.length() - 1)) == false) {
        sb.append(query.charAt(query.length() - 1));
    }

    return sb.toString();
}

Solution

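  • A couple of things: StringBuilder is preferable to StringBuffer here, because the buffer never leaves the method and needs no synchronization; x == false reads better as !x; and if the goal is to strip surrogates, there is no need to check for pairs, since every surrogate code unit can simply be dropped.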

    I suggest this:

    public static String removeSurrogates(String query) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            // On Java 7 and later this is simply !Character.isSurrogate(c)
            if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
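
To see what the suggested method does on concrete input, here is a minimal, self-contained sketch. The class name RemoveSurrogatesDemo and the sample strings are made up for illustration; the method body is the one suggested above, with the appended variable being c.

```java
class RemoveSurrogatesDemo {
    // Drops every UTF-16 surrogate code unit, whether it is part of a pair or not.
    static String removeSurrogates(String query) {
        StringBuilder sb = new StringBuilder(query.length());
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is stored in a Java String as a high/low surrogate pair.
        System.out.println(removeSurrogates("a\uD83D\uDE00b")); // prints "ab"
        // A lone (unpaired) high surrogate is dropped as well.
        System.out.println(removeSurrogates("x\uD83Dy"));       // prints "xy"
        // Empty input needs no special handling.
        System.out.println(removeSurrogates("").isEmpty());     // prints "true"
    }
}
```

Note that, unlike the method in the question, this version also handles an empty string, because it never reads the character at index length() - 1.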
    

    Breaking down the if statement

    You asked about this statement:

    if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
        sb.append(c);
    }
    

    One way to understand it is to break each operation into its own function, so you can see that the combination does what you'd expect:

    static boolean isSurrogate(char c) {
        return Character.isHighSurrogate(c) || Character.isLowSurrogate(c);
    }
    
    static boolean isNotSurrogate(char c) {
        return !isSurrogate(c);
    }
    
    ...
    
    if (isNotSurrogate(c)) {
        sb.append(c);
    }
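
If you are on Java 7 or later, the hand-written helper can be replaced by the built-in Character.isSurrogate(char). The sketch below is only an illustration, and the class name IsSurrogateCheck and the method countMismatches are hypothetical. It uses the built-in call in the loop and also checks that the two-call helper agrees with it for every possible char value.

```java
class IsSurrogateCheck {
    // Hand-written helper from the breakdown above.
    static boolean isSurrogate(char c) {
        return Character.isHighSurrogate(c) || Character.isLowSurrogate(c);
    }

    // Same loop as before, using the built-in check available since Java 7.
    static String removeSurrogates(String query) {
        StringBuilder sb = new StringBuilder(query.length());
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (!Character.isSurrogate(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Counts the char values on which the helper and the built-in disagree.
    static int countMismatches() {
        int mismatches = 0;
        for (int v = Character.MIN_VALUE; v <= Character.MAX_VALUE; v++) {
            char c = (char) v;
            if (isSurrogate(c) != Character.isSurrogate(c)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println(countMismatches());                  // prints 0
        System.out.println(removeSurrogates("a\uD83D\uDE00b")); // prints "ab"
    }
}
```

Both checks cover the same range of code units (high surrogates plus low surrogates), so either form can be used in the if statement.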