java encoding mbstring

mb_strlen and Android Java report different lengths for the same character


In PHP

echo mb_strlen('🌦') the result is 1

In Android Java

"🌦".length() the result is 2

Another way to write the same char/icon

"\uD83C\uDF26".length() the result is 2

Android Encoding

Charset defaultCharset = Charset.defaultCharset() => UTF-8

(new OutputStreamWriter(new ByteArrayOutputStream())).getEncoding() => UTF-8

File encoding is UTF-8.

Questions

Why does Android Java show a different result than mb_strlen?

I assume the mb_strlen result is right and the length is 1. How can I make Java treat the string so that it calculates the length as 1?

Later edit:

The problem is that I have a string coming from a PHP server in the form LENGTH|STRING..., for example: 5|juice3|aha3|yes

If the string contains '🌦', for example 7|sample🌦3|yes, then Android Java counts it as 2 instead of 1 and parses the string incorrectly.


Solution

    Thank you all, the codePoint hint got me a starting point.
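    As a small illustration of why the two counts differ: String.length() returns the number of UTF-16 code units, and '🌦' (U+1F326) is stored as a surrogate pair, while codePointCount counts whole code points, which should match mb_strlen when PHP treats the input as UTF-8. The class and method names below are made up for this sketch.

```java
// Hypothetical demo class; the names are illustrative only.
class CodePointLength {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String icon = "\uD83C\uDF26"; // U+1F326, stored as a surrogate pair
        System.out.println(utf16Units(icon));            // 2
        System.out.println(codePoints(icon));            // 1
        System.out.println(codePoints("sample" + icon)); // 7
    }
}
```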

    While looping char by char through the text received from PHP:

    1. Changed int count = sb.length(); to int count = sb.codePointCount(0, sb.length());

    2. Changed char charAt = sb.charAt(i); to int charAt = sb.codePointAt(i);

    3. Most importantly, changed

                                String definition = sb.substring(i, i + defLength);
    
                                i += defLength - 1;
    

    to

                                // +10% headroom, because a code point outside the BMP takes two chars (a surrogate pair)
                                StringBuilder definitionBuilder = new StringBuilder(defLength + defLength / 10);
    
                                int offset = 0;
                                for (int times = 0; times < defLength; times++)
                                {
                                    if (sb.length() > i + offset)
                                    {
                                        int codepoint = sb.codePointAt(i + offset);
                                        definitionBuilder.appendCodePoint(codepoint);
                                        offset += Character.charCount(codepoint);
                                    }
                                    else
                                    {
                                        Debug.d("Out of bounds, i = " + i + ", offset = " + offset + ", times = " + times);
                                        break;
                                    }
                                }
    
                                String definition = definitionBuilder.toString();
    
                                i += offset - 1;
    

    The solution is not perfect, but exemplifies the fix.

    Step 3 sometimes throws OutOfBounds, but that may be caused by wrong server data, which is why it has the odd-looking guard if (sb.length() > i + offset).
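    For comparison, here is a minimal self-contained sketch of parsing the LENGTH|STRING format with String.offsetByCodePoints, which advances by a number of code points rather than chars. It assumes the input is nothing but repeated LENGTH|STRING records with LENGTH counted in code points (as mb_strlen would report for UTF-8); the class name LengthPrefixedParser and the method parse are hypothetical and not taken from the code above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for input such as "5|juice3|aha3|yes".
class LengthPrefixedParser {
    static List<String> parse(String input) {
        List<String> fields = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            // The length prefix runs up to the next '|'.
            int bar = input.indexOf('|', pos);
            if (bar < 0) {
                throw new IllegalArgumentException("Missing '|' after offset " + pos);
            }
            int length = Integer.parseInt(input.substring(pos, bar));
            if (length < 0) {
                throw new IllegalArgumentException("Negative length at offset " + pos);
            }
            int start = bar + 1;
            int end;
            try {
                // Move forward by 'length' code points, not chars.
                end = input.offsetByCodePoints(start, length);
            } catch (IndexOutOfBoundsException e) {
                // The server announced more code points than were actually sent.
                throw new IllegalArgumentException(
                        "Field at offset " + start + " is shorter than " + length + " code points", e);
            }
            fields.add(input.substring(start, end));
            pos = end;
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("5|juice3|aha3|yes"));             // [juice, aha, yes]
        System.out.println(parse("7|sample\uD83C\uDF263|yes").size()); // 2
    }
}
```

    Throwing on a short field, instead of logging and breaking out of the loop, is a design choice of this sketch; it makes bad server data visible to the caller rather than silently truncating the result.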