excel-formulaunicode

Length of strings with Unicode in Excel


How to calculate the length of a Unicode string in Excel?

I tried both LEN and LENB, but both of them give the same result.
For non-English words, the result is correct, 5.

=LENB("こんにちは")

But for other strings, the result is wrong, 4 instead of 3.

=LENB("^🚀^")

The only solution I found, was to compare all substrings created with the function LEFT, but this looks overkill.

=1-SUM(-(LEFT(A1,SEQUENCE(LEN(A1)))<>LEFT(A1,SEQUENCE(LEN(A1),,2))))

The same issue is with the function MID or MIDB. Instead of =MID("^🚀🌍^",2,2), I have to write

=RIGHT(LEFT("^🚀🌍^",3),2)

Does anybody know a better way?



--- UPDATE 4/25/2023--- After the response from @JB-007, I did some more digging, and I found more anomalies.

  1. I reduced my function a little, but the answer is different than all other options. :-(

    =1-SUM(-(LEFT(A1,SEQUENCE(LEN(A1)))<>A1))
    
  2. If I append a standard ASCII character to 🤷‍♂️, all functions increase by 2, not 1.

  3. If I append a standard ASCII character to 🤦🏻‍♂️, all functions increase by 1. That is correct.

In my opinion, all 3 functions should give the same answers. But they are not. The function LEN is the most consistent: it gives the number of code points in the UNICODE character.

For me, in the absence of a function that returns 1 for a UNICODE character (grapheme), the result of the LEN function is useless.

Here are the final results: enter image description here


Solution

  • I just ran into the same problem and tried to come up with a solution with some heavy assistance from GPT.

    After some back and forth and trying to wrap my head around it, it ended up being this function:

    Function TrueLength(inputStr As String) As Integer
    Dim i As Integer
    Dim extraCount As Integer
    extraCount = 0
    Dim prevCharWasHighSurrogate As Boolean
    prevCharWasHighSurrogate = False
    
    For i = 1 To Len(inputStr)
        Dim currentChar As String
        currentChar = Mid(inputStr, i, 1)
        Dim currentCharCode As Long
        currentCharCode = AscW(currentChar)
    
        If currentCharCode >= &HD800 And currentCharCode <= &HDBFF Then
            extraCount = extraCount + 1
        ElseIf currentCharCode = &H200D Then
            extraCount = extraCount + 1
        ElseIf currentCharCode = &HFE0F Then
            extraCount = extraCount + 2
        End If
    Next i
    
    TrueLength = Len(inputStr) - extraCount
    End Function
    

    The way I understand it it works like this:

    The breakdown for 🤷‍♂️ is as follows:

    1. U+D937 – First part of the surrogate pair that is the shrug emoji; caught by the function and not counted

    2. U+DC4D – Second part of the surrogate pair for the shrug emoji – counted by the function as 1 char.

    3. U+200D – Zero width joiner and therefore not counted

    4. U+2642 – male symbol usually displayed as "♂" – usually counted, however, see Nr. 5

    5. U+FE0F – the variation selector supplement that indicates the last symbol being a variation indicator and not its own symbol. In my solution this subtracts 2 from the final count.

    I feel like the zero width joiner could maybe be used to not count any character until the next whitespace character maybe? Just an intuition though.

    Short conclusion: For a proper grapheme length counter, pretty much every edge case would need to be caught by functions that can correctly interpret unicode values. This seems to be only realistic on a case by case basis as there are many potential issues depending on what sorts of symbols come up. This could warrant a big project in itself.