How to calculate the length of a Unicode string in Excel?
I tried both LEN
and LENB
, but both of them give the same result.
For non-English words, the result is correct, 5.
=LENB("こんにちは")
But for other strings, the result is wrong, 4 instead of 3.
=LENB("^🚀^")
The only solution I found, was to compare all substrings created with the function LEFT
, but this looks overkill.
=1-SUM(-(LEFT(A1,SEQUENCE(LEN(A1)))<>LEFT(A1,SEQUENCE(LEN(A1),,2))))
The same issue is with the function MID or MIDB. Instead of =MID("^🚀🌍^",2,2), I have to write
=RIGHT(LEFT("^🚀🌍^",3),2)
Does anybody know a better way?
--- UPDATE 4/25/2023---
After the response from @JB-007
, I did some more digging, and I found more anomalies.
I reduced my function a little, but the answer is different than all other options. :-(
=1-SUM(-(LEFT(A1,SEQUENCE(LEN(A1)))<>A1))
If I append a standard ASCII character to 🤷♂️, all functions increase by 2, not 1.
If I append a standard ASCII character to 🤦🏻♂️, all functions increase by 1. That is correct.
In my opinion, all 3 functions should give the same answers. But they are not. The function LEN is the most consistent: it gives the number of code points in the UNICODE character.
For me, in the absence of a function that returns 1 for a UNICODE character (grapheme), the result of the LEN function is useless.
I just ran into the same problem and tried to come up with a solution with some heavy assistance from GPT.
After some back and forth and trying to wrap my head around it, it ended up being this function:
Function TrueLength(inputStr As String) As Integer
Dim i As Integer
Dim extraCount As Integer
extraCount = 0
Dim prevCharWasHighSurrogate As Boolean
prevCharWasHighSurrogate = False
For i = 1 To Len(inputStr)
Dim currentChar As String
currentChar = Mid(inputStr, i, 1)
Dim currentCharCode As Long
currentCharCode = AscW(currentChar)
If currentCharCode >= &HD800 And currentCharCode <= &HDBFF Then
extraCount = extraCount + 1
ElseIf currentCharCode = &H200D Then
extraCount = extraCount + 1
ElseIf currentCharCode = &HFE0F Then
extraCount = extraCount + 2
End If
Next i
TrueLength = Len(inputStr) - extraCount
End Function
The way I understand it it works like this:
Unicode values between U+D800
and U+DBFF
are a surrogate byte found in UTF-16 chars such as emojis and are not counted by the function (or rather added to a list of bytes to be subtracted from the total at the end)
The value U+200D
is used to combine emojis so it should also not be counted. Called the zero-width joiner.
The value U+FE0F
is even more special as it indicates that a previous sign (which is the ♂ in the case of "🤷♂️") was used as a variation of an emoji. So whatever that previous symbol was and this value itself should not be counted. This wikipedia article is relevant as far as I understand: https://en.wikipedia.org/wiki/Variation_Selectors_Supplement
The breakdown for 🤷♂️ is as follows:
U+D937 – First part of the surrogate pair that is the shrug emoji; caught by the function and not counted
U+DC4D – Second part of the surrogate pair for the shrug emoji – counted by the function as 1 char.
U+200D – Zero width joiner and therefore not counted
U+2642 – male symbol usually displayed as "♂" – usually counted, however, see Nr. 5
U+FE0F – the variation selector supplement that indicates the last symbol being a variation indicator and not its own symbol. In my solution this subtracts 2 from the final count.
I feel like the zero width joiner could maybe be used to not count any character until the next whitespace character maybe? Just an intuition though.
Short conclusion: For a proper grapheme length counter, pretty much every edge case would need to be caught by functions that can correctly interpret unicode values. This seems to be only realistic on a case by case basis as there are many potential issues depending on what sorts of symbols come up. This could warrant a big project in itself.