I'm working on an Angular 17 reactive form where I send the form data to a PHP API on the server and store it in a database.
I would like the user to be able to enter emojis in the form, so I have set my database to the utf8mb4_unicode_ci collation so that the emojis can be stored.
Security is very important to me so I do several checks on both the client side and the server side for various things.
One of the checks I do is on the length of the input. I was wondering if you can help, because the length results are inconsistent between the client side and the server side (since the string contains emojis).
Using the JavaScript .length property and the built-in Angular form validators minLength and maxLength, I see that they all calculate the length in the same way (for example, most of the basic smiley emojis are calculated as having a length of 2).
However, when I send this data (which includes emojis) to the server side, I use the PHP function mb_strlen($subject, 'utf8') and the values are different (most of the basic smiley emojis are calculated as having a length of 1, and they also take up 1 varchar character in the database).
I've tested about 160 emojis to see what values they return on both client side and server side in order to try and work out a pattern (so that I can do checks for length in the right way).
As you can see from my screenshots below, in most cases mb_strlen($subject, 'utf8') returns a lower value for the length than JavaScript .length. Sometimes it returns the same value as the JavaScript .length property, but in all these cases mb_strlen($subject, 'utf8') has never returned a length greater than what JavaScript .length returns.
Is it safe to assume that mb_strlen($subject, 'utf8') will never return a value greater than JavaScript .length for the rest of the existing emojis that I have not tested?
If not, could you explain a bit more about this and give some examples of characters where mb_strlen($subject, 'utf8') would return a greater value than JavaScript .length?
Thank you
JavaScript strings are encoded as UTF-16, and the length property is the number of UTF-16 code units in the string. Each UTF-16 code unit is 2 bytes long. In PHP, you can compute the same length like this:
# Assume the input string is encoded with UTF-8
$str2 = mb_convert_encoding($str, 'UTF-16LE', 'UTF-8');
$length = strlen($str2) / 2;
However, mb_strlen counts the number of Unicode code points in a string, not code units. A code point in the Basic Multilingual Plane (U+0000 to U+FFFF) fits in 2 bytes and is encoded as a single UTF-16 code unit. A code point above U+FFFF, which includes most emojis, needs more than 2 bytes, so UTF-16 represents it with a surrogate pair (2 UTF-16 code units). Since every code point takes at least one code unit, the result of mb_strlen will never be greater than the string.length property in JavaScript.
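If the goal is for the client-side check to agree with mb_strlen, one option is to count code points in the browser instead of relying on .length. The following is only a minimal sketch: codePointLength is a hypothetical helper name, not part of Angular or the standard library, and it assumes the input is a well-formed string (no lone surrogates) that the server receives as valid UTF-8.

```typescript
// Hypothetical helper (not an Angular or built-in API).
// Array.from iterates a string by Unicode code point, so a surrogate
// pair is counted once. For well-formed input this should match what
// PHP's mb_strlen($s, 'UTF-8') reports.
function codePointLength(value: string): number {
  return Array.from(value).length;
}

// Sample inputs, written with escapes so the code units are visible.
const samples: string[] = [
  "abc",            // three BMP code points
  "\uD83D\uDE00",   // U+1F600 grinning face, stored as one surrogate pair
  "\u2764\uFE0F",   // U+2764 heart followed by U+FE0F variation selector
];

for (const s of samples) {
  // .length counts UTF-16 code units; codePointLength counts code points.
  console.log(s, "length:", s.length, "code points:", codePointLength(s));
}
```

The second sample is a case where .length is larger (2 versus 1), and the third is a case where both counts are equal (2 and 2) because both code points lie in the Basic Multilingual Plane.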