What is the difference between utf8mb4 and utf8 charsets in MySQL?
I already know about the ASCII, UTF-8, UTF-16 and UTF-32 encodings, but I'm curious to know how the utf8mb4 group of encodings differs from the other encoding types defined in MySQL Server.
Are there any special benefits/purposes of using utf8mb4 rather than utf8?
UTF-8 is a variable-length encoding. In the case of UTF-8, this means that storing one code point requires one to four bytes. However, MySQL's encoding called "utf8" (alias of "utf8mb3") only stores a maximum of three bytes per code point.
So the character set "utf8"/"utf8mb3" cannot store all Unicode code points: it only supports the range 0x0000 to 0xFFFF, which is called the "Basic Multilingual Plane" (BMP). See also Comparison of Unicode encodings.
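As a rough illustration of the byte counts involved, the following Python sketch encodes a few sample characters as UTF-8 and reports how many bytes each one needs. It uses Python's own UTF-8 codec, not MySQL, and the helper name `utf8_len` and the sample characters are chosen here purely for illustration.

```python
def utf8_len(ch: str) -> int:
    """Return the number of bytes needed to encode one character as UTF-8."""
    return len(ch.encode("utf-8"))


# Illustrative samples: three BMP characters and one supplementary character.
samples = [
    "A",           # U+0041, ASCII
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the BMP
]

for ch in samples:
    print(f"U+{ord(ch):04X} needs {utf8_len(ch)} byte(s) in UTF-8")
```

Only the last sample needs four bytes, which is exactly the case that a three-byte-per-character limit cannot represent.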
It's recommended to use utf8mb4. utf8mb3 is deprecated as of MySQL 8.
This is what the MySQL documentation has to say about it:
The utf8mb4 character set has these characteristics:
Supports BMP and supplementary characters.
Requires a maximum of four bytes per multibyte character.
utf8mb4 contrasts with the utf8mb3 character set, which supports only BMP characters and uses a maximum of three bytes per character:
For a BMP character, utf8mb4 and utf8mb3 have identical storage characteristics: same code values, same encoding, same length.
For a supplementary character, utf8mb4 requires four bytes to store it, whereas utf8mb3 cannot store the character at all. When converting utf8mb3 columns to utf8mb4, you need not worry about converting supplementary characters because there are none.
utf8mb3 cannot store characters that lie outside the BMP, such as emoji, and you usually want to be able to store those.
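To get a feel for which strings would be affected, here is a minimal Python sketch that checks whether every character of a string lies in the BMP. The function name `fits_in_utf8mb3` is hypothetical, and the check only approximates the restriction described above by comparing code points against 0xFFFF; it does not talk to a MySQL server.

```python
BMP_MAX = 0xFFFF  # highest code point in the Basic Multilingual Plane


def fits_in_utf8mb3(text: str) -> bool:
    """Return True if every character in text is a BMP character.

    Assumption: a string is storable in a three-byte-per-character
    column exactly when none of its code points exceed U+FFFF.
    """
    return all(ord(ch) <= BMP_MAX for ch in text)


print(fits_in_utf8mb3("na\u00efve caf\u00e9"))   # BMP characters only
print(fits_in_utf8mb3("hello \U0001F600"))       # contains an emoji
```

A check like this could be run over existing application data to estimate how often supplementary characters actually occur before deciding on a character set.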
See also What are the most common non-BMP Unicode characters in actual use?.