
What is the difference between utf8mb4 and utf8 charsets in MySQL?



I already know about the ASCII, UTF-8, UTF-16 and UTF-32 encodings, but I'm curious to know how the utf8mb4 group of encodings differs from the other encoding types defined in MySQL Server.

Are there any special benefits/purposes of using utf8mb4 rather than utf8?


Solution

  • UTF-8 is a variable-length encoding: storing one code point requires one to four bytes. However, MySQL's character set called "utf8" (an alias of "utf8mb3") stores a maximum of three bytes per code point.

    So the character set "utf8"/"utf8mb3" cannot store all Unicode code points: it only supports the range 0x0000 to 0xFFFF, which is called the "Basic Multilingual Plane" (BMP). See also Comparison of Unicode encodings.
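To make the byte counts concrete, here is a minimal Python sketch (not MySQL code) that prints how many bytes standard UTF-8 needs for a few sample characters. The helper name `utf8_len` and the choice of samples are assumptions made for illustration only.

```python
def utf8_len(ch):
    """Return how many bytes standard UTF-8 needs to encode one character."""
    return len(ch.encode("utf-8"))

# Sample characters, written as escapes so the source stays plain ASCII.
samples = [
    ("A", "U+0041, ASCII"),
    ("\u00e9", "U+00E9, Latin-1 range"),
    ("\u20ac", "U+20AC, euro sign, still in the BMP"),
    ("\U0001F600", "U+1F600, emoji, outside the BMP"),
]

for ch, label in samples:
    # Code points above 0xFFFF are the ones that need a fourth byte.
    print(f"{label}: {utf8_len(ch)} byte(s), in BMP: {ord(ch) <= 0xFFFF}")
```

Only the last sample needs four bytes, which is exactly the case a three-byte-per-character set cannot represent.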

    It's recommended to use utf8mb4. utf8mb3 is deprecated as of MySQL 8.

    This is what the MySQL documentation has to say about it:

    The utf8mb4 character set has these characteristics:

    • Supports BMP and supplementary characters.

    • Requires a maximum of four bytes per multibyte character.

    utf8mb4 contrasts with the utf8mb3 character set, which supports only BMP characters and uses a maximum of three bytes per character:

    • For a BMP character, utf8mb4 and utf8mb3 have identical storage characteristics: same code values, same encoding, same length.

    • For a supplementary character, utf8mb4 requires four bytes to store it, whereas utf8mb3 cannot store the character at all. When converting utf8mb3 columns to utf8mb4, you need not worry about converting supplementary characters because there are none.

    In short, utf8mb3 cannot store characters that lie outside the BMP, such as emoji, and you usually want to be able to store those.
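If you are stuck with a utf8mb3 column for now, a client-side check along these lines can flag text that would not fit. This is a rough Python sketch under the assumption that "fits" simply means "every code point is at or below U+FFFF"; `find_supplementary` and `fits_utf8mb3` are hypothetical helper names, not part of MySQL or any connector library.

```python
def find_supplementary(text):
    """Return the characters in text whose code point lies above U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def fits_utf8mb3(text):
    """True if text contains only BMP characters (assumed utf8mb3-safe)."""
    return not find_supplementary(text)


# Hypothetical inputs: one BMP-only string and one containing an emoji.
print(fits_utf8mb3("caf\u00e9 \u20ac5"))       # BMP only
print(fits_utf8mb3("thanks \U0001F600"))       # contains U+1F600
print(find_supplementary("thanks \U0001F600"))
```

With utf8mb4 this kind of pre-check becomes unnecessary, which is the practical argument for switching.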

    See also "What are the most common non-BMP Unicode characters in actual use?"