mysqlencodingcharacter-encoding

mysqldump - maintain character set and collations


I'm looking for the most secure way to preserve my database data in a .sql backup.

This:

mysqldump -u root -p DBName > backupName.sql

outputs also these lines for my database:

DROP TABLE IF EXISTS `tableName`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!50503 SET character_set_client = utf8mb4 */;
CREATE TABLE `tableName` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `col1` int(11) unsigned NOT NULL,
  `col2` int(11) unsigned NOT NULL,
  ...
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;
/*!40101 SET character_set_client = @saved_cs_client */;

How does this line work about the encoding?

/*!50503 SET character_set_client = utf8mb4 */;

I remember that those data were saved with some utf8 encoding but not with utf8mb4, maybe utf8mb4 can handle correctly all sub-set like utf8 and utf8_general_ci and utf8_unicode_ci?

(I'm using Ubuntu with MySQL 8)


Solution

  • Yes, utf8mb4 is a superset of utf8.

    utf8 supports only the Basic Multilingual Plane of the UTF-8 standard — i.e. 1-byte, 2-byte, and 3-byte code points.

    utf8mb4 supports everything utf8 does, and in addition, supports the Supplemental Multilingual Plane of the UTF-8 standard.

    As of MySQL 8.0.28, utf8 is now known as utf8mb3. It has been documented that a future release of MySQL will repurpose the utf8 alias to the utf8mb4 character set.

    The character_set_client only describes the character set used by the client to encode character data it sends. This doesn't have to be the same as the character set used by each table, if there's a valid conversion path from the client character set into whatever is used by the respective table.

    In other words, if you set the client character set to utf8mb4, and the table uses utf8 (a subset), it's fine as long as the client doesn't send 4-byte characters from the supplemental utf8 plane (this includes for example emojis).

    utf8_general_ci and utf8_unicode_ci are not character sets, they are collations. This doesn't affect storage of strings at all, but it affects the sort order used as indexes are built, and it also affects character equivalence for unique constraints.