mysqlutf-8utf8mb4invalid-charactersetherpad

How to debug invalid data in MySQL utf8mb4 column in Etherpad Lite database


We're running Etherpad Lite and we're trying to migrate database from MySQL to PostgreSQL.

MySQL database 'value' column is of type utf8mb4. However, around 10% of all rows contain value that is in fact encoded in Windows-1252 or ISO-8859-15 instead of UTF-8. How is this possible? Does not MySQL validate the UTF-8 before entering it into the column?

PostgreSQL cannot accept the invalid values during migration because it does validate the data and hits e.g. raw byte 0xE4 (ISO-8859-15: ä) which should be encoded as byte sequence 0xC3 0xA4 in UTF-8.

Is this a known "feature" of MySQL? Is there any way to always get real UTF-8 from utf8mb4 column?


Solution

  • No solution is known. This is probably a bug in MySQL which should disallow storing non-UTF-8 data in case client connnection and column type are both utf8mb4.

    I no longer use MySQL for anything so I haven't bothered to try to figure this bug any more. Nowadays, I'm using PostgreSQL for everything instead.