We're running Etherpad Lite and we're trying to migrate database from MySQL to PostgreSQL.
MySQL database 'value' column is of type utf8mb4. However, around 10% of all rows contain value that is in fact encoded in Windows-1252 or ISO-8859-15 instead of UTF-8. How is this possible? Does not MySQL validate the UTF-8 before entering it into the column?
PostgreSQL cannot accept the invalid values during migration because it does validate the data and hits e.g. raw byte 0xE4 (ISO-8859-15: ä
) which should be encoded as byte sequence 0xC3 0xA4 in UTF-8.
Is this a known "feature" of MySQL? Is there any way to always get real UTF-8 from utf8mb4
column?
No solution is known. This is probably a bug in MySQL which should disallow storing non-UTF-8 data in case client connnection and column type are both utf8mb4.
I no longer use MySQL for anything so I haven't bothered to try to figure this bug any more. Nowadays, I'm using PostgreSQL for everything instead.