I'm getting errors when certain characters are being added to a table... even when the column is has utf8mb4
character set. For example:
SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xE0' for column 'surname'
The data in question is: SEGAL‡ (note the double dagger)
Is this character beyond even 4 byte UTF8 or is the collation causing the issue? Or is it something else?
Screenshot showing character set and collation of the column:
It's a Laravel 8 app and the MySQL connection is configured to the following:
'charset' => 'utf8mb4',
'collation' => 'utf8mb4_unicode_ci',
Looking at the CSV file in PHPStorm, non-ASCII characters are displayed as �
. I've tried explicitly setting the file encoding to UTF-8 in PHPStorm (with and without BOM).
If I open the CSV in Excel then the non-ASCII characters display correctly. Confused.
Examining the CSV in a HEX editor shows that a character like ä
is stored as a single byte (8A
). When this CSV is opened in Excel it correctly shows ä
, but in everything else it shows �
.
I don't know what character encoding Excel is using, as this character should be typically stored as E4
when using a single byte, or C3 A4
in UTF-8.
Double dagger is hex E2 80 A1 (only 3 bytes) when encoded in CHARACTER SET
utf8 or utf8mb4. It is also available in latin1 as hex 87.
C3A4 [ä]LATIN SMALL LETTER A WITH DIAERESIS (in utf8/utf8mb4) or E4 in latin1.
Please provide SHOW VARIABLES LIKE 'char%';
See "black diamond" in Trouble with UTF-8 characters; what I see is not what I stored for the likely cause.
The CHARACTER SET
is the "encoding" and is relevant; the COLLATION
is how characters compare (eg case-folding) and is not relevant.
Nothing in the Question hints of a BOM (http://mysql.rjweb.org/doc.php/charcoll#bom_byte_order_mark) EF BB BF at the beginning of some UTF-8-encoded files.
Do not use ucs-2 (UTF-16) unless you happen to have a file encoded that way. Still, I would declare the tables as utf8, not ucs-2.
More
I plugged 8AE0E8
into a handy script (that does CONVERT(CONVERT(UNHEX('8ae0e8') USING %s) USING utf8mb4)
and got:
ascii, big5, binary, cp932, eucjpms, euckr,
gb18030, gb2312, gbk, sjis, ujis,
utf8mb3, utf8mb4 ---invalid in these charsets
swe7 3 3 '???'
cp1257 5 3 '?ąč'
cp1256 5 3 '?àè'
geostd8 5 3 '?ჭ?'
latin7 6 3 'ąč'
dec8, latin5 6 3 'àè'
hp8 6 3 'ÁÒ'
macce 6 3 'äŗŤ'
cp850 6 3 'èÓÞ'
keybcs2 6 3 'ĹαΦ'
cp852 6 3 'ŐÓŔ'
latin2 6 3 'ŕč'
latin1 6 3 'Šàè'
cp1250 6 3 'Šŕč'
greek 6 3 'ΰθ'
cp866 6 3 'Крш'
cp1251 6 3 'Љаи'
armscii8 6 3 'ՈՌ'
hebrew 6 3 'אט'
koi8r, koi8u 7 3 '┼ЮХ'
macroman 7 3 'ä‡Ë'
tis620 8 3 'เ่'
It looks like macroman
is the charset. I think that is specific to Apple.