I am working with web-app (JSP) which inserts data to mySQL database from webform, the data is sent to servlet as parameters encoded in UTF8. Application works perfectly with normal letters and with symbols till certain extent. But if I am trying to insert any 4 byte character will it be replaced by question mark (?) symbol.
I am pretty sure the problem has something to do with MySQL weird way of having UTF8 as 3 bytes only, but this time the collation is SJIS.
I must be overlooking something so I would appreciate any help available, I have been hitting my head to wall for one day for this.
as for collation information, I have tried multiple different settings, result is always the same, everything works fine, except the 4byte characters.
this is the default collation:
SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%'
OR Variable_name LIKE 'collation%';
+--------------------------+-------------------+
| Variable_name | Value |
+--------------------------+-------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | sjis |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| collation_connection | utf8_general_ci |
| collation_database | sjis_japanese_ci |
| collation_server | latin1_swedish_ci |
+--------------------------+-------------------+
I have also tried with following:
+--------------------------+------------------+
| Variable_name | Value |
+--------------------------+------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | sjis |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | sjis |
| character_set_system | utf8 |
| collation_connection | utf8_general_ci |
| collation_database | sjis_japanese_ci |
| collation_server | sjis_japanese_ci |
+--------------------------+------------------+
example of a table I am inserting into(Z column):
show FULL COLUMNS FROM XYZ;
+--------+------------------+------------------+------+-----+---------+----- -----------+---------------------------------+---------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+--------+------------------+------------------+------+-----+---------+----- -----------+---------------------------------+---------+
| X | int(10) unsigned | NULL | NO | PRI | NULL | auto_increment | select,insert,update,references | |
| Y | date | NULL | YES | | NULL | | select,insert,update,references | |
| Z | varchar(255) | sjis_japanese_ci | YES | | NULL | | select,insert,update,references | |
+--------+------------------+------------------+------+-----+---------+----------------+---------------------------------+---------+
inside JAVA-class encoding is set as following
request.setCharacterEncoding("UTF-8");
response.setCharacterEncoding("SHIFT_JIS");
I know DB can hold this characters as previously imported(LODA DATA INFILE) data has these characters and they are visible in DB (not question marks).
so Friends, I ask your help with this, this is probably something very easy (or impossible), if you need more information I can get it from the DB/source.
example of UTF8 4 byte character is: (might not be visible of your browser)
𠜎
or :) https://codepoints.net/U+1F4A9
Thank you very much!
I have tried absolutely everything to make this work with SJIS but didn't succeed, I fixed the situation with altering all the tables to utf8mb4.
ALTER TABLE xxx CONVERT TO CHARACTER SET utf8mb4;
and changing encoding all the way to UTF-8:
request.setCharacterEncoding("UTF-8");
response.setCharacterEncoding("UTF-8");
stay away from SJIS if possible.