mysqlunicodeutf-8nscharacterset

How to correctly handle dakuten and handakuten Japanese characters in mysql?


Disclaimer:

  1. Database is ut8mb4_unicode_520_ci
  2. Table field is ut8mb4_unicode_520_ci

How do you correctly query a table field that contains dakuten or handakuten Japanese characters? Dakuten.

Currently, it seems that the base character is returned, even when the query is ran for the tenten version.

Example Data

Given and . And a row with ID: 199, post_title: 'へ';

Scenario 1

Run:

SELECT 'へ' = 'ぺ'; 

-- Returns 0. Correct

Scenario 2

Run:

SELECT ID, post_title 
FROM wp_posts 
WHERE post_title = 'へ';

-- Returns row 199. Correct

Scenario 3

But, for some reason, when I run this query, it still returns record 199, noting the different title value.

Run:

SELECT ID, post_title 
FROM wp_posts 
WHERE post_title = 'ぺ';

-- Returns row 199. Incorrect

Example Image

An image would explain better (I'm just using union to better diplay everything in one screenshot):

enter image description here

Is there a solid approach to working with these characters? All other Japanese characters seem to work fine, its just the dakuten versions are treated like their bases in queries only.


Solution

  • This is because the collation you used (utf8mb4_unicode_ci, utf8mb4_unicode_520_ci and utf8mb4_0900_ai_ci) only compares character's base letter. For example, 'ぺ' = 'へ' + U+309A ◌゚, 'へ' is the base letter of 'ぺ'. So for your case, all 3 characters' base letter is same, 'へ'. So it is correct result for those collations return '1'.

    MySQL team is developing a new Japanese collation for utf8mb4 character set. It will differentiate these dakuten characters from base character. It will come soon.