Disclaimer:
ut8mb4_unicode_520_ci
ut8mb4_unicode_520_ci
How do you correctly query a table field that contains dakuten or handakuten Japanese characters? Dakuten.
Currently, it seems that the base character is returned, even when the query is ran for the tenten version.
Example Data
Given へ
and ぺ
.
And a row with ID: 199, post_title: 'へ';
Scenario 1
Run:
SELECT 'へ' = 'ぺ';
-- Returns 0. Correct
Scenario 2
Run:
SELECT ID, post_title
FROM wp_posts
WHERE post_title = 'へ';
-- Returns row 199. Correct
Scenario 3
But, for some reason, when I run this query, it still returns record 199, noting the different title value.
Run:
SELECT ID, post_title
FROM wp_posts
WHERE post_title = 'ぺ';
-- Returns row 199. Incorrect
Example Image
An image would explain better (I'm just using union to better diplay everything in one screenshot):
Is there a solid approach to working with these characters? All other Japanese characters seem to work fine, its just the dakuten versions are treated like their bases in queries only.
This is because the collation you used (utf8mb4_unicode_ci, utf8mb4_unicode_520_ci and utf8mb4_0900_ai_ci) only compares character's base letter. For example, 'ぺ' = 'へ' + U+309A ◌゚, 'へ' is the base letter of 'ぺ'. So for your case, all 3 characters' base letter is same, 'へ'. So it is correct result for those collations return '1'.
MySQL team is developing a new Japanese collation for utf8mb4 character set. It will differentiate these dakuten characters from base character. It will come soon.