[SOLVED] Charset comparison

I need urgent help comparing strings. A string written to table1 in a database is stored as UTF-8 but still looks strange: ＳＡＤＩ. A string written to table2 in the same database is SADI, which looks normal. Whenever I compare the two, the result is false.

Any idea how to make the comparison work? (It should return true.)
Any idea how I can insert ＳＡＤＩ into the database as SADI?

Either would solve my problem.

Solution

In your strings, SADI is a standard ASCII string, but ＳＡＤＩ uses full-width Unicode characters.

For example, Ｓ is U+FF33 'FULLWIDTH LATIN CAPITAL LETTER S' (UTF-8: 0xEF 0xBC 0xB3),

but S is standard ASCII U+0053 'LATIN CAPITAL LETTER S' (UTF-8: 0x53).

The other characters are likewise from the Unicode "Halfwidth and Fullwidth Forms" block. They look like standard Latin letters but are different code points, so a plain string comparison against the ASCII version returns false.
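To check which characters a suspicious string really contains, it can help to print each character's code point, Unicode name, and UTF-8 bytes. Here is a minimal Python sketch of that check; the helper name describe is made up for illustration and is not part of any library:

```python
import unicodedata

def describe(text):
    """Return (char, code point, Unicode name, UTF-8 bytes in hex) for each character."""
    rows = []
    for ch in text:
        utf8_hex = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
        rows.append((ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), utf8_hex))
    return rows

# "\uFF33" is the full-width S from the question; "S" is the ASCII one.
for row in describe("\uFF33S"):
    print(row)
# ('Ｓ', 'U+FF33', 'FULLWIDTH LATIN CAPITAL LETTER S', 'EF BC B3')
# ('S', 'U+0053', 'LATIN CAPITAL LETTER S', '53')
```

Running this on the value read from table1 should show whether every letter is a full-width form or only some of them.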

How did they get there? That's a good question. Perhaps somebody copy-pasted text from Word, or typed it with an input method switched to full-width mode. Who knows.

You can convert these characters back to their ASCII equivalents by applying Unicode Normalization Form KC (NFKC). This Perl script works as a filter: it reads UTF-8 on standard input and writes normalized UTF-8 to standard output:

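use strict;
use warnings;
use Unicode::Normalize;              # provides NFKC()
binmode STDIN,  ':encoding(UTF-8)';  # decode input as UTF-8
binmode STDOUT, ':encoding(UTF-8)';  # encode output as UTF-8
while (<STDIN>) { print NFKC($_); }  # normalize line by line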

In PHP:

$result = Normalizer::normalize( $str, Normalizer::FORM_KC );

This requires the intl extension.
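The same idea covers both questions: normalize before comparing, and normalize before inserting. Below is a rough Python sketch using the standard unicodedata module; the function names normalize_nfkc and same_after_nfkc are hypothetical helpers, and the database calls are left out because the thread does not say which driver is used:

```python
import unicodedata

def normalize_nfkc(text):
    """Apply Unicode Normalization Form KC; use this before inserting into the database."""
    return unicodedata.normalize("NFKC", text)

def same_after_nfkc(a, b):
    """Compare two strings after normalizing both to NFKC."""
    return normalize_nfkc(a) == normalize_nfkc(b)

fullwidth = "\uFF33\uFF21\uFF24\uFF29"  # the full-width string from table1
plain = "SADI"                          # the ASCII string from table2

print(fullwidth == plain)                # False: different code points
print(same_after_nfkc(fullwidth, plain)) # True
print(normalize_nfkc(fullwidth))         # SADI
```

Keep in mind that NFKC folds more than width: ligatures, superscripts, and other compatibility characters are also replaced, so it is worth checking that this is acceptable for the data before normalizing everything on insert.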