
Charset comparison


I need urgent help. I can't compare strings that differ in their characters' encoding. A string written to table1 in a database is UTF-8 but still looks strange: ＳＡＤＩ. However, a string written to table2 in the same database is SADI, which looks normal. Whenever I compare the two, the result is false.

  1. Any idea how the comparison can be made? (The comparison should actually return true.)

  2. Any idea how I can insert ＳＡＤＩ as SADI into the database?

Either would be a solution, hopefully.


Solution

  • In your strings, SADI is a standard ASCII string, but ＳＡＤＩ uses full-width Unicode characters.

    For example, Ｓ is U+FF33 'FULLWIDTH LATIN CAPITAL LETTER S' (UTF-8: 0xEF 0xBC 0xB3), but S is standard ASCII U+0053 'LATIN CAPITAL LETTER S' (UTF-8: 0x53).

    The other characters are similar: they are Unicode characters that look like standard Latin letters but are in fact different code points.

    How did they get there - that's a good question. Probably somebody got really creative and copy-pasted something from Word? Who knows.

    You can convert these characters back to their ordinary equivalents by applying Unicode Normalization Form KC (NFKC), for example with this Perl script used as a filter (it reads UTF-8 on standard input and writes normalized UTF-8 to standard output):

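    use strict;
    use warnings;
    use Unicode::Normalize;              # exports NFKC()
    binmode STDIN,  ':encoding(UTF-8)';  # decode input as UTF-8
    binmode STDOUT, ':encoding(UTF-8)';  # encode output as UTF-8
    while (<STDIN>) { print NFKC($_); }  # normalize line by line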
    

    In PHP:

    $result = Normalizer::normalize( $str, Normalizer::FORM_KC );
    

    This requires the intl extension.
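
To make the comparison in question 1 concrete, here is a minimal sketch in Python that does the same thing as the Perl and PHP snippets: it normalizes both strings to NFKC before comparing them. The helper name `nfkc_equal` is made up for illustration, and the sketch assumes all four strange characters are full-width forms like the S identified above.

```python
import unicodedata

def nfkc_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

# Assumed sample data: full-width S, A, D, I versus plain ASCII.
fullwidth = "\uFF33\uFF21\uFF24\uFF29"
ascii_text = "SADI"

print(fullwidth == ascii_text)               # False: different code points
print(nfkc_equal(fullwidth, ascii_text))     # True: equal after NFKC
print(fullwidth[0].encode("utf-8").hex())    # efbcb3, the UTF-8 bytes of U+FF33
```

For question 2, the same normalization can be applied to the string before it is inserted, so that both tables end up holding the plain ASCII form. Whether that is appropriate depends on whether the full-width form ever needs to be preserved.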