Is there a good solution out there that handles this transliteration properly?
I've tried using iconv(), but it is very annoying and does not behave as one might expect.
//TRANSLIT will try to replace what it can, leaving everything non-convertible as "?".
//IGNORE will not leave "?" in the text, but it will not transliterate either, and it raises an E_NOTICE when a non-convertible character is found, so you have to call iconv with the @ error suppressor.
//IGNORE//TRANSLIT (as some people suggested on the PHP forum) is actually the same as //IGNORE (I tried it myself on PHP versions 5.3.2 and 5.3.13).
//TRANSLIT//IGNORE is the same as //TRANSLIT.
It also uses the current locale settings to transliterate.
WARNING: a lot of text and code follows!
Here are some examples:
$text = 'Regular ascii text + čćžšđ + äöüß + éĕěėëȩ + æø€ + $ + ¶ + @';
echo '<br />original: ' . $text;
echo '<br />regular: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> regular: Regular ascii text + ????? + ???ss + ?????? + ae?EUR + $ + ? + @
setlocale(LC_ALL, 'en_GB');
echo '<br />en_GB: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> en_GB: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @
setlocale(LC_ALL, 'en_GB.UTF8'); // will this work?
echo '<br />en_GB.UTF8: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> en_GB.UTF8: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @
OK, that did convert č ć ž š ä ö ü ß é ĕ ě ė ë ȩ and æ, but why not đ and ø?
// now specific locales
setlocale(LC_ALL, 'hr_Hr'); // this should fix croatian đ, right?
echo '<br />hr_Hr: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
// wrong > hr_Hr: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @
setlocale(LC_ALL, 'sv_SE'); // so this will fix swedish ø?
echo '<br />sv_SE: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
// will not > sv_SE: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @
//this is interesting
setlocale(LC_ALL, 'de_DE');
echo '<br />de_DE: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> de_DE: Regular ascii text + cczs? + aeoeuess + eeeeee + ae?EUR + $ + ? + @
// actually this is what any German would expect, since ä ö ü really are the same as ae oe ue
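Because the result depends on whichever locale happens to be active, one way to keep that side effect contained is to switch the locale only for the duration of the call. The following is a minimal sketch, not a fix for the missing đ and ø: `translit_with_locale` is a hypothetical helper name, and it assumes LC_CTYPE is the category iconv consults (as on glibc) and that the requested locale is installed on the machine.

```php
<?php
// Hypothetical helper (not a PHP built-in): transliterate under a specific
// locale, then put the previous locale back.
function translit_with_locale($text, $locale)
{
    // Passing '0' only queries the current setting without changing it.
    $previous = setlocale(LC_CTYPE, '0');

    // setlocale() returns false if the locale cannot be set on this system.
    if (setlocale(LC_CTYPE, $locale) === false) {
        return null;
    }

    try {
        // @ hides the notice iconv may raise; failure is reported as false.
        $result = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);
    } finally {
        // Restore the caller's locale whatever happened above.
        setlocale(LC_CTYPE, $previous);
    }

    return $result === false ? null : $result;
}

$out = translit_with_locale('äöü', 'de_DE.UTF-8');
echo $out === null ? 'locale missing or conversion failed' : $out;
```

The actual output still depends on the iconv implementation and the locale data installed, so it is deliberately not shown here.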
Let's try with //IGNORE:
echo '<br />ignore: ' . iconv("UTF-8", "ASCII//IGNORE", $text);
//> ignore: Regular ascii text + + + + + $ + + @
//+ E_NOTICE: "Notice: iconv(): Detected an illegal character in input string in /var/www/test.server.web/index.php on line 49"
// with translit?
echo '<br />ignore/translit: ' . iconv("UTF-8", "ASCII//IGNORE//TRANSLIT", $text);
//same as ignore only> ignore/translit: Regular ascii text + + + + + $ + + @
//+ E_NOTICE: "Notice: iconv(): Detected an illegal character in input string in /var/www/test.server.web/index.php on line 54"
// translit/ignore?
echo '<br />translit/ignore: ' . iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", $text);
//same as translit only> translit/ignore: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @
Using this guy's solution also does not work as wanted: Regular ascii text + YYYYY + aous + eYYYeY + aoY + $ + � + @
Even using the PECL intl Normalizer class (which is not always available even if you have PHP > 5.3.0, since the ICU library that intl depends on may not be available to PHP, e.g. on certain hosting servers) produces the wrong result:
echo '<br />normalize: ' .preg_replace('/\p{Mn}/u', '', Normalizer::normalize($text, Normalizer::FORM_KD));
//>normalize: Regular ascii text + cczsđ + aouß + eeeeee + æø€ + $ + ¶ + @
So is there any other way of doing this right, or is the only proper thing to do to use preg_replace() or str_replace() and define the transliteration tables yourself?
// appendix: I found a debate from 2008 on the ZF wiki about a proposal for Zend_Filter_Transliterate, but the project was dropped because in some languages conversion is not possible (e.g. Chinese). Still, for any Latin- and Cyrillic-based language, IMO this option should exist.
The toAscii() function of Patchwork\Utf8 does exactly this, see:
https://github.com/nicolas-grekas/Patchwork-UTF8/blob/master/src/Patchwork/Utf8.php
It leverages iconv and intl's Normalizer to remove accents, split ligatures and do many other generic transliterations.
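
The general idea of combining Normalizer with a hand-written table for the letters that have no Unicode decomposition (đ, ø, æ, ß, € in the question's test string) can be sketched as follows. This is only an illustration under stated assumptions, not Patchwork's actual implementation: `to_ascii_sketch` is a hypothetical name, the table is deliberately tiny, and the Normalizer step is skipped when the intl extension is missing.

```php
<?php
// Hypothetical helper, not part of PHP, intl, or Patchwork\Utf8.
function to_ascii_sketch($text)
{
    // Letters without a Unicode decomposition have to be mapped by hand.
    // This table is illustrative and far from complete.
    $table = array(
        'đ' => 'd',  'Đ' => 'D',
        'ø' => 'o',  'Ø' => 'O',
        'æ' => 'ae', 'Æ' => 'AE',
        'ß' => 'ss', '€' => 'EUR',
    );
    $text = strtr($text, $table);

    // With intl installed, decompose accented letters (NFKD) and strip
    // the combining marks, as in the Normalizer attempt in the question.
    if (class_exists('Normalizer')) {
        $decomposed = Normalizer::normalize($text, Normalizer::FORM_KD);
        if (is_string($decomposed)) {
            $stripped = preg_replace('/\p{Mn}/u', '', $decomposed);
            if ($stripped !== null) {
                $text = $stripped;
            }
        }
    }

    // Whatever is still outside ASCII becomes "?".
    $ascii = preg_replace('/[^\x00-\x7F]/u', '?', $text);
    return $ascii === null ? '' : $ascii;
}

echo to_ascii_sketch('čćžšđ + äöüß + æø€');
```

Running strtr() before normalization keeps the hand-mapped letters independent of intl, so only the accented letters rely on the extension being present.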