I'm trying to do a transliteration in PHP. I need to convert all non-Latin characters but keep the Italian accented characters (àèìòù).
PHP's Transliterator lacks documentation and online examples.
I've read the ICU docs and I know there is a rule syntax that forces Transliterator to convert one character into another that we specify (à > b).
The code (using the create function)
$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
$transliterator = Transliterator::create("Any-Latin; Latin-ASCII");
echo $transliterator->transliterate($str);
converts all non-Latin characters into Latin (including all the accented characters) and gives the result
ASAaeiou Chen Hai yao Munchen Faisst Finis guo nei - jing xiang
and the code (using the createFromRules function)
$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
$transliterator = Transliterator::createFromRules("á>b");
echo $transliterator->transliterate($str);
correctly forces the conversion of à into b but, obviously, without the Any-Latin; Latin-ASCII conversion done by the previous code, giving the result
AŠAbèìòù Chén Hǎi ybo München Faißt Финиш 国内 - 镜像
So my goal is to merge the Any-Latin; Latin-ASCII conversion with the à > à rule (and the same for the other Italian accented vowels), in order to tell Transliterator to convert all non-Latin characters to Latin but map the Italian accented vowels to themselves, with the following result:
ASAàèìòù Chen Hai yao Munchen Faisst Finis guo nei - jing xiang
Is there a way to put the à > à rule in the create function's parameter, or to add the Any-Latin; Latin-ASCII directive to the createFromRules function's parameter?
Given your example with input and output:
$transliterator = Transliterator::create("Any-Latin; Latin-ASCII");
$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
echo $transliterator->transliterate($str), "\n";
ASAaeiou Chen Hai yao Munchen Faisst Finis guo nei - jing xiang
Applying the transliteration only to the segments that do not match the characters you want to keep (the Italian accented characters [àèìòù]) should give the desired result.
One option is to use preg_replace_callback for that.
It requires a callback that applies the transliteration:
$transliterate = static function (array $match) use ($transliterator) {
    return $transliterator->transliterate($match[0]);
};
And it requires a pattern that matches everything but the characters to keep. The pattern must be properly defined and Unicode-compatible:
([^\xE0\xE8\xEC\xF2\xF9]+)ui
(...)                : delimiters: the regular expression is inside
u                    : modifier: u - Unicode mode (UTF-8 encoding in PHP, PCRE_UTF8)
i                    : modifier: i - letters in the pattern match both upper and lower case letters (PCRE_CASELESS)
[^...]               : character class: not matching any of the characters (`^`); negated character class
\xE0\xE8\xEC\xF2\xF9 : the Italian accented characters àèìòù written in a stable notation (you can easily copy and paste it, for example)
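To see which segments the pattern actually picks out, here is a minimal sketch that uses only PCRE (no intl needed). The sample string is an assumption made up for illustration; it is written with \u{...} escapes so the file encoding does not matter.

```php
<?php
// Pattern from above: one or more characters that are NOT à è ì ò ù
// (and, because of the i modifier, not À È Ì Ò Ù either).
$pattern = '([^\xE0\xE8\xEC\xF2\xF9]+)ui';

// Illustrative sample: "AŠAàè Chén" followed by an upper-case "À".
$sample = "A\u{0160}A\u{00E0}\u{00E8} Ch\u{00E9}n\u{00C0}";

preg_match_all($pattern, $sample, $matches);

// $matches[0] holds the segments that would be handed to the transliterator;
// the kept characters never appear in any of them.
var_dump($matches[0]);
```

The segments listed in $matches[0] are exactly the pieces that the callback receives one at a time.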
Last but not least, the subject to operate on must be compatible with the characters to keep. Because the same character can be written in several ways in Unicode, the input is normalized so that it uses the same form as the PCRE pattern:
echo preg_replace_callback(
    '([^\xE0\xE8\xEC\xF2\xF9]+)ui',
    $transliterate,
    Normalizer::normalize($str, Normalizer::NFC)
), "\n";
The output:
ASAàèìòù Chen Hai yao Munchen Faisst Finis guo nei - jing xiang
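Why the normalization step matters can be shown with a small sketch. It assumes the intl extension is loaded; the two spellings of à below are illustrative inputs, not taken from the question.

```php
<?php
// Requires the intl extension (Normalizer).
$precomposed = "\u{00E0}";   // à as a single code point (U+00E0)
$decomposed  = "a\u{0300}";  // a followed by U+0300 COMBINING GRAVE ACCENT

$pattern = '([^\xE0\xE8\xEC\xF2\xF9]+)ui';

// The precomposed form is excluded by the negated class, so nothing matches.
$matchPrecomposed = preg_match($pattern, $precomposed);

// The decomposed form is NOT excluded: "a" and U+0300 both match, so this
// spelling would be passed to the transliterator instead of being kept.
$matchDecomposed = preg_match($pattern, $decomposed);

// After NFC normalization both spellings are the same string again.
$normalized = Normalizer::normalize($decomposed, Normalizer::NFC);
$matchNormalized = preg_match($pattern, $normalized);

var_dump($matchPrecomposed, $matchDecomposed, $normalized === $precomposed, $matchNormalized);
```

Without the Normalizer call, input that arrives in decomposed form would slip past the character class even though it looks identical on screen.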
Addendum:
\xE0\xE1\xE8\xE9\xEC\xED\xF2\xF3\xF9\xFA
    lower-case list of Italian accented characters (can be used with the i modifier)
\xC0\xC1\xC8\xC9\xCC\xCD\xD2\xD3\xD9\xDA\xE0\xE1\xE8\xE9\xEC\xED\xF2\xF3\xF9\xFA
    lower- and upper-case list of Italian accented characters (can be used without the i modifier)
\xhh       character with hex code hh
\x{hhh..}  character with hex code hhh..
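If you want to switch between these lists, the pieces can be wrapped into one function. This is only a sketch: transliterateExcept and its parameters are hypothetical names introduced here, not part of PHP or intl, and it assumes the intl extension is loaded.

```php
<?php
// Requires the intl extension (Transliterator, Normalizer).

/**
 * Hypothetical helper: transliterate everything except the characters
 * listed in $keepClass, which is the body of a PCRE character class
 * such as '\xE0\xE8\xEC\xF2\xF9'.
 */
function transliterateExcept(string $str, string $keepClass, bool $caseless = true): string
{
    $transliterator = Transliterator::create('Any-Latin; Latin-ASCII');
    if ($transliterator === null) {
        throw new RuntimeException(intl_get_error_message());
    }

    // Negated class: match runs of characters that are not in the keep list.
    $pattern = '([^' . $keepClass . ']+)u' . ($caseless ? 'i' : '');

    // Normalize to NFC so precomposed characters line up with the pattern.
    $normalized = Normalizer::normalize($str, Normalizer::NFC);
    if ($normalized === false) {
        throw new RuntimeException('Input could not be normalized');
    }

    $result = preg_replace_callback(
        $pattern,
        static fn (array $m): string => (string) $transliterator->transliterate($m[0]),
        $normalized
    );
    if ($result === null) {
        throw new RuntimeException(preg_last_error_msg());
    }
    return $result;
}

// Usage with the lower-case list and the i modifier (illustrative input "Città München"):
$out = transliterateExcept("Citt\u{00E0} M\u{00FC}nchen", '\xE0\xE1\xE8\xE9\xEC\xED\xF2\xF3\xF9\xFA');
echo $out, "\n";
```

Passing the keep list as a class body keeps the \xhh notation intact; note that preg_last_error_msg() needs PHP 8.0 or later.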