php transliteration

Exclude specific characters from Transliterator conversion


I'm trying to transliterate text in PHP: all non-Latin characters should be converted, but the Italian accented characters (àèìòù) must be kept.

PHP's Transliterator lacks documentation and online examples. I've read the ICU docs, and I know there is a rule syntax that forces Transliterator to convert one character into another of our choosing (à > b).

The code (using the create function)

$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
$transliterator = Transliterator::create("Any-Latin; Latin-ASCII");
echo $transliterator->transliterate($str);

converts all non-Latin characters into Latin (including all the accented characters) and gives the result

ASAaeiou Chen Hai yao Munchen Faisst Finis guo nei - jing xiang

and the code (using the createFromRules function)

$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
$transliterator = Transliterator::createFromRules("á>b");
echo $transliterator->transliterate($str);

correctly forces the conversion of à into b but, obviously, without the Any-Latin; Latin-ASCII conversion performed by the previous code, giving the result

AŠAbèìòù Chén Hǎi ybo München Faißt Финиш 国内 - 镜像

So my goal is to merge the Any-Latin; Latin-ASCII conversion with the à > à rule (and the same for the other Italian accented vowels), in order to tell Transliterator to convert all non-Latin characters to Latin but map the Italian accented vowels to themselves, with the following result:

ASAàèìòù Chen Hai yao Munchen Faisst Finis guo nei - jing xiang

Is there a way to put the à>à rule in the create function's parameter or add the Any-Latin; Latin-ASCII directive in the createFromRules function's parameter?


Solution

  • Given your example with input and output:

    $transliterator = Transliterator::create("Any-Latin; Latin-ASCII");
    $str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
    echo $transliterator->transliterate($str), "\n";
    
    ASAaeiou Chen Hai yao Munchen Faisst Finis guo nei - jing xiang
    

    Applying the transliteration only to the segments that do not contain the characters you want to keep (the Italian accented characters [àèìòù]) should give the desired result.

    One option is to use preg_replace_callback for that.

    This requires a callback that applies the transliteration:

    $transliterate = static function (array $match) use ($transliterator) {
        return $transliterator->transliterate($match[0]);
    };
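
    To see what the callback does in isolation, here is a minimal sketch that calls it by hand with an array shaped like the one preg_replace_callback passes (the full match at index 0). The variable $demo is only an illustrative name, and the intl extension is assumed to be loaded:

```php
<?php
// Assumes the intl extension is available.
$transliterator = Transliterator::create('Any-Latin; Latin-ASCII');

// Same callback as above: it receives the match array and
// transliterates the full match stored at index 0.
$transliterate = static function (array $match) use ($transliterator) {
    return $transliterator->transliterate($match[0]);
};

// Simulate a single match, as preg_replace_callback would pass it.
$demo = $transliterate(['München']);
echo $demo, "\n"; // Munchen
```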
    

    It also requires a pattern that matches everything except the characters to keep. The pattern needs to be properly defined and Unicode-compatible:

    ([^\xE0\xE8\xEC\xF2\xF9]+)ui
    
    
    (...)                : delimiters: the regular expression is inside
    u                    : modifier: u - Unicode mode (UTF-8 encoding in
                           PHP, PCRE_UTF8)
    i                    : modifier: i - letters in the pattern match
                           both upper and lower case letters
                           (PCRE_CASELESS)
    
    [^...]               : character class: not matching any of the
                           characters (`^`); negated character class
    \xE0\xE8\xEC\xF2\xF9 : the Italian accented characters àèìòù written
                           in a stable notation (easy to copy and paste,
                           for example)
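
    As a quick check of which segments the pattern selects, the following sketch runs it through preg_match_all on a shortened sample string (the sample and the variable names are illustrative only). Note the single-quoted pattern: the \xE0 escapes then reach PCRE unchanged and are read as code points in u mode, whereas in a double-quoted string PHP itself would turn each one into a single raw byte, which is not valid UTF-8 on its own.

```php
<?php
// Single quotes: PCRE, not PHP, interprets the \xNN escapes.
$pattern = '([^\xE0\xE8\xEC\xF2\xF9]+)ui';

// Collect every run of characters that is NOT one of àèìòù.
preg_match_all($pattern, 'AŠAàèìòù Chén', $matches);

// The outer parentheses are delimiters, not a capture group,
// so only index 0 (the full matches) is populated.
var_export($matches[0]); // the two segments 'AŠA' and ' Chén'
echo "\n";
```

    Only these segments are handed to the callback; the run àèìòù between them is never matched and therefore stays as it is.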
    

    Last but not least, the subject to operate on must be compatible with the characters to keep. As there can be many ways to write the same character in Unicode, the input is normalized to be compatible with the PCRE pattern:

    echo preg_replace_callback(
        '([^\xE0\xE8\xEC\xF2\xF9]+)ui', 
        $transliterate, 
        Normalizer::normalize($str, Normalizer::NFC)
    ), "\n";
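
    Why the normalization matters can be sketched with a single character: à can be stored as one code point (U+00E0) or as a plain a followed by a combining grave accent (U+0300). Only the first form is listed in the pattern, so without NFC the decomposed form would be matched and transliterated. The variable names below are illustrative, and PHP 7+ is assumed for the \u{...} escapes:

```php
<?php
// Assumes the intl extension is available.
$pattern = '([^\xE0\xE8\xEC\xF2\xF9]+)ui';

$decomposed  = "a\u{0300}"; // "a" + combining grave accent
$precomposed = "\u{00E0}";  // single code point "à"

// Visually the same, but different byte sequences.
var_dump($decomposed === $precomposed); // bool(false)

// The decomposed form is matched by the pattern (so it would be
// transliterated), the NFC-normalized form is not.
$normalized = Normalizer::normalize($decomposed, Normalizer::NFC);
var_dump(preg_match($pattern, $decomposed)); // int(1)
var_dump(preg_match($pattern, $normalized)); // int(0)
```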
    

    The output:

    ASAàèìòù Chen Hai yao Munchen Faisst Finis guo nei - jing xiang
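
    Put together, the steps above could be wrapped in one helper. This is only a sketch: the function name transliterateKeepingItalianVowels is made up for illustration, the error handling is one possible choice, and the intl extension is assumed to be loaded.

```php
<?php
// Hypothetical helper combining normalization, pattern and callback.
function transliterateKeepingItalianVowels(string $input): string
{
    $transliterator = Transliterator::create('Any-Latin; Latin-ASCII');
    if ($transliterator === null) {
        throw new RuntimeException(intl_get_error_message());
    }

    // Normalize first so that accented vowels are single code points.
    $normalized = Normalizer::normalize($input, Normalizer::NFC);
    if ($normalized === false) {
        throw new RuntimeException('Input could not be normalized.');
    }

    // Transliterate only the runs that contain none of àèìòù.
    $result = preg_replace_callback(
        '([^\xE0\xE8\xEC\xF2\xF9]+)ui',
        static function (array $match) use ($transliterator) {
            return (string) $transliterator->transliterate($match[0]);
        },
        $normalized
    );
    if ($result === null) {
        throw new RuntimeException(preg_last_error_msg());
    }

    return $result;
}

echo transliterateKeepingItalianVowels('München àèìòù'), "\n"; // Munchen àèìòù
```

    Note that preg_last_error_msg() requires PHP 8; on older versions preg_last_error() could be used instead.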
    

    Example across PHP versions.


    Addendum: