php arrays string split multibyte-characters

Split string of Croatian letters into an array of letters - accounting for double character letters


I need to split a string into an array of letters. The problem is that my language (Croatian) also has letters written with two characters (e.g. lj, nj, dž).

So a string such as ljubičicajecvijet should be split into an array that looks like this:

Array
(
    [0] => lj
    [1] => u
    [2] => b
    [3] => i
    [4] => č
    [5] => i
    [6] => c
    [7] => a
    [8] => j
    [9] => e
    [10] => c
    [11] => v
    [12] => i
    [13] => j
    [14] => e
    [15] => t
)

Here is the list of Croatian letters as an array (I included the English letters as well).

$alphabet = array(
    'a', 'b', 'c',
    'č', 'ć', 'd',
    'dž', 'đ', 'e',
    'f', 'g', 'h',
    'i', 'j', 'k',
    'l', 'lj', 'm',
    'n', 'nj', 'o',
    'p', 'q', 'r',
    's', 'š', 't',
    'u', 'v', 'w',
    'x', 'y', 'z', 'ž'
);

Solution

  • You can use this kind of solution:

    Data:

    $text = 'ljubičicajecviježdžt';
    
    $alphabet = [
                'a', 'b', 'c',
                'č', 'ć', 'd',
                'dž', 'đ', 'e',
                'f', 'g', 'h',
                'i', 'j', 'k',
                'l', 'lj', 'm',
                'n', 'nj', 'o',
                'p', 'q', 'r',
                's', 'š', 't',
                'u', 'v', 'w',
                'x', 'y', 'z', 'ž'
    ];
    

    1. Sort the alphabet by length, longest first, so that the two-character letters come before the single characters they start with

    // 2 letters first, then byte order within the same length
    usort($alphabet, function ($a, $b) {
        // usort() expects an integer (<0, 0, >0); returning a boolean is deprecated in PHP 8
        return (mb_strlen($b) <=> mb_strlen($a)) ?: strcmp($a, $b);
    });
    
    var_dump($alphabet);
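
The ordering matters because PCRE tries the alternatives of a pattern from left to right and takes the first one that matches at the current position. A minimal sketch of the effect, using a made-up sample word and a trailing `.` as a catch-all (neither is part of the answer's code):

```php
<?php
// Hypothetical demo word; any word starting with a digraph would do.
$word = 'ljubav';

// Single letter listed first: 'l' matches at position 0, so 'lj' is never tried.
preg_match_all('/l|lj|./u', $word, $shortFirst);

// Digraph listed first: 'lj' is consumed as one unit.
preg_match_all('/lj|l|./u', $word, $longFirst);

echo implode(' ', $shortFirst[0]), "\n"; // l j u b a v
echo implode(' ', $longFirst[0]), "\n";  // lj u b a v
```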
    

    2. Finally, split. I used preg_split, with preg_quote to escape any characters that have a special meaning in a regular expression.

    // split
    $alphabet = array_map('preg_quote', $alphabet); // escape regex metacharacters before building the pattern
    $pattern = implode('|', $alphabet); // 'dž|lj|nj|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z|ć|č|đ|š|ž'
    
    var_dump($pattern);
    
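    // 'u' treats pattern and subject as UTF-8, so 'i' also covers Č, Š, Ž etc.; -1 means "no limit" (null is deprecated since PHP 8.1)
    var_dump( preg_split('`(' . $pattern . ')`iu', $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY) );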
    

    And the result :)

    array (size=18)
      0 => string 'lj' (length=2)
      1 => string 'u' (length=1)
      2 => string 'b' (length=1)
      3 => string 'i' (length=1)
      4 => string 'č' (length=2)
      5 => string 'i' (length=1)
      6 => string 'c' (length=1)
      7 => string 'a' (length=1)
      8 => string 'j' (length=1)
      9 => string 'e' (length=1)
      10 => string 'c' (length=1)
      11 => string 'v' (length=1)
      12 => string 'i' (length=1)
      13 => string 'j' (length=1)
      14 => string 'e' (length=1)
      15 => string 'ž' (length=2)
      16 => string 'dž' (length=3)
      17 => string 't' (length=1)
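
For reuse, the same idea can be wrapped in a function. The following is only a sketch of one possible variant, not the code above: the name `splitCroatianLetters` is hypothetical, it uses `preg_match_all` instead of `preg_split`, and it lists only the three digraphs, letting `.` pick up every other character one at a time. It assumes PHP 7.4 or later and UTF-8 input.

```php
<?php
// Hypothetical helper, not part of the original answer.
function splitCroatianLetters(string $text): array
{
    // Digraphs are listed before the catch-all so they win over their first letter.
    $digraphs = ['dž', 'lj', 'nj'];

    // Quote each alternative for the '/' delimiter used below.
    $quoted = array_map(fn (string $d): string => preg_quote($d, '/'), $digraphs);

    // 'u': UTF-8 mode, 'i': case-insensitive, 's': let '.' match newlines too.
    $pattern = '/' . implode('|', $quoted) . '|./isu';

    preg_match_all($pattern, $text, $matches);

    // On invalid UTF-8 the match fails and $matches stays empty.
    return $matches[0] ?? [];
}

print_r(splitCroatianLetters('ljubičicajecviježdžt'));
```

Called with the question's string ljubičicajecvijet, this returns the 16 letters listed in the question. A pattern alone cannot tell a real digraph from two letters that merely sit next to each other, as in injekcija (n + j) or nadživjeti (d + ž); handling those words would need a dictionary of exceptions.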