php unicode split cjk delimited

Split string before qualifying substrings containing a Japanese character


How can I split this line:

我 [wǒ] - (pronoun) I or me 你 [nǐ] - (pronoun) you (second person singular); yourself 他 [tā] - (pronoun) he or him

into three lines like this:

我 [wǒ] - (pronoun) I or me

你 [nǐ] - (pronoun) you (second person singular); yourself

他 [tā] - (pronoun) he or him

Ultimately, I plan to insert a <br /> tag after each line.


Solution

  • The only clear pattern we can see since you removed the dots is "a foreign character, a space, and an opening bracket".

    Let's focus on that:

    <?php
    
    $string = "我 [wǒ] - (pronoun) I or me 你 [nǐ] - (pronoun) you (second person singular); yourself 他 [tā] - (pronoun) he or him";
    
    $result = preg_replace('/(. \[)/u', // "any char, a space, then [", with the 'u' flag so the pattern and subject are treated as UTF-8
                           '<br/>$1',   // replace it with a line-break tag followed by a back reference to the match
                           $string);
    
    echo $result;
    

    Note that with this approach the line breaks are placed at the beginning of each line, including the first one. Don't forget the 'u' (UTF-8) flag, and use UTF-8 everywhere in your application, or string processing will be a mess.
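    To see why the 'u' flag matters, here is a small sketch (not part of the original answer) that runs the same replacement with and without it. It assumes the source file itself is saved as UTF-8, and it uses the common trick of matching an empty pattern with 'u' to check whether a string is still valid UTF-8.

    ```php
    <?php

    $string = "我 [wǒ] - (pronoun) I or me";

    // Without 'u', "." matches a single byte, so the tag is inserted
    // inside the three-byte sequence that encodes "我".
    $broken = preg_replace('/(. \[)/', '<br/>$1', $string);

    // With 'u', "." matches the whole character.
    $intact = preg_replace('/(. \[)/u', '<br/>$1', $string);

    // preg_match() with an empty 'u' pattern returns 1 on valid UTF-8
    // and false when the subject is not valid UTF-8.
    var_dump(preg_match('//u', $broken) === 1); // bool(false)
    var_dump(preg_match('//u', $intact) === 1); // bool(true)
    ```

    The first result is a byte string that browsers will typically render with replacement characters, which is the "mess" mentioned above.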

    EDIT: if you want the line break only at the beginning of the last two lines (that is, not before the first one), you can use a negative lookbehind:

    $string = "我 [wǒ] - (pronoun) I or me 你 [nǐ] - (pronoun) you (second person singular); yourself 他 [tā] - (pronoun) he or him";
    
    // the same pattern, but skipping a match that begins at the very start of the string (the "^" position)
    $result = preg_replace('/(?<!^)(. \[)/u',
                           '<br/>$1', 
                            $string);
    
    echo $result;
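
    If the tag should instead come after each line, as the question asks, one option is to split the string first and then append the tag to every piece. The sketch below is only an illustration under stated assumptions: it narrows "a foreign character" to \p{Han} (PCRE's Unicode script property for Chinese characters, which is a guess about the real input), and splitEntries is a hypothetical helper name, not something from the code above.

    ```php
    <?php

    // Hypothetical helper: splits the text before every "Han character, space, [".
    function splitEntries(string $text): array
    {
        // Split on the whitespace that precedes the pattern. The lookahead
        // does not consume the character, so it stays in the next entry.
        $parts = preg_split('/\s+(?=\p{Han} \[)/u', $text);

        // preg_split() returns false on failure; fall back to a single entry.
        return $parts === false ? [$text] : $parts;
    }

    $string = "我 [wǒ] - (pronoun) I or me 你 [nǐ] - (pronoun) you (second person singular); yourself 他 [tā] - (pronoun) he or him";

    $lines = splitEntries($string);

    // Append a <br /> tag after every entry.
    $html = '';
    foreach ($lines as $line) {
        $html .= $line . "<br />\n";
    }

    echo $html;
    ```

    Splitting into an array first also leaves the individual entries available for further processing, for example escaping each one with htmlspecialchars() before output.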