php regex unicode word-boundary named-captures

preg_replace_callback: including curly braces in a pattern: { is captured, } isn't


I have this function, which uses preg_replace_callback to split a sentence into "chains" of blocks belonging to different categories (alphabetic, Han characters, everything else).

The function also tries to treat the characters ', {, and } as "alphabetic":

function String_SplitSentence($string)
{
 $res = array();

 preg_replace_callback("~\b(?<han>\p{Han}+)\b|\b(?<alpha>[a-zA-Z0-9{}']+)\b|(?<other>[^\p{Han}A-Za-z0-9\s]+)~su",
 function($m) use (&$res) 
 {
 if (!empty($m["han"])) 
 {
  $t = array("type" => "han", "text" => $m["han"]);
  array_push($res,$t);
 }
 else if (!empty($m["alpha"])) 
 {
  $t = array("type" => "alpha", "text" => $m["alpha"]);
  array_push($res, $t);
 }
 else  if (!empty($m["other"])) 
 {
  $t = array("type" => "other", "text" => $m["other"]);
  array_push($res, $t);
 }
 },
 $string);

 return $res;
}

However, something seems to be wrong with the curly braces.

print_r(String_SplitSentence("Many cats{1}, several rats{2}"));

As can be seen in the output, the function treats { as an alphabetic character, as indicated, but stops at } and treats it as "other" instead.

Array
(
    [0] => Array
        (
            [type] => alpha
            [text] => Many
        )

    [1] => Array
        (
            [type] => alpha
            [text] => cats{1
        )

    [2] => Array
        (
            [type] => other
            [text] => },
        )

    [3] => Array
        (
            [type] => alpha
            [text] => several
        )

    [4] => Array
        (
            [type] => alpha
            [text] => rats{2
        )

    [5] => Array
        (
            [type] => other
            [text] => }
        )

)

What am I doing wrong?


Solution

  • I can't be completely sure, because your sample input doesn't contain any Chinese characters and I don't know what kind of edge cases you may be trying to process, but this is how I would write the pattern:

    ~(?<han>\p{Han}+)|(?<alpha>[a-z\d{}']+)|(?<other>\S+)~ui
    

    The trouble with \b is that it only matches at a position where a \w character sits next to a non-\w character (or next to the start or end of the string). \w covers letters, digits, and the underscore; braces and apostrophes are not part of it. Reference: https://stackoverflow.com/a/11874899/2943403

    Also, your pattern doesn't contain any dots (.), so the s pattern modifier has no effect and can be removed.


    Your function call also misuses preg_replace_callback(): you aren't actually replacing anything, so it is the wrong tool for the job. Consider this rewrite with preg_match_all() instead:

    function String_SplitSentence($string){
        $res = [];  // initialise up front so the function always returns an array
        if(!preg_match_all("~(?<han>\p{Han}+)|(?<alpha>[a-z\d{}']+)|(?<other>\S+)~ui",$string,$out)){
            return $res;  // no matches; or return $string or false if that suits you better
        }
        foreach($out as $group_key=>$group){
            if(is_numeric($group_key)){  // disregard the indexed groups (which are unavoidably generated)
                continue;
            }
            foreach($group as $i=>$v){
                if(strlen($v)){  // only the named group that actually matched holds a non-empty string
                    $res[$i]=['type'=>$group_key,'text'=>$v];
                }
            }
        }
        ksort($res);  // entries were added group by group, so restore the original match order
        return $res;
    }
    

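    A demonstration of your pattern: https://regex101.com/r/6EUaSM/1

    The \b after your character class was fouling it all up. } is not included in the \w class, so there is no word boundary between } and the comma that follows it (or between } and the end of the string). The character class matched greedily up to and including }, the \b then failed at that position, and the engine backtracked to the last place where a boundary does exist: between the digit and }. That is why } was left out of the alpha match and picked up by the "other" group instead.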
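
The word-boundary effect described above can be reproduced in isolation. The following is a minimal sketch, not part of the original answer: the input string and the variable names ($input, $withBoundary, $withoutBoundary) are chosen purely for illustration, and the pattern is reduced to the alphabetic character class, with and without the trailing \b.

```php
<?php
// Illustrative input (an assumption): a word followed by a braced number, then a comma.
$input = "cats{1}, rats";

// Trailing \b: the match must end where a \w character meets a non-\w character
// (or the end of the string). There is no such boundary between "}" and ",".
preg_match_all("~[a-z\d{}']+\b~i", $input, $withBoundary);

// No \b: the character class is free to consume the closing brace.
preg_match_all("~[a-z\d{}']+~i", $input, $withoutBoundary);

print_r($withBoundary[0]);     // [0] => cats{1   [1] => rats
print_r($withoutBoundary[0]);  // [0] => cats{1}  [1] => rats
```

In the first call the engine initially consumes cats{1}, finds no boundary after }, and gives back one character so that the match can end between 1 and }. The leftover } cannot satisfy \b on its own, so it is skipped entirely here; in the full pattern from the question it falls through to the "other" group.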