I have this function, which uses preg_replace_callback to split a sentence into "chains" of blocks belonging to different categories (alphabetic, Han characters, everything else).
The function is also meant to treat the characters ', {, and } as "alphabetic":
function String_SplitSentence($string)
{
    $res = array();
    preg_replace_callback("~\b(?<han>\p{Han}+)\b|\b(?<alpha>[a-zA-Z0-9{}']+)\b|(?<other>[^\p{Han}A-Za-z0-9\s]+)~su",
        function($m) use (&$res)
        {
            if (!empty($m["han"]))
            {
                $t = array("type" => "han", "text" => $m["han"]);
                array_push($res, $t);
            }
            else if (!empty($m["alpha"]))
            {
                $t = array("type" => "alpha", "text" => $m["alpha"]);
                array_push($res, $t);
            }
            else if (!empty($m["other"]))
            {
                $t = array("type" => "other", "text" => $m["other"]);
                array_push($res, $t);
            }
        },
        $string);
    return $res;
}
However, something seems to be wrong with the curly braces.
print_r(String_SplitSentence("Many cats{1}, several rats{2}"));
As can be seen in the output, the function treats { as an alphabetic character, as indicated, but stops at } and treats it as "other" instead.
Array
(
    [0] => Array
        (
            [type] => alpha
            [text] => Many
        )

    [1] => Array
        (
            [type] => alpha
            [text] => cats{1
        )

    [2] => Array
        (
            [type] => other
            [text] => },
        )

    [3] => Array
        (
            [type] => alpha
            [text] => several
        )

    [4] => Array
        (
            [type] => alpha
            [text] => rats{2
        )

    [5] => Array
        (
            [type] => other
            [text] => }
        )

)
What am I doing wrong?
I can't be completely sure, because your sample input doesn't contain any Chinese characters and I don't know what kind of edge cases you may be trying to process, but this is how I would write the pattern:
~(?<han>\p{Han}+)|(?<alpha>[a-z\d{}']+)|(?<other>\S+)~ui
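The trouble with \b is that it is a word boundary: it only matches at a position where a \w character sits next to a non-\w character (or next to the start or end of the string). \w represents uppercase letters, lowercase letters, numbers, and underscores, so {, }, and ' do not count as word characters. Reference: https://stackoverflow.com/a/11874899/2943403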
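To make the boundary rule concrete, here is a minimal sketch that checks whether \b can match right after a } or right after a digit. The test strings are cut from your sample input, except 'cats{1}x', which is made up for contrast.

```php
<?php
// "}" followed by ",": both are non-word characters, so there is no boundary.
var_dump(preg_match('~}\b~', 'cats{1},'));  // int(0)

// "}" at the very end of the string: still no boundary, because "}" is not a word character.
var_dump(preg_match('~}\b~', 'rats{2}'));   // int(0)

// "}" followed by a letter: non-word meets word, so there is a boundary.
var_dump(preg_match('~}\b~', 'cats{1}x'));  // int(1)

// "1" followed by "}": word meets non-word, so there is a boundary.
var_dump(preg_match('~1\b~', 'cats{1},'));  // int(1)
```

The last case is the position the engine falls back to in your pattern, which is why the match ends at the digit.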
Also, your pattern doesn't include any dots (.), so you can remove the s pattern modifier.
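In case it helps, this small sketch shows what the s modifier actually changes: whether the dot matches a newline. The sample text "a\nb" is only an illustration.

```php
<?php
$text = "a\nb";

// Without s, "." does not match a newline.
var_dump(preg_match('~a.b~', $text));   // int(0)

// With s, "." matches a newline as well.
var_dump(preg_match('~a.b~s', $text));  // int(1)

// A pattern that contains no dot finds the same number of matches either way.
var_dump(preg_match_all('~\S+~', $text) === preg_match_all('~\S+~s', $text)); // bool(true)
```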
Also, your function call seems to be abusing preg_replace_callback(). I mean, you aren't actually replacing anything, so it is an inappropriate call. Perhaps you could consider this rewrite:
function String_SplitSentence($string){
    $res=[]; // initialise the result so the function always returns an array
    if(!preg_match_all("~(?<han>\p{Han}+)|(?<alpha>[a-z\d{}']+)|(?<other>\S+)~ui",$string,$out)){
        return []; // or $string or false
    }else{
        foreach($out as $group_key=>$group){
            if(!is_numeric($group_key)){ // disregard the indexed groups (which are unavoidably generated)
                foreach($group as $i=>$v){
                    if(strlen($v)){ // only store the value in the subarray that has a string length
                        $res[$i]=['type'=>$group_key,'text'=>$v];
                    }
                }
            }
        }
        ksort($res); // restore the original match order
        return $res;
    }
}
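One caveat with the \S+ branch: it is greedy, so once it starts on a punctuation mark it also swallows any letters that follow without a space ("cats,dogs" would give alpha "cats" and other ",dogs"). If that matters for your data, a possible variant is sketched below. The function name splitSentenceInOrder and the narrower "other" class are my own assumptions, not something from your code. It uses PREG_SET_ORDER so that the matches arrive in document order and no ksort() is needed.

```php
<?php
// Hypothetical variant: the name and the negated "other" class are assumptions.
function splitSentenceInOrder(string $string): array
{
    // "other" is limited to characters that belong to neither han nor alpha nor whitespace.
    $pattern = "~(?<han>\p{Han}+)|(?<alpha>[a-z\d{}']+)|(?<other>[^\p{Han}a-z\d{}'\s]+)~ui";
    $res = [];
    if (preg_match_all($pattern, $string, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // Exactly one named group is non-empty per match; trailing unmatched groups may be absent.
            foreach (['han', 'alpha', 'other'] as $type) {
                if (isset($m[$type]) && $m[$type] !== '') {
                    $res[] = ['type' => $type, 'text' => $m[$type]];
                    break;
                }
            }
        }
    }
    return $res;
}

print_r(splitSentenceInOrder("Many cats{1}, several rats{2}"));
```

For your sample input this gives five entries: four alpha blocks (with the closing braces attached) and one other block holding the comma.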
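A demonstration of your pattern: https://regex101.com/r/6EUaSM/1
The \b after your character class was fouling it all up. } is not included in the \w class, so there is no word boundary between } and the comma (or the end of the string) that follows it. Regex wants to do a good job for you: it captured "greedily" until it couldn't anymore, then backtracked to the last position where \b could match, which is just before the }. That is why the } was getting left out.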
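You can see the effect in isolation with this short sketch, which runs only your alpha character class against your sample input, once with the word boundaries and once without. The variable names are mine.

```php
<?php
$input = "Many cats{1}, several rats{2}";

// Your alpha branch as written, with \b on both sides.
preg_match_all("~\b[a-zA-Z0-9{}']+\b~", $input, $withBoundary);

// The same character class without the word boundaries.
preg_match_all("~[a-zA-Z0-9{}']+~", $input, $withoutBoundary);

print_r($withBoundary[0]);    // Many, cats{1, several, rats{2
print_r($withoutBoundary[0]); // Many, cats{1}, several, rats{2}
```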