Split text into words, treating numbers with decimals as a single "word"


I am trying to split the text into words:

$delimiterList = array(" ", ".", "-", ",", ";", "_", ":",
           "!", "?", "/", "(", ")", "[", "]", "{", "}", "<", ">", "\r", "\n",
           '"');
$words = mb_split($delimiterList, $string);

This works fine for plain text, but it breaks down when the text contains numbers.

For example, take the text "Look at this.My score is 3.14, and I am happy about it.". The resulting array is:

[0]=>Look,
[1]=>at,
[2]=>this,
[3]=>My,
[4]=>score,
[5]=>is,
[6]=>3,
[7]=>14,
[8]=>and, ....

Here 3.14 is split into 3 and 14, which is not what I want. A period should separate two words, but not the two parts of a number. The result should be:

[0]=>Look,
[1]=>at,
[2]=>this,
[3]=>My,
[4]=>score,
[5]=>is,
[6]=>3.14,
[7]=>and, ....

But I have no idea how to handle these cases!
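
For anyone reproducing the problem: `mb_split` takes a single pattern string rather than an array, so the delimiter list has to be turned into one pattern first. The following is only a minimal sketch of one way to do that. It swaps in `preg_split` with `preg_quote` instead of `mb_split`, and the variable names `$quoted` and `$pattern` are assumptions for illustration, not part of the original code. It reproduces the unwanted split of 3.14:

```php
<?php
$string = "Look at this.My score is 3.14, and I am happy about it.";

$delimiterList = array(" ", ".", "-", ",", ";", "_", ":",
    "!", "?", "/", "(", ")", "[", "]", "{", "}", "<", ">", "\r", "\n", '"');

// Escape every delimiter so regex metacharacters such as "." or "(" are taken literally.
$quoted = array_map(function ($d) {
    return preg_quote($d, '/');
}, $delimiterList);

// Join the escaped delimiters into a single alternation pattern.
$pattern = '/(?:' . implode('|', $quoted) . ')/u';

// PREG_SPLIT_NO_EMPTY drops the empty pieces left between adjacent delimiters.
$words = preg_split($pattern, $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
```

Because "." is treated as an unconditional delimiter here, the result contains "3" and "14" as separate entries, which is exactly the behaviour described above.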


Solution

  • Or use regex :)

    <?php
    $str = "Look at this.My score is 3.14, and I am happy about it.";
    
    // alternative to handle Marko's example (updated)
    // /([\s_;?!\/\(\)\[\]{}<>\r\n"]|\.$|(?<=\D)[:,.\-]|[:,.\-](?=\D))/
    
    // -1 means "no limit"; passing null here is deprecated as of PHP 8.1
    var_dump(preg_split('/([\s\-_,:;?!\/\(\)\[\]{}<>\r\n"]|(?<!\d)\.(?!\d))/',
                        $str, -1, PREG_SPLIT_NO_EMPTY));
    
    // Output:
    array(13) {
      [0]=>
      string(4) "Look"
      [1]=>
      string(2) "at"
      [2]=>
      string(4) "this"
      [3]=>
      string(2) "My"
      [4]=>
      string(5) "score"
      [5]=>
      string(2) "is"
      [6]=>
      string(4) "3.14"
      [7]=>
      string(3) "and"
      [8]=>
      string(1) "I"
      [9]=>
      string(2) "am"
      [10]=>
      string(5) "happy"
      [11]=>
      string(5) "about"
      [12]=>
      string(2) "it"
    }
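
The two patterns in the answer differ on one edge case that the sample sentence does not exercise: a whole number directly followed by a sentence-ending period. The sketch below compares them on such a sentence. The helper `splitWords` and the sentence in `$sample` are made up for illustration; only the two patterns come from the answer.

```php
<?php
// Hypothetical helper: split $text on $pattern and drop empty pieces.
function splitWords(string $pattern, string $text): array
{
    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
}

// The two patterns quoted in the answer.
$main = '/([\s\-_,:;?!\/\(\)\[\]{}<>\r\n"]|(?<!\d)\.(?!\d))/';
$alt  = '/([\s_;?!\/\(\)\[\]{}<>\r\n"]|\.$|(?<=\D)[:,.\-]|[:,.\-](?=\D))/';

// Assumed sample: "3." ends a sentence, "2.5" is a decimal number.
$sample = "My score is 3. Yours is 2.5, right?";

$withMain = splitWords($main, $sample);
$withAlt  = splitWords($alt, $sample);

print_r($withMain); // the period after "3" stays attached: "3."
print_r($withAlt);  // the period after "3" is treated as a delimiter: "3"
```

The first pattern refuses to split on any period that touches a digit on either side, so "3." keeps its trailing period. The commented-out alternative splits on `:`, `,`, `.` and `-` unless they sit between two digits, which also means strings such as "12:30" or "1,000" would stay in one piece.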