I'm using PHP 5.3 and I want to count the words in some text for validation. My problem is that the JavaScript function I use to validate the text returns a different word count than the PHP code does.
Here is the php code:
//trim it
$text = strip_tags(html_entity_decode($text,ENT_QUOTES));
// replace numbers with X
$text = preg_replace('/\d/', 'X', $text);
// remove ./,/-/&
$text = str_replace(array('.',',','-','&'), '', $text);
// number of words
$count = str_word_count($text);
I noticed that with PHP 5.5 I get the correct word count, but with PHP 5.3 I do not. I searched for this and found this link (http://grokbase.com/t/php/php-bugs/12c14e0y6q/php-bug-bug-63663-new-str-word-count-does-not-properly-handle-non-latin-characters), which describes a bug in PHP 5.3 where str_word_count does not properly handle non-Latin characters. I tried to work around it with this code:
// remove non-ASCII characters (anything outside \x20-\x7F)
$text = preg_replace('/[^(\x20-\x7F)]*/','', $text);
But I still didn't get the right result. The word count was usually close to the expected value and sometimes exact, but it was often wrong.
I decided to write my own PHP word count to work around the bug. Here is the code:
//trim it
$text = strip_tags(html_entity_decode($text,ENT_QUOTES));
// replace one or more consecutive line breaks with a single space
$text = preg_replace("/[\n]+/", " ", $text);
// replace one or more consecutive whitespace characters with a separator string (@SEPARATOR@)
$text = preg_replace("/[\s]+/", "@SEPARATOR@", $text);
// explode the separator string (@SEPARATOR@) and get the array
$text_array = explode('@SEPARATOR@', $text);
// count the array elements, i.e. the words
$count = count($text_array);
// check if the last key of the array is empty and decrease the count by one
$last_key = end($text_array);
if (empty($last_key)) {
    $count--;
}
This last version works fine for me, and I would like to ask two questions:
Assuming you are asking how to keep using str_word_count, you could try: preg_replace('/[^a-zA-Z0-9\s]/','',$string)
after you have already replaced any punctuation. Without a test string that is known to fail, I had no way to try this out, but it is at least something you can test yourself.
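As a rough sketch of that suggestion, the question's first snippet could be wrapped in a function with the extra preg_replace added as a last step before str_word_count. The function name count_words_ascii is made up for illustration. One caveat under this approach: every non-ASCII byte is deleted, so a word written entirely in non-Latin characters disappears and is not counted, which may still disagree with the JavaScript count.

```php
<?php
// Hypothetical helper, not from the original post: strips the text down to
// ASCII letters, digits and whitespace before calling str_word_count().
function count_words_ascii($text)
{
    // Decode entities and drop HTML tags, as in the question's code.
    $text = strip_tags(html_entity_decode($text, ENT_QUOTES));
    // Replace digits with X so that numbers count as words.
    $text = preg_replace('/\d/', 'X', $text);
    // Remove the punctuation the question already removes.
    $text = str_replace(array('.', ',', '-', '&'), '', $text);
    // Suggested extra step: delete everything that is not an ASCII
    // letter, digit or whitespace character.
    $text = preg_replace('/[^a-zA-Z0-9\s]/', '', $text);
    return str_word_count($text);
}
```

For example, count_words_ascii('<p>Hello, world 42</p>') returns 3, because "42" becomes "XX" before counting.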
One improvement would be to actually trim the text. The first comment mentions trimming, but that first line only removes HTML tags. Add a trim() call,
then you can remove the last part.
Change the first two lines to:
//trim it & remove tags
$text = trim(strip_tags(html_entity_decode($text,ENT_QUOTES)));
Remove:
// check if the last key of the array is empty and decrease the count by one
$last_key = end($text_array);
if (empty($last_key)) {
    $count--;
}
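Two edge cases are worth keeping in mind with the separator approach. An empty string still explodes into an array with one empty element, so it counts as 1 once the final check is removed, and empty() treats the string "0" as empty, so the original check undercounts text that ends in the word "0". A minimal alternative sketch, assuming a "word" is simply any run of non-whitespace characters, is to let preg_split discard empty pieces itself. The function name count_words_by_whitespace is hypothetical.

```php
<?php
// Hypothetical helper, not from the original post: counts
// whitespace-separated tokens as words.
function count_words_by_whitespace($text)
{
    // Decode entities and drop HTML tags, as in the question's code.
    $text = strip_tags(html_entity_decode($text, ENT_QUOTES));
    // Split on runs of whitespace (spaces, tabs, line breaks).
    // PREG_SPLIT_NO_EMPTY drops empty pieces, so leading or trailing
    // whitespace and an empty input need no special handling.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return count($words);
}
```

With this version no trim() call, separator string, or last-element check is needed.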