I'm using strnatcmp in my comparison function for sorting person names in a table. For our Belgian client, we get some strange results. They have names like 'Van der Broecke' and 'Vander Veere', and strnatcasecmp("Van der", "Vander") returns 0!
As natural comparison aims to sort as a human would, I don't understand why the spaces are completely disregarded.
E.g.:
$names = array("Van de broecke", "Vander Veere", "Vande Muizen", "Vander Zoeker", "Van der Programma", "vande Huizen", "vande Kluizen", "vander Muizen", "Van der Luizen");
natcasesort($names);
print_r($names);
Gives:
Array (
[0] => Van de broecke
[5] => vande Huizen
[6] => vande Kluizen
[2] => Vande Muizen
[8] => Van der Luizen
[7] => vander Muizen
[4] => Van der Programma
[1] => Vander Veere
[3] => Vander Zoeker
)
But a human would say:
Array (
[0] => Van de broecke
[4] => Van der Programma
[8] => Van der Luizen
[5] => vande Huizen
[6] => vande Kluizen
[2] => Vande Muizen
[7] => vander Muizen
[1] => Vander Veere
[3] => Vander Zoeker
)
My solution now is to replace all spaces with underscores, which are handled properly. Two questions:
Why does natsort work like this?
Is there a better solution?
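For reference, the underscore workaround mentioned above looks roughly like this. It is only a sketch: `compareNames` is a name made up for this example, and it assumes the names themselves never contain underscores.

```php
<?php
// Hypothetical comparison callback: swap spaces for underscores so that
// strnatcasecmp() no longer skips them as whitespace.
function compareNames($a, $b)
{
    return strnatcasecmp(
        str_replace(' ', '_', $a),
        str_replace(' ', '_', $b)
    );
}

$names = array("Van der Programma", "Vander Veere", "Van de broecke", "vande Huizen");
usort($names, 'compareNames');

// With the spaces replaced, the two prefixes no longer compare as equal.
var_dump(compareNames("Van der", "Vander") !== 0); // bool(true)
```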
If you look at the source code, you can see where this happens, and it definitely looks like a bug: http://gcov.php.net/PHP_5_3/lcov_html/ext/standard/strnatcmp.c.gcov.php (scroll down to line 130):
// inside a while loop...
/* Skip consecutive whitespace */
while (isspace((int)(unsigned char)ca)) {
    ca = *++ap;
}
while (isspace((int)(unsigned char)cb)) {
    cb = *++bp;
}
Note that's a link to 5.3, but the same code still exists in 5.5 (http://gcov.php.net/PHP_5_5/lcov_html/ext/standard/strnatcmp.c.gcov.php). Admittedly my knowledge of C is limited, but this appears to advance the pointer on each string whenever the current character is whitespace, so that character is effectively ignored in the comparison. The comment implies that this only happens when the whitespace is consecutive; however, nothing checks that the previous character was actually whitespace first. That would need something like:
// declare these outside the loop
int prevAIsSpace = 0;
int prevBIsSpace = 0;

// ...in the loop
while (prevAIsSpace && isspace((int)(unsigned char)ca)) {
    // won't get here the first time since prevAIsSpace == 0
    ca = *++ap;
}
// now if the character is a space, flag it for the next iteration
prevAIsSpace = isspace((int)(unsigned char)ca);

// repeat with string b
while (prevBIsSpace && isspace((int)(unsigned char)cb)) {
    cb = *++bp;
}
prevBIsSpace = isspace((int)(unsigned char)cb);
Someone who actually knows C could probably write this better, but that's the general idea.
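A quick way to test that reading from PHP, without touching the C source: if whitespace really is skipped, comparing the original strings should give the same result as comparing copies with the spaces stripped out. This is only a sketch; `stripSpaces` is a helper invented here, and the first result depends on the PHP build still containing the loop quoted above.

```php
<?php
// Hypothetical helper: remove every space character from a string.
function stripSpaces($s)
{
    return str_replace(' ', '', $s);
}

$a = "Van der";
$b = "Vander";

// Result on the original strings (0 according to the question).
var_dump(strnatcasecmp($a, $b));

// Result with the spaces removed by hand; the strings are then identical.
var_dump(strnatcasecmp(stripSpaces($a), stripSpaces($b))); // int(0)
```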
On another potentially interesting note, for your example, if you're using PHP >= 5.4, this gives the same result as the usort mentioned by Aaron Saray (it does lose the key/value associations as well):
sort($names, SORT_FLAG_CASE | SORT_STRING);
print_r($names);
Array (
[0] => Van de broecke
[1] => Van der Luizen
[2] => Van der Programma
[3] => vande Huizen
[4] => vande Kluizen
[5] => Vande Muizen
[6] => vander Muizen
[7] => Vander Veere
[8] => Vander Zoeker
)
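If the original keys matter, asort accepts the same flags and keeps the key/value associations. A minimal sketch using the names from the question (it assumes PHP >= 5.4, since that is where SORT_FLAG_CASE was introduced):

```php
<?php
$names = array(
    "Van de broecke", "Vander Veere", "Vande Muizen", "Vander Zoeker",
    "Van der Programma", "vande Huizen", "vande Kluizen", "vander Muizen",
    "Van der Luizen",
);

// asort() sorts by value like sort(), but preserves the original keys.
asort($names, SORT_FLAG_CASE | SORT_STRING);
print_r($names);
```

The values come out in the same order as with sort, but under their original keys: 0, 8, 4, 5, 6, 2, 7, 1, 3.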