Why are spaces ignored in natsort / strnatcmp / strnatcasecmp?
Asked Answered
S

3

8

I'm using strnatcmp in my comparison function for sorting person names in a table. For our Belgian client, we get some strange results. They have names like 'Van der Broecke' and 'Vander Veere', and strnatcasecmp("Van der", "Vander") returns 0!

As natural comparison aims to sort as a human would, I don't understand why the spaces are completely disregarded.

E.g.:

$names = array("Van de broecke", "Vander Veere", "Vande Muizen", "Vander Zoeker", "Van der Programma", "vande Huizen", "vande Kluizen", "vander Muizen", "Van der Luizen");
natcasesort($names);

print_r($names);

Gives:

Array ( 
[0] => Van de broecke 
[5] => vande Huizen 
[6] => vande Kluizen 
[2] => Vande Muizen 
[8] => Van der Luizen 
[7] => vander Muizen 
[4] => Van der Programma 
[1] => Vander Veere 
[3] => Vander Zoeker 
)

But a human would say:

Array ( 
[0] => Van de broecke 
[4] => Van der Programma 
[8] => Van der Luizen 
[5] => vande Huizen 
[6] => vande Kluizen 
[2] => Vande Muizen 
[7] => vander Muizen 
[1] => Vander Veere 
[3] => Vander Zoeker 
)

My solution now is to replace all spaces with underscores, which are handled properly. Two questions: Why does natsort work like this? Is there a better solution?

Strophe answered 16/8, 2013 at 15:3 Comment(5)
As I understand, natsort sorts strings by numbers as humans do, it doesn't sort correctly if there's no numbersLamrert
@Baldrs No. PHP.net: "This function implements a sort algorithm that orders alphanumeric strings in the way a human being would while maintaining key/value associations. "Strophe
A key phrase is alphanumeric strings.Lamrert
@Baldrs Exactly. That's the key phrase. "Van der broecke" is an alphanumeric string. en.wikipedia.org/wiki/AlphanumericStrophe
@Baldrs Note that _ or + are sorted. However spaces are ignored. So the alphanumeric here is not strict.Benthos
T
2

If you look in the source code you can actually see this, which definitely seems like a bug: http://gcov.php.net/PHP_5_3/lcov_html/ext/standard/strnatcmp.c.gcov.php (scroll down to line 130):

 //inside a while loop...

 /* Skip consecutive whitespace */
 while (isspace((int)(unsigned char)ca)) {
         ca = *++ap;
 }

 while (isspace((int)(unsigned char)cb)) {
         cb = *++bp;
 }

Note that's a link to 5.3, but the same code still exists in 5.5 (http://gcov.php.net/PHP_5_5/lcov_html/ext/standard/strnatcmp.c.gcov.php) Admittedly my knowledge of C is limited, but this basically appears to be advancing the pointer on each string if the current character is a space, essentially ignoring that character in the sort. The comment implies that it's only doing this if the whitespaces are consecutive; however, there is no check to ensure the previous character was actually a space first. That would need something like

//declare these outside the loop
short prevAIsSpace = 0;
short prevBIsSpace = 0;

//....in the loop
while (prevAIsSpace && isspace((int)(unsigned char)ca)) {
    //won't get here the first time since prevAIsSpace == 0
    ca = *++ap;
}
//now if the character is a space, flag it for the next iteration
prevAIsSpace = isspace((int)(unsigned char)ca));
//repeat with string b
while (prevBIsSpace && isspace((int)(unsigned char)cb)) {
    cb = *++bp;
}
prevBIsSpace = isspace((int)(unsigned char)cb));

Someone who actually knows C could probably write this better, but that's the general idea.

On another potentially interesting note, for your example, if you're using PHP >= 5.4, this gives the same result as the usort mentioned by Aaron Saray (it does lose the key/value associations as well):

sort($names, SORT_FLAG_CASE | SORT_STRING);

print_r($names);
Array ( 
    [0] => Van de broecke 
    [1] => Van der Luizen 
    [2] => Van der Programma 
    [3] => vande Huizen 
    [4] => vande Kluizen 
    [5] => Vande Muizen 
    [6] => vander Muizen 
    [7] => Vander Veere 
    [8] => Vander Zoeker
) 
Trickster answered 24/8, 2013 at 23:35 Comment(2)
Very interesting! Sorry for disregarding this question for a while - I was out of the loop for a bit.Strophe
Thanks - if this is what you were looking for would you mind accepting the answer?Trickster
S
2

Take a look at bugs.php.net #26412 (natsort() was compressing multiple spaces to 1 space). Apparently, this behavior is so "aa", "a a", and "a a" (note the 2 spaces) do not sort as identical strings.

Setscrew answered 21/8, 2013 at 17:15 Comment(1)
An unaccepted bug report from 2003! :) But it is, in fact, the same core problem.. But I don't understand your conclusion of why this would be. They seem to do the opposite to me: sort aa, a a and a a as identical, the latter two being absolutely identical, and the former being closer to a a than to ab. It's even more Kolmogorovian complex to add a space and a character... I'm still puzzledStrophe
G
2

Like other answers/commentors have said, this is a known issue. However, you can write your own sort with usort(). Please try this and see if it works:

usort($names2, function($first, $second) {
    if ($first == $second) {
        return 0;
    }
    else {
        return (strtolower($first) < strtolower($second)) ? -1 : 1;
}
});

I noticed the output is slightly different than your suggested answer:

You suggested:

[4] => Van der Programma 
[8] => Van der Luizen

But I'm sure this was a typo - these should be swapped. :)

Guttersnipe answered 23/8, 2013 at 19:32 Comment(0)
T
2

If you look in the source code you can actually see this, which definitely seems like a bug: http://gcov.php.net/PHP_5_3/lcov_html/ext/standard/strnatcmp.c.gcov.php (scroll down to line 130):

 //inside a while loop...

 /* Skip consecutive whitespace */
 while (isspace((int)(unsigned char)ca)) {
         ca = *++ap;
 }

 while (isspace((int)(unsigned char)cb)) {
         cb = *++bp;
 }

Note that's a link to 5.3, but the same code still exists in 5.5 (http://gcov.php.net/PHP_5_5/lcov_html/ext/standard/strnatcmp.c.gcov.php) Admittedly my knowledge of C is limited, but this basically appears to be advancing the pointer on each string if the current character is a space, essentially ignoring that character in the sort. The comment implies that it's only doing this if the whitespaces are consecutive; however, there is no check to ensure the previous character was actually a space first. That would need something like

//declare these outside the loop
short prevAIsSpace = 0;
short prevBIsSpace = 0;

//....in the loop
while (prevAIsSpace && isspace((int)(unsigned char)ca)) {
    //won't get here the first time since prevAIsSpace == 0
    ca = *++ap;
}
//now if the character is a space, flag it for the next iteration
prevAIsSpace = isspace((int)(unsigned char)ca));
//repeat with string b
while (prevBIsSpace && isspace((int)(unsigned char)cb)) {
    cb = *++bp;
}
prevBIsSpace = isspace((int)(unsigned char)cb));

Someone who actually knows C could probably write this better, but that's the general idea.

On another potentially interesting note, for your example, if you're using PHP >= 5.4, this gives the same result as the usort mentioned by Aaron Saray (it does lose the key/value associations as well):

sort($names, SORT_FLAG_CASE | SORT_STRING);

print_r($names);
Array ( 
    [0] => Van de broecke 
    [1] => Van der Luizen 
    [2] => Van der Programma 
    [3] => vande Huizen 
    [4] => vande Kluizen 
    [5] => Vande Muizen 
    [6] => vander Muizen 
    [7] => Vander Veere 
    [8] => Vander Zoeker
) 
Trickster answered 24/8, 2013 at 23:35 Comment(2)
Very interesting! Sorry for disregarding this question for a while - I was out of the loop for a bit.Strophe
Thanks - if this is what you were looking for would you mind accepting the answer?Trickster

© 2022 - 2024 — McMap. All rights reserved.