Natural order sorting
Tuesday, February 8th, 2005
It’s sometimes humbling to learn new things about programming which are actually really basic stuff. Things that should have been obvious but which you never really thought off.
Take for instance the subject of sorting. Sorting is done via an algorithm which compares strings (a collection of characters). There are different types of comparisons from a programmers perspective:
- Normal comparison
- Case in-sensitive comparison
- Numerical comparison (comparing two strings which represent numbers)
That last one, numerical comparison, is about ASCII values (actually, the second is too, but less so). Normal string comparison compares two strings using the value of the individual characters. For instance the string “abc” consists of the ASCII values 97, 98 and 99 and “bcd” of 98, 99 and 100, so “abc” is smaller than “bcd”.
However if we have the string “100” and the string “20” and these are compared using the above string comparison routine then “100” will be smaller than “20” because “100” has the ASCII values 49, 48 and 48 and “20” has the ASCII values 50, 48. 49 is smaller than 50, so “100” is seen as smaller than “20”. (This, incidentally, is called lexicographical comparison) This is what the numerical comparison is about. Instead of comparing the ASCII values, it compares the real values of the string (100 v.s. 20 instead of 49 v.s. 50).
Nothing really new for me about this.
But what if you’ve got the following strings? “img1.gif”, “img2.gif” “img10.gif” “img15.gif”? They will need to be sorted by ASCII value, because the string contains letters, but also by numerical value because it contains numbers. If we use a normal comparison function the list will get sorted like this:
- img1.gif
- img10.gif
- img15.gif
- img2.gif
Also not new for me, and in this case quite easy to circumvent by stripping of the first three chars. But things can get quite complex pretty quickly. What if you’ve got these strings: “1-2”, “1-02”, “1-20”, “10-20”? I never really thought about how these things were handled since I never really needed it.
Turns out though that most programming languages have a comparison function (or lib) for this so called ‘Natural Order comparison‘ (sometimes also called ‘collate‘ comparison). Here’s a little list:
- PHP: natsort(), natcasesort(), strnatcmp(), strnatcasecmp
- C: Custom algorithm (by Martin Pool), g_utf8_collate() (Glib library)
- Perl: Sort-Naturally (CPAN)
There’s probably one for your favorite language too. From now on I’m gonna use it instead of the default strcmp().