Efficient string sorting algorithm
Asked Answered
C

4

15

Sorting strings by comparisons (e.g. standard QuickSort + strcmp-like function) may be a bit slow, especially for long strings sharing a common prefix (the comparison function takes O(s) time, where s is the length of string), thus a standard solution has the complexity of O(s * nlog n). Are there any known faster algorithms?

Callender answered 7/8, 2011 at 11:59 Comment(2)
Is it causing your code to be slow? If not, don't worry about it.Jeffries
It's not the first time, when I encounter this problem, but yes, at the moment this sorting is a part, where my program spends a lot of time.Callender
G
17

If you know that the string consist only of certain characters (which is almost always the case), you can use a variant of BucketSort or RadixSort.

Gelman answered 7/8, 2011 at 12:23 Comment(1)
I did a hybrid solution to firstly sort the suffixes of strings using quicksort and then the rest using radixsort. It works pretty fast. I didn't want to go with pure radix sort as some strings are long, but the suffixes are rather different so it didn't hurt to sort them using quicksort. I think there is still room for improvement, but for now this solution is enough. ThanksCallender
C
11

You could build a trie, which should be O(s*n), I believe.

Causation answered 7/8, 2011 at 12:5 Comment(3)
@tyz: Insertion into a trie should be O(s), and you need to do it n times.Causation
I have to think about it, in straightforward implementation it seems to be a bit memory-hungry.Callender
@Piotr: At the moment, your question only asks about computational complexity. Other issues, such as memory complexity or cache efficiency may often dominate!Causation
B
4

Please search for "Sedgewick Multikey quick sort" (Sedgewick wrote famous algorithms textbooks in C and Java). His algorithm is relatively easy to implement and quite fast. It avoids the problem you are talking above. There is the burst sort algorithm which claims to be faster, but I don't know of any implementation.

There is an article Fast String Sort in C# and F# that describes the algorithm and has a reference to Sedgewick's code as well as to C# code. (disclosure: it's an article and code that I wrote based on Sedgewick's paper).

Bruit answered 10/10, 2015 at 13:47 Comment(0)
M
1

Summary

I found the string_sorting repo by Tommi Rantala comprehensive, it includes many known efficient (string) sorting algorithms, e.g. MSD radix sort, burstsort and multi-key-quicksort. In addition, most of them are also cache efficient.

My Experience

It appears to me three-way radix/string quicksort is one of the fastest string sorting algorithms. Also, MSD radix sort is a good one. They are introduced in Sedgewick's excellent Algorithms book.

Here are some results to sort leipzig1M.txt taken from here:

$ wc leipzig1M.txt
#  lines       words       characters
  1'000'000  21'191'455   129'644'797    leipzig1M.txt
Method Time
Hoare 7.8792s
Quick3Way 7.5074s
Fast3Way 5.78015s
RadixSort 4.86149s
Quick3String 4.3685s
Heapsort 32.8318s
MergeSort 16.94s
std::sort/introsort 6.10666s
MSD+Q3S 3.74214s

The charming thing about three-way radix/string quicksort is it is really simple to implement, effectively only about ten source lines of code.

template<typename RandomIt>
void insertion_sort(RandomIt first, RandomIt last, size_t d)
{
    const int len = last - first;
    for (int i = 1; i < len; ++i) {
        // insert a[i] into the sorted sequence a[0..i-1]
        for (int j = i; j > 0 && std::strcmp(&(*(first+j))[d], &(*(first+j-1))[d]) < 0; --j)
            iter_swap(first + j, first + j - 1);
    }
}

template<typename RandomIt>
void quick3string(RandomIt first, RandomIt last, size_t d)
{
    if (last - first < 2) return;
#if 0 // seems not to help much
    if (last - first <= 8) { // change the threshold as you like
        insertion_sort(first, last, d);
        return;
    }
#endif
    typedef typename std::iterator_traits<RandomIt>::value_type String;
    typedef typename string_traits<String>::value_type CharT;
    typedef std::make_unsigned_t<CharT> UCharT;

    RandomIt lt = first, i = first + 1, gt = last - 1;
    /* make lo = median of {lo, mid, hi} */
    RandomIt mid = lt + ((gt - lt) >> 1);
    if ((*mid)[d] < (*lt)[d]) iter_swap(lt, mid);
    if ((*mid)[d] < (*gt)[d]) iter_swap(gt, mid);
    // now mid is the largest of the three, then make lo the median
    if ((*lt)[d] < (*gt)[d]) iter_swap(lt, gt);

    UCharT pivot = (*first)[d];
    while (i <= gt) {
        int diff = (UCharT) (*i)[d] - pivot;
        if      (diff < 0) iter_swap(lt++, i++);
        else if (diff > 0) iter_swap(i, gt--);
        else               ++i;
    }
    // Now a[lo..lt-1] < pivot = a[lt..gt] < a[gt+1..hi].
    quick3string(first, lt, d);      // sort a[lo..lt-1]
    if (pivot != '\0')
        quick3string(lt, gt+1, d+1); // sort a[lt..gt] on following character
    quick3string(gt+1, last, d);     // sort a[gt+1..hi]
}

/*
 * Three-way string quicksort.
 * Similar to MSD radix sort, we first sort the array on the leading character
 * (using quicksort), then apply this method recursively on the subarrays. On
 * first sorting, a pivot v is chosen, then partition it in 3 parts, strings
 * whose first character are less than v, equal to v, and greater than v. Just
 * like the partitioning in classic quicksort but with comparing only the 1st
 * character instead of the whole string. After partitioning, only the middle
 * (equal-to-v) part can sort on the following character (index of d+1). The
 * other two recursively sort on the same depth (index of d) because these two
 * haven't been sorted on the dth character (just partitioned them: <v or >v).
 *
 * Time complexity: O(N~N*lgN), space complexity: O(lgN).
 * Explaination: N * string length (for partitioning, find equal-to-v part) +
 *               O(N*lgN) (to do the quicksort thing)
 * character comparisons (instead of string comparisons in normal quicksort).
 */
template<typename RandomIt>
void str_qsort(RandomIt first, RandomIt last)
{
    quick3string(first, last, 0);
}

NOTE: But if you like me searching Google for "fastest string sorting algorithm", chances are it's burstsort, a cache-aware MSD radix sort variant (paper). I also found this paper by Bentley and Sedgewick helpful, which used a Multikey Quicksort.

Mohsen answered 4/3, 2022 at 9:24 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.