Sorting strings by comparisons (e.g. standard QuickSort + strcmp-like function) may be a bit slow, especially for long strings sharing a common prefix (the comparison function takes O(s) time, where s is the length of string), thus a standard solution has the complexity of O(s * nlog n). Are there any known faster algorithms?
If you know that the string consist only of certain characters (which is almost always the case), you can use a variant of BucketSort or RadixSort.
You could build a trie, which should be O(s*n)
, I believe.
O(s)
, and you need to do it n
times. –
Causation Please search for "Sedgewick Multikey quick sort" (Sedgewick wrote famous algorithms textbooks in C and Java). His algorithm is relatively easy to implement and quite fast. It avoids the problem you are talking above. There is the burst sort algorithm which claims to be faster, but I don't know of any implementation.
There is an article Fast String Sort in C# and F# that describes the algorithm and has a reference to Sedgewick's code as well as to C# code. (disclosure: it's an article and code that I wrote based on Sedgewick's paper).
Summary
I found the string_sorting repo by Tommi Rantala comprehensive, it includes many known efficient (string) sorting algorithms, e.g. MSD radix sort, burstsort and multi-key-quicksort. In addition, most of them are also cache efficient.
My Experience
It appears to me three-way radix/string quicksort is one of the fastest string sorting algorithms. Also, MSD radix sort is a good one. They are introduced in Sedgewick's excellent Algorithms book.
Here are some results to sort leipzig1M.txt taken from here:
$ wc leipzig1M.txt
# lines words characters
1'000'000 21'191'455 129'644'797 leipzig1M.txt
Method | Time |
---|---|
Hoare | 7.8792s |
Quick3Way | 7.5074s |
Fast3Way | 5.78015s |
RadixSort | 4.86149s |
Quick3String | 4.3685s |
Heapsort | 32.8318s |
MergeSort | 16.94s |
std::sort/introsort | 6.10666s |
MSD+Q3S | 3.74214s |
The charming thing about three-way radix/string quicksort is it is really simple to implement, effectively only about ten source lines of code.
template<typename RandomIt>
void insertion_sort(RandomIt first, RandomIt last, size_t d)
{
const int len = last - first;
for (int i = 1; i < len; ++i) {
// insert a[i] into the sorted sequence a[0..i-1]
for (int j = i; j > 0 && std::strcmp(&(*(first+j))[d], &(*(first+j-1))[d]) < 0; --j)
iter_swap(first + j, first + j - 1);
}
}
template<typename RandomIt>
void quick3string(RandomIt first, RandomIt last, size_t d)
{
if (last - first < 2) return;
#if 0 // seems not to help much
if (last - first <= 8) { // change the threshold as you like
insertion_sort(first, last, d);
return;
}
#endif
typedef typename std::iterator_traits<RandomIt>::value_type String;
typedef typename string_traits<String>::value_type CharT;
typedef std::make_unsigned_t<CharT> UCharT;
RandomIt lt = first, i = first + 1, gt = last - 1;
/* make lo = median of {lo, mid, hi} */
RandomIt mid = lt + ((gt - lt) >> 1);
if ((*mid)[d] < (*lt)[d]) iter_swap(lt, mid);
if ((*mid)[d] < (*gt)[d]) iter_swap(gt, mid);
// now mid is the largest of the three, then make lo the median
if ((*lt)[d] < (*gt)[d]) iter_swap(lt, gt);
UCharT pivot = (*first)[d];
while (i <= gt) {
int diff = (UCharT) (*i)[d] - pivot;
if (diff < 0) iter_swap(lt++, i++);
else if (diff > 0) iter_swap(i, gt--);
else ++i;
}
// Now a[lo..lt-1] < pivot = a[lt..gt] < a[gt+1..hi].
quick3string(first, lt, d); // sort a[lo..lt-1]
if (pivot != '\0')
quick3string(lt, gt+1, d+1); // sort a[lt..gt] on following character
quick3string(gt+1, last, d); // sort a[gt+1..hi]
}
/*
* Three-way string quicksort.
* Similar to MSD radix sort, we first sort the array on the leading character
* (using quicksort), then apply this method recursively on the subarrays. On
* first sorting, a pivot v is chosen, then partition it in 3 parts, strings
* whose first character are less than v, equal to v, and greater than v. Just
* like the partitioning in classic quicksort but with comparing only the 1st
* character instead of the whole string. After partitioning, only the middle
* (equal-to-v) part can sort on the following character (index of d+1). The
* other two recursively sort on the same depth (index of d) because these two
* haven't been sorted on the dth character (just partitioned them: <v or >v).
*
* Time complexity: O(N~N*lgN), space complexity: O(lgN).
* Explaination: N * string length (for partitioning, find equal-to-v part) +
* O(N*lgN) (to do the quicksort thing)
* character comparisons (instead of string comparisons in normal quicksort).
*/
template<typename RandomIt>
void str_qsort(RandomIt first, RandomIt last)
{
quick3string(first, last, 0);
}
NOTE: But if you like me searching Google for "fastest string sorting algorithm", chances are it's burstsort, a cache-aware MSD radix sort variant (paper). I also found this paper by Bentley and Sedgewick helpful, which used a Multikey Quicksort.
© 2022 - 2024 — McMap. All rights reserved.