Why don't we use tries for sorting if they take O(n) time to sort a list?

Here is a description of an algorithm to sort strings using a trie:

The algorithm first inserts all the words into the trie in O(n) time, where n is the total number of characters in the list of words to be sorted.

Then it traverses the trie depth-first, visiting children in alphabetical order, and prints the accumulated prefix whenever it reaches a node with the is_end flag set. This is a complete traversal of the trie, which takes time proportional to the number of trie nodes. That number is bounded by n, so this step is bounded by O(n) as well.

The whole algorithm therefore consists of two sub-routines, each bounded by O(n). If we say, e.g., that the average word contains c characters and that m is the number of words, then cm = n, and for constant c the total runtime is bounded by O(n) = O(cm) = O(m). (I switch to m because that, rather than the total number of characters, is the traditional measure of the length of a list to be sorted.)
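
For concreteness, here is a minimal sketch of what I mean in Python (the dict-based node layout and the END sentinel standing in for the is_end flag are just illustrative choices):

    # Sketch of the trie sort described above: insert every word, then
    # walk the trie depth-first in alphabetical order.
    def trie_sort(words):
        END = object()                     # sentinel playing the role of is_end
        root = {}                          # each node: char -> child dict
        for word in words:                 # O(n) total, n = total characters
            node = root
            for ch in word:
                node = node.setdefault(ch, {})
            node[END] = True

        out = []
        def walk(node, prefix):
            if END in node:
                out.append(prefix)
            # sorting the child keys is O(1) for a fixed alphabet
            for ch in sorted(k for k in node if k is not END):
                walk(node[ch], prefix + ch)
        walk(root, "")
        return out

    print(trie_sort(["banana", "apple", "app", "cherry"]))
    # ['app', 'apple', 'banana', 'cherry']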

Therefore, my question is: if this runtime analysis is correct, why isn't this the default method for string sorting, as it would be faster than any O(n log n) sorting algorithm?

Cienfuegos answered 23/2, 2020 at 13:11 Comment(1)
“Why isn’t” is hard to answer. Perhaps you have just invented something novel, perhaps you’re operating under false assumptions of performance. Have you tried implementing it and comparing it to other algorithms? – Piliform

The O(n log n) lower bound is for comparison sorts, i.e. sorts where the elements in the array can only be compared with each other to check whether one should be before or after the other, or if they are equal. That is a good model for general sorting algorithms because it works for pretty much any type of data you might want to sort: numbers, strings, instances of user-defined classes, and so on. It could even be a data type that can be mapped by a key function to some other data type supporting comparisons; or you could accept a comparator function to do the comparisons with.
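
For instance, in Python the same comparison sort handles any element type through a key or comparator function (a small illustration):

    from functools import cmp_to_key

    # A comparison sort only ever asks "does a come before b?", so the
    # same algorithm sorts numbers, strings, or records alike.
    people = [("Lee", 30), ("Ada", 36), ("Lee", 25)]

    # key function: map each record to something comparable
    print(sorted(people, key=lambda p: (p[0], p[1])))
    # [('Ada', 36), ('Lee', 25), ('Lee', 30)]

    # comparator: answer the before/after/equal question directly
    def by_age_desc(a, b):
        return b[1] - a[1]
    print(sorted(people, key=cmp_to_key(by_age_desc)))
    # [('Ada', 36), ('Lee', 30), ('Lee', 25)]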

Note that O(n log n) here is a lower bound on the number of comparisons, not on the running time. If comparisons take more than O(1) time each, say because you're comparing long strings with long common prefixes, then the running time is O(cn log n), where each comparison takes O(c) time. Comparing strings of length w takes O(w) time in the worst case, for example.


If you only need a sorting algorithm for a specific type of data, then you may be able to do better, because other operations specific to that data type can be performed on the elements. For example, when sorting integers, you can use the array elements to index another array, giving the counting sort algorithm which runs in O(n + r) time where r is the range of the array elements.
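
A minimal counting sort sketch, assuming the keys are small non-negative integers (negative or very large keys would need adjustments):

    def counting_sort(a, r):
        counts = [0] * r                    # counts[v] = occurrences of v
        for v in a:                         # one pass over the input: O(n)
            counts[v] += 1
        out = []
        for v, c in enumerate(counts):      # one pass over the range: O(r)
            out.extend([v] * c)
        return out

    print(counting_sort([3, 1, 4, 1, 5, 2], r=6))   # [1, 1, 2, 3, 4, 5]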

If the sorting keys are like strings, in the sense that they are (or can be mapped to) sequences such that comparing keys is equivalent to lexicographically comparing those sequences, then indeed you can use a trie to sort an array containing that data type. Congratulations: you have independently reinvented the radix sort algorithm, which can be implemented using tries. Its running time is O(wn), not O(n), because it takes O(w) time to insert a string of length w into the trie and you have to do that n times.
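
Radix sort doesn't have to materialize a trie at all; a common formulation does one stable bucket pass per character position. A minimal LSD sketch, under the simplifying assumptions of equal-length lowercase strings:

    def lsd_radix_sort(strings, w):
        # w passes, each touching all n strings: O(wn) overall
        for pos in range(w - 1, -1, -1):            # least significant first
            buckets = [[] for _ in range(26)]       # one bucket per letter
            for s in strings:
                buckets[ord(s[pos]) - ord('a')].append(s)
            strings = [s for b in buckets for s in b]   # stable concatenation
        return strings

    print(lsd_radix_sort(["dab", "cab", "fad", "bad", "ace"], w=3))
    # ['ace', 'bad', 'cab', 'dab', 'fad']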


So, if the elements aren't strings, or "string-like" in the above sense, then radix sort simply isn't applicable. If the elements are strings or "string-like", then radix sort works but it takes O(wn) time instead of O(cn log n).

This means radix sort isn't strictly better; it is probably worse when the common prefixes of the strings are short relative to the strings themselves, which is often the case. For uniformly random strings, regular string comparison takes O(1) time on average, in which case O(n log n) is asymptotically better than radix sort for strings longer than O(log n) characters (since wn then grows faster than n log n).

In practical applications, one should also consider the hidden constants in the asymptotic analysis. Comparison sorts like Timsort have low hidden constants because they access array elements sequentially, which results in fewer cache misses compared to walking a tree whose nodes won't be consecutive in memory.

Unmanly answered 23/2, 2020 at 13:36 Comment(7)
Your first three and a half paragraphs seem gratuitous, since the OP clearly already understands that it's possible to beat O(n log n) with this approach. (But +1 for the parts after that.) – Auscultation
@Auscultation Thanks for your feedback; on re-reading the question I agree that it does seem the OP already understands that, so I hope it doesn't come across as condescending. At least those parts might be useful to other readers. – Unmanly
I guess the radix sort could be improved by not building the whole trie, but expanding trie nodes only as needed to distinguish different strings with the same prefix (represented by the trie node that then needs to get expanded). Then for the mentioned case where the common prefixes of the strings are short relative to the strings themselves, it would do very well. – Infancy
@HeapOverflow That's quite possible. For a fixed alphabet size, I think the average length of a common prefix should be O(log n), since there are exponentially many distinct possible prefixes of each length; but the base of the logarithm is the alphabet size, so it should be O(log n) with a much lower constant than the lower bound for comparison sorts. But for a large alphabet, the trie nodes each need a bigger data structure to hold their children (maybe a hashtable, or another radix tree made by splitting the character into chunks of a few bits each). I guess empirical tests are the way to go. – Unmanly
@Auscultation I thought the beginning paragraphs were helpful. I didn't realize that the "usual" sorting algorithms belong to a specific subclass of sorting algorithms called comparison sorts, so this was useful for me. – Cienfuegos
Is there a more general variant of radix sort where you are not splitting a string into characters, but a data record into search-criteria components? For example, if you want to sort people by last name, then first name, then age, you could build a tree structure using nested sorted maps: the first level maps the last name to the subtree for that last name, where you continue with the first name, and on the last level you map the age to a list of people that all have the same values along the path from the root to that age. I wonder if this is something worth doing. – Cuirbouilli
@Cuirbouilli There's no reason why it wouldn't work, but (1) generally in that case each record has a fixed number of fields, so the tree has a fixed depth, and code which doesn't take advantage of that will have a performance penalty; and (2) generally each field isn't chosen from a finite set of values, so you can't use array indexing to store a node's children; you would need hashtables (or something else) instead, which carries another performance penalty. – Unmanly

Sorting with tries can be asymptotically faster for strings, but it requires building a trie, which costs extra time and memory. In many cases a comparison sort is more flexible and can be done in place.

Binny answered 1/11, 2020 at 16:53 Comment(0)
