When should I choose bucket sort over other sorting algorithms?
Asked Answered
A

1

15

When is bucket sort algorithm the best method to use for sorting? Is there a recommended guide in using them based on the size, type of data structure?

Annmarieannnora answered 26/7, 2015 at 3:36 Comment(2)
Try to narrow down your question. What particular problem are you trying to solve? What have you tried already?Grazier
Bucket sort is mainly useful when input is uniformly distributed over a range. Reference: geeksforgeeks.org/bucket-sort-2Telic
C
40

Bucket sort is a non-comparison based sorting algorithm that assumes it's possible to create an array of buckets and distribute the items to be sorted into those buckets by index. Therefore, as a prerequisite for even using bucket sort in the first place, you need to have some way of obtaining an index for each item. Those indices can't just be from a hash function; they need to satisfy the property that if any object x comes before any object y, then x's bucket index must be no greater than y's bucket index. Many objects have this property - you can sort integers this way by looking at some of the bits of the number, and you can sort strings this way by looking at the first few characters - but many do not.

The advantage of bucket sort is that once the elements are distributed into buckets, each bucket can be processed independently of the others. This means that you often need to sort much smaller arrays as a follow-up step than the original array. It also means that you can sort all of the buckets in parallel with one another. The disadvantage is that if you get a bad distribution into the buckets, you may end up doing a huge amount of extra work for no benefit or a minimal benefit. As a result, bucket sort works best when the data are more or less uniformly distributed or where there is an intelligent way to choose the buckets given a quick set of heuristics based on the input array. Bucket sort also works well if you have a large degree of parallelism available.

Another advantage of bucket sort is that you can use it as an external sorting algorithm. If you need to sort a list that is so huge you can't fit it into memory, you can stream the list through RAM, distribute the items into buckets stored in external files, then sort each file in RAM independently.

Here are a few disadvantages of bucket sort:

  • As mentioned above, you can't apply it to all data types because you need a good bucketing scheme.
  • Bucket sort's efficiency is sensitive to the distribution of the input values, so if you have tightly-clustered values, it's not worth it.
  • In many cases where you could use bucket sort, you could also use another specialized sorting algorithm like radix sort, counting sort, or burstsort instead and get better performance.
  • The performance of bucket sort depends on the number of buckets chosen, which might require some extra performance tuning compared to other algorithms.

I hope this helps give you a sense of the relative advantages and disadvantages of bucket sort. Ultimately, the best way to figure out whether it's a good fit is to compare it against other algorithms and see how it actually does, though the above criteria might help you avoid spending your time comparing it in cases where it's unlikely to work well.

Cineaste answered 24/8, 2015 at 23:22 Comment(7)
One issue is that each bucket needs to be the same size as the original array to handle the worst case scenario, and two sets of buckets are needed if doing a multi-pass bucket only sort. Counting / radix sort makes an initial pass to determine the size of each bucket for each pass, so that only a second array of the same size as the original array is needed (assuming your'e allowed to use the original array as temp storage during the sort).Antarctic
Although the buckets could be sorted in parallel, the process is memory bound and normally there's only one memory bus, so sorting in parallel in memory doesn't accomplish much (may make it worse). The same issue would apply to a solid state drive, and would be worse on a regular hard drive due to seek overhead.Antarctic
@Antarctic You could alternatively use dynamic arrays for each of the buckets that double in capacity every time they overflow, much in the same way that std::vector works in C++ or ArrayList works in Java. That would avoid a huge memory overhead at the cost of slightly slower performance.Cineaste
Seems that overhead would be greater than the overhead in the initial count pass of counting / radix sort.Antarctic
On my system (Intel 2600K 3.4 ghz), using a counting / radix sort on 4 million 64 bit unsigned integers (32 MB of data), sorting 8 bits (1 byte) at a time, so 8 passes, takes less than 3/10ths of a second (in either 32 bit or 64 bit mode), so from a practical standpoint, bucket sort will probably take longer if using dynamic arrays, or run out of space on an array large enough for the sort time to be significant.Antarctic
Correct me if I'm wrong but bucket sort does use comparison and cannot be assumed as a non-comparison based sorting algorithm. en.wikipedia.org/wiki/….Rusty
@IxionChowdhury I'm not familiar with how you would implement bucket sort using comparisons; my impression is that the idea of distributing items into buckets was that you would use a non-comparison step to do this. If you're going to use a comparison-based approach to drop items into buckets, you're probably better off just using quicksort, heapsort, etc. because those are such fast comparison-based sorts. (I didn't see a source on the claim listed in Wikipedia - I wonder what that's in reference to?)Cineaste

© 2022 - 2024 — McMap. All rights reserved.