How to optimize quicksort
I am trying to work out an efficient quicksort algorithm. It works okay, but it takes a long time to run when the number of elements is huge and certain sections of the array are pre-sorted. I was looking up the Wikipedia article on quicksort, and there I found this written:

To make sure at most O(log N) space is used, recurse first into the smaller half of the array, and use a tail call to recurse into the other.

Use insertion sort, which has a smaller constant factor and is thus faster on small arrays, for invocations on such small arrays (i.e. where the length is less than a threshold t determined experimentally). This can be implemented by leaving such arrays unsorted and running a single insertion sort pass at the end, because insertion sort handles nearly sorted arrays efficiently. A separate insertion sort of each small segment as they are identified adds the overhead of starting and stopping many small sorts, but avoids wasting effort comparing keys across the many segment boundaries, which keys will be in order due to the workings of the quicksort process. It also improves the cache use.

I am currently recursing for both partitions. Any idea how to implement the first tip? What is meant by "recurse first into the smaller half of the array, and use a tail call to recurse into the other"? And secondly, how can I implement an insertion sort within quicksort? Will it always improve efficiency, or only when certain sections of the array are pre-sorted? If it is the second case, then of course I have no way of knowing when that will occur. So when should I include the insertion sort?

Jeep answered 17/9, 2012 at 7:32 Comment(4)
The constant factors for insertion sort are much lower and there is a cross-over point where, for small arrays of a size which depends on your implementation, your computer, and the alignment of the planets, insertion sort is faster than quicksort.Chantalchantalle
Can you please elaborate? A detailed explanation will be really useful.Jeep
Sure: for n items, the work done by quicksort is A.n.log(n) (in the expected case) while the work done by insertion sort is B.n^2, where A and B are the constant factors corresponding roughly to "cost of instructions executed per iteration". Now B < A since insertion sort has a simpler inner loop, which means that below a certain value of n, insertion sort is cheaper to run than quicksort. The actual values for A and B depend on your implementation and your hardware.Chantalchantalle
Similar question but with code: #19094783Plowshare
S
13

In quicksort, you choose a pivot that splits the array into two parts, and in most cases one part will be smaller than the other.

E.g. for an array of size 100, the pivot may split it 40/60; the 40 side is the smaller part.

Let's assume you set the threshold for using insertion sort at 10. You keep splitting the array recursively around a pivot, and whenever one of the parts becomes smaller than or equal to 10, you sort it with insertion sort, which is fast on such small arrays.

Take into account that insertion sort will behave badly if the array is reverse-sorted (its worst case).

As regards the recursion, you only need to modify the stop case of the quicksort recursion: when the array size is <= 10, stop recursing and sort the whole (now much smaller) array using insertion sort.

By a tail call they mean making the second recursive call the very last action of the method, so that it can be turned into a loop and does not grow the call stack; that is what saves space.

  Quick-sort()
      choose a pivot
      move the smaller elements to the left
      move the bigger elements to the right
      quick-sort the bigger half of the array

      if the other half is smaller than X
         do an insertion sort on it   <- insertion sort takes the place of the tail call
      else
         quick-sort this half as well

As far as I can see, the second optimization suggests not running insertion sort at every recursion step, but instead remembering the indexes of the small slices and then invoking insertion sort in one batch over all of them at the end. This improves cache use, but it is slightly more difficult to implement.
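
A rough sketch in C of that batch variant (the names, the THRESHOLD value, and the last-element Lomuto partition here are illustrative placeholders, not something the answer prescribes). Quicksort simply stops recursing on short slices, and one insertion sort pass over the whole array finishes the job; that pass stays cheap because every element ends up fewer than THRESHOLD positions away from its final place:

#define THRESHOLD 10

void quicksort_coarse(int *a, int lo, int hi)
{
    while (hi - lo + 1 > THRESHOLD)
    {
        /* simple last-element Lomuto partition */
        int pivot = a[hi], i = lo - 1, t;
        for (int j = lo; j < hi; j++)
            if (a[j] < pivot) { i++; t = a[i]; a[i] = a[j]; a[j] = t; }
        t = a[i + 1]; a[i + 1] = a[hi]; a[hi] = t; /* pivot to a[i+1] */

        if (i + 1 - lo < hi - i - 1) /* recurse into the smaller part, */
        {                            /* loop (tail call) on the larger */
            quicksort_coarse(a, lo, i);
            lo = i + 2;
        }
        else
        {
            quicksort_coarse(a, i + 2, hi);
            hi = i;
        }
    }
    /* slices of THRESHOLD or fewer elements are left unsorted */
}

void insertion_sort(int *a, int n)
{
    for (int i = 1; i < n; i++)
    {
        int key = a[i], j = i - 1;
        while (j >= 0 && a[j] > key) { a[j + 1] = a[j]; j--; }
        a[j + 1] = key;
    }
}

void hybrid_sort(int *a, int n)
{
    quicksort_coarse(a, 0, n - 1);   /* leaves small slices unsorted */
    insertion_sort(a, n);            /* one cheap pass finishes the job */
}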

Satisfy answered 17/9, 2012 at 7:48 Comment(11)
What is quicksort's efficiency for a smaller array, compared to O(log N) for insertion-sort?Jeep
Quicksort's efficiency is O(n log(n)) on average, not O(log(n)); insertion sort's efficiency on a small array is more like O(n).Satisfy
But then, even this has its caveats, if the partition is reverse-sorted. And what about the recurse-with-a-tail-call bit?Jeep
Of course it has its caveats. In order to decide whether to use this optimization, you need to see what your data looks like: if it is partially or randomly sorted, e.g. "A,B,E,G,C", then you have good chances of improving your performance; if your initial data looks like "E,D,C,B,A", then you may apply the algorithm after reversing the array. You must base your decision on the distribution of the data. Additionally, I don't see any problem with the quicksort recursion.Satisfy
Thanks. Can you please explain the first point: recurse first into the smaller half of the array, and use a tail call to recurse into the other?Jeep
So basically it's the same as a normal quicksort, except that in the last step, instead of using recursion for both partitions, the larger one is recursed into, whereas the smaller one is sorted by insertion-sort.Jeep
Yes, exactly. This will happen on every recursion step; notice the if condition. You may play with the threshold to achieve maximum performance.Satisfy
Can you explain the last line: It also improves the cache use? How does insertion-sort improve cache use, and how come quicksort doesn't?Jeep
As far as I can see, the second optimization suggests not running insertion sort at every recursion step, but instead remembering the indexes of the small slices and then invoking insertion sort in one batch over all of them at the end. This improves cache use, but it is slightly more difficult to implement.Satisfy
How is that possible at all? suppose as per my threshold (say, 25 elements), I have found that 3 partitions need to be sorted through insertion-sort, first one from index 0 to 11, 2nd one from 34 to 52, 3rd one from 87 to 111. These portions have been left unsorted, right? So how can combining these 3 portions into one array, and running an insertion-sort on this combined array improve performance?Jeep
let us continue this discussion in chatSatisfy
G
8

There are multiple ways to make standard quicksort more efficient. To implement the first tip from your post, you could write something like:

void quicksort(int * tab, int l, int r)
{
   int q;
   while(l < r)
   {
      q = partition(tab, l, r);
      if(q - l < r - q) //recurse into the smaller half
      {
         quicksort(tab, l, q - 1);
         l = q + 1;
      } else
      {
         quicksort(tab, q + 1, r);
         r = q - 1;
      }
   }
}

Hope that's clear enough. The next step would be to implement your own stack (or use one built into whatever language you are using) instead of making recursive calls. Example (pseudo)code:

void quicksort2(int * tab, int l, int r)
{
    int le, ri, q;
    init stack;
    push(l, r, stack);
    while(!empty(stack))
    {
        //take the top pair of values from the stack and set them to le and ri
        pop(le, ri, stack);
        if(le >= ri)
            continue;
        q = partition(tab, le, ri);
        //push the larger half first, so the smaller half is popped and
        //processed first; deferring only the larger halves keeps the
        //stack at O(log N) entries
        if(q - le < ri - q)
        {
            push(q + 1, ri, stack);
            push(le, q - 1, stack);
        } else
        {
            push(le, q - 1, stack);
            push(q + 1, ri, stack);
        }
    }
    delete stack;
}

Then you can proceed to implement the other tip from your post. To do this, you should set some arbitrary constant, let's call it CUT_OFF, to around 20. This tells your algorithm when it should switch to insertion sort. It should be rather easy (a matter of adding one if-statement) to alter the previous example so that it switches to insertion sort once a range shrinks below the CUT_OFF point, so I will leave the details to you.
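
For illustration only, the added check could look roughly like this (insertionsort() is an assumed helper that sorts tab[l..r]; it is not defined in this answer):

const int CUT_OFF = 20;

void quicksort(int * tab, int l, int r)
{
   int q;
   while(l < r)
   {
      if(r - l + 1 <= CUT_OFF) //small range: finish it with insertion sort
      {
         insertionsort(tab, l, r);
         return;
      }
      q = partition(tab, l, r);
      if(q - l < r - q) //recurse into the smaller half
      {
         quicksort(tab, l, q - 1);
         l = q + 1;
      } else
      {
         quicksort(tab, q + 1, r);
         r = q - 1;
      }
   }
}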

As for the partition method, I would recommend using the Lomuto partition scheme instead of Hoare's.
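
For reference, a Lomuto-style partition that plugs into the functions above might look like this (a sketch; it uses the last element as the pivot, which the answer itself does not specify):

int partition(int * tab, int l, int r)
{
   int pivot = tab[r]; //last element as the pivot
   int i = l - 1, tmp;
   for(int j = l; j < r; j++)
   {
      if(tab[j] < pivot) //grow the "less than pivot" prefix
      {
         i++;
         tmp = tab[i]; tab[i] = tab[j]; tab[j] = tmp;
      }
   }
   tmp = tab[i + 1]; tab[i + 1] = tab[r]; tab[r] = tmp; //pivot into place
   return i + 1; //final index of the pivot
}

Any other correct scheme works too, as long as it returns the pivot's final index.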

However, if your data is already pre-sorted, you could consider using a different algorithm altogether. In my experience, a natural merge sort (one that detects and merges the runs that are already in order) implemented on a linked list is a very good choice for pre-sorted data.

Glasper answered 17/9, 2012 at 8:32 Comment(8)
Thanks. I didn't understand your 2nd algorithm. What is being pushed into the stack, and how does it improve the performance?Jeep
You push ranges, i.e. pairs of values (left, right), onto the stack. Basically you try to imitate the call stack using your own stack. This improves performance because it eliminates the need for recursion.Glasper
Both stack and recursion are doing the same thing. So how come one performs better than the other?Jeep
A non-recursive version is almost always faster than a recursive one. That's because, generally speaking, recursion is bad for performance: it involves lots of behind-the-scenes work that adds significant overhead. For more info on this topic, see: #3521Glasper
Thanks. Any reason to use Lomuto instead of Hoare? Both seem to give O(N) time.Jeep
Lomuto is simpler, so while they both have the same complexity, Lomuto should be a fraction faster (fewer operations). It's also more difficult to make a mistake while implementing it. ;)Glasper
Actually, Hoare tends to perform fewer swaps and is generally faster in practice.Stonefly
According to geeksforgeeks, "Hoare’s scheme is more efficient than Lomuto’s partition scheme because it does three times fewer swaps on average, and it creates efficient partitions even when all values are equal."Baynebridge
A
6

Some time ago I wrote a quicksort-based algorithm that you can find there (it is actually a selection algorithm, but it can be used as a sort algorithm too):

The lessons I learned from this experience are the following:

  1. Carefully tune the partition loop of your algorithm. This is often underestimated, but you get a significant performance boost if you take care to write loops that the compiler/CPU can software-pipeline. This alone led to a win of about 50% in CPU cycles.
  2. Hand-coding small sorts gives you a major performance win (see the sketch after this list). When the number of elements to be sorted in a partition is under 8, just don't bother recursing; instead implement a hard-coded sort using just ifs and swaps (have a look at the fast_small_sort function in this code). This can lead to a win of about 50% in CPU cycles, giving the quicksort the same practical performance as a well-written merge sort.
  3. Spend time picking a better pivot value when a "poor" pivot selection is detected. My implementation switches to a "median of medians" algorithm for pivot selection whenever a pivot selection leaves one side with under 16% of the remaining elements to be sorted. This is a mitigation strategy for quicksort's worst-case performance, and helps ensure that in practice the upper bound is also O(n*log(n)) instead of O(n^2).
  4. Optimize for arrays with lots of equal values (when needed). If the arrays to be sorted have lots of equal values, it is worth optimizing for, as equal values otherwise lead to poor pivot selection. In my code I do it by counting all the array entries that are equal to the pivot value. This lets me treat the pivot and all the values equal to it in a faster way, and it doesn't degrade performance when it is not applicable. This is another mitigation strategy for worst-case performance; it reduces the worst-case stack usage by drastically reducing the maximum recursion depth.

I hope this helps, Laurent.

Acetyl answered 17/9, 2012 at 11:3 Comment(5)
The last point is interesting. Suppose I choose a pivot, then searching the array for equal keys will take substantial time in itself, right? And secondly, even if it doesn't, how can I treat the pivot and all the equal values in the array in a faster way?Jeep
Once you have picked a pivot you partition the array in two by doing a comparison with each element in the array, that's where you can compare for equality for the same cost. Look for the equalCount variable in my code, it illustrates one way to do it.Acetyl
Yes, but comparing every element in the partition with the pivot will itself incur a penalty, won't it offset the time gain that might be achieved by noting all equal numbers?Jeep
The quicksort ALREADY performs a comparison of the pivot with every element being sorted, so adding a comparison for equality adds no cost. Once you have partitioned the elements into two sets, and if you know that you have "many" values equal to the pivot, then (and only then) do a second pass just to process the values that are equal to the pivot. The decision to do special processing for "equal elements" is a tradeoff: if equalCount is low and the number of elements remaining to be sorted is big, then don't do it; but if the ratio of equalCount to the number of elements to be sorted is big, then do it.Acetyl
I think #2 is an old trick for old platforms and compilers. In my tests, using a simple insertion sort for small sets (usually < 32) is enough, and should beat hand-coding (usually < 8) on overall performance.Fieldwork
F
2

Tail recursion means turning a recursive call into a loop. For QuickSort, it would be something like:

QuickSort(SortVar)                                                                     
   Granularity = 10                                                            
   SortMax = Max(SortVar)
   /* Put an element after the last one with a higher key than all other
      elements, so that the inner loop cannot run past the end */
   SetMaxKey(SortVar, SortMax+1)

   /* Push the whole interval to sort on stack */               
   Push 1 SortMax                                                              
   while StackSize() > 0                                                       
      /* Pop an interval to sort from stack */
      Pop SortFrom SortTo                                                     

      /* Tail recursion loop */                           
      while SortTo - SortFrom >= Granularity                                

         /* Find the pivot element using median of 3 */                            
         Pivot = Median(SortVar, SortFrom, (SortFrom + SortTo) / 2, SortTo)             
         /* Put the pivot element in front */                                     
         if Pivot > SortFrom then Swap(SortVar, SortFrom, Pivot)

         /* Place elements <=Key to the left and elements >Key to the right */           
         Key = GetKey(SortVar, SortFrom)                                                
         i = SortFrom + 1                                                      
         j = SortTo                                                            
         while i < j                                                        
            while GetKey(SortVar, i) <= Key; i = i + 1; end                          
            while GetKey(SortVar, j) > Key; j = j - 1; end                           
            if i < j then Swap(SortVar, i, j)                                       
         end                                                                   

         /* Put the pivot element back */                            
         if GetKey(SortVar, j) < Key then Swap(SortVar, SortFrom, j)                                         

         if j - SortFrom < SortTo - j then
            /* The left part is smaller - put the larger right part on stack
               (deferring only the larger parts keeps the stack at O(log n)) */
            if SortTo - j > Granularity then Push j+1 SortTo
            /* and do tail recursion on the smaller left part */
            SortTo = j - 1
         end
         else
            /* The right part is smaller - put the larger left part on stack */
            if j - SortFrom > Granularity then Push SortFrom j-1
            /* and do tail recursion on the smaller right part */
            SortFrom = j + 1
         end
      end                                                                      
   end                                                                         

   /* Run insertionsort on the whole array to sort the small intervals */    
   InsertionSort(SortVar)                                                          
return                                                                         

Additionally, there is no reason to call InsertionSort on each small interval separately: when QuickSort is finished, the array is roughly sorted, such that there are only small intervals left to sort. And that is just the perfect case for a single InsertionSort pass over the whole array.

If you don't have a stack, you can use recursion instead - but keep the tail recursion:

QuickSort(SortVar, SortFrom, SortTo)                                                                     
   Granularity = 10                                                            

   /* Tail recursion loop */                           
   while SortTo - SortFrom >= Granularity                                

      /* Find the pivot element using median of 3 */                            
      Pivot = Median(SortVar, SortFrom, (SortFrom + SortTo) / 2, SortTo)             
      /* Put the pivot element in front */                                     
      if Pivot > SortFrom then Swap(SortVar, SortFrom, Pivot)

      /* Place elements <=Key to the left and elements >Key to the right */           
      Key = GetKey(SortVar, SortFrom)                                                
      i = SortFrom + 1                                                      
      j = SortTo                                                            
      while i < j                                                        
         while GetKey(SortVar, i) <= Key; i = i + 1; end                          
         while GetKey(SortVar, j) > Key; j = j - 1; end                           
         if i < j then Swap(SortVar, i, j)                                       
      end                                                                   

      /* Put the pivot element back */                            
      if GetKey(SortVar, j) < Key then Swap(SortVar, SortFrom, j)

      if j - SortFrom < SortTo - j then                                  
         /* The left part is smallest - recursive call */                     
         if j - SortFrom > Granularity then QuickSort(SortVar, SortFrom, j-1)           
         /* and do tail recursion on the right part */                           
         SortFrom = j + 1                                                   
      end                                                                   
      else
         /* The right part is smallest - recursive call */                       
         if SortTo - j > Granularity then QuickSort(SortVar, j+1, SortTo)                   
         /* and do tail recursion on the left part */                         
         SortTo = j - 1                                                     
      end                                                                   
   end                                                                         

   /* Run insertionsort on the whole array to sort the small intervals */    
   InsertionSort(SortVar)                                                          
return                                                                         
Futile answered 5/3, 2014 at 13:40 Comment(0)
C
1

You can take a look at TimSort, which performs better than quicksort on data that is not completely random (they have the same asymptotic complexity, but TimSort has lower constants on such data).

Confectionary answered 17/9, 2012 at 8:52 Comment(0)
I
1

I've recently found this optimization. It works faster than std::sort for me. It uses insertion sort on small arrays and median-of-3 (or the Tukey ninther, for larger subarrays) to choose the partitioning element.

This is my C++ implementation:

#include <vector>
using std::vector;

const int CUTOFF = 8;

template<typename T>
bool less (T &v, T &w)
{
    return (v < w);
}

template<typename T>
bool eq (T &v, T &w)
{
    return w == v;
}

template <typename T>
void swap (T *a, T *b)
{
    T t = *a;
    *a = *b;
    *b = t;
}

template<typename T>
void insertionSort (vector<T>& input, int lo, int hi) 
{
    for (int i = lo; i <= hi; ++i)
    {
        for (int j = i; j > lo && less(input[j], input[j-1]); --j)
        {
            swap(&input[j], &input[j-1]);
        }
    }
}


template<typename T>
int median3 (vector<T>& input, int indI, int indJ, int indK)
{
    return (less(input[indI], input[indJ]) ?
            (less(input[indJ], input[indK]) ? indJ : less(input[indI], input[indK]) ? indK : indI) :
            (less(input[indK], input[indJ]) ? indJ : less(input[indK], input[indI]) ? indK : indI));
}


template <typename T>
void sort(vector<T>& input, int lo, int hi) 
{ 
    int lenN = hi - lo + 1;

    // cutoff to insertion sort
    if (lenN <= CUTOFF) 
    {
        insertionSort(input, lo, hi);
        return;
    }

    // use median-of-3 as partitioning element
    else if (lenN <= 40) 
    {
        int median = median3(input, lo, lo + lenN / 2, hi);
        swap(&input[median], &input[lo]);
    }

    // use Tukey ninther as partitioning element
    else  
    {
        int eps = lenN / 8;
        int mid = lo + lenN / 2;
        int mFirst = median3(input, lo, lo + eps, lo + eps + eps);
        int mMid = median3(input, mid - eps, mid, mid + eps);
        int mLast = median3(input, hi - eps - eps, hi - eps, hi); 
        int ninther = median3(input, mFirst, mMid, mLast);
        swap(&input[ninther], &input[lo]);
    }

    // Bentley-McIlroy 3-way partitioning
    int iterI = lo, iterJ = hi + 1;
    int iterP = lo, iterQ = hi + 1;

    for (;; ) 
    {
        T v = input[lo];
        while (less(input[++iterI], v))
        {
            if (iterI == hi) 
                break;
        }
        while (less(v, input[--iterJ]))
        {
            if (iterJ == lo)    
                break;
        }
        if (iterI >= iterJ) 
            break;
        swap(&input[iterI], &input[iterJ]);
        if (eq(input[iterI], v)) 
            swap(&input[++iterP], &input[iterI]);
        if (eq(input[iterJ], v)) 
            swap(&input[--iterQ], &input[iterJ]);
    }
    swap(&input[lo], &input[iterJ]);

    iterI = iterJ + 1;
    iterJ = iterJ - 1;
    for (int k = lo + 1; k <= iterP; ++k) 
    {
        swap(&input[k], &input[iterJ--]);
    }
    for (int k = hi  ; k >= iterQ; --k)
    {
        swap(&input[k], &input[iterI++]);
    }

    sort(input, lo, iterJ);
    sort(input, iterI, hi);
}
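
If it helps, a minimal way to exercise the code above (my own test scaffolding, not part of the original answer):

#include <cstdio>

int main()
{
    vector<int> v = {5, 3, 8, 3, 9, 1, 7, 3, 2, 6};
    sort(v, 0, (int)v.size() - 1); // sort the whole vector
    for (int x : v)
        printf("%d ", x);          // prints: 1 2 3 3 3 5 6 7 8 9
    printf("\n");
    return 0;
}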
Involution answered 16/10, 2013 at 6:41 Comment(2)
"It works faster than std::sort": this is a bold claim, and should be backed up by a concrete benchmark (specific input and timings)! The implementors of the Standard Library generally know what they are doing. The typical implementation of std::sort also delegates to selection or insertion sort for small subsequences, and also uses smart selection of partitioning elements.Lehman
Nice optimization! I replaced std::sort() with your version, and it ran 21% faster with full optimizations enabled (compiler is Visual Studio C++ 2013). Thanks! =)Aguila
