Partially sorting an array C
Asked Answered
W

4

6

I have an array which looks like this:

int array[] = {4.53, 3.65, 7.43, 9.54, 0.72, 0.0}

I am just wondering what method I can use to partially sort this array to bring the top three biggest doubles to the front. I am looking for the most efficient method to get the top three highest numbers in this array.

So far I have been using qsort, but I am just looking for another method to do this which could be even faster. I know that qsort is O(nlogn) for best cases and O(n^2) for worst cases, but is there an even more efficient method to achieve this problem? What I mean by efficient is just a faster way to do it, better than O(nlogn).

Any help would be great

Whipcord answered 15/10, 2016 at 17:15 Comment(10)
How would you find just the smallest number? Do that with three variables.Esculent
Would that be more efficient than qsort?Whipcord
@Whipcord finding the most extreme value in an array is O(n) and finding the k most extreme values in an array is O(kn) and assuming k<n it is indeed faster.Wire
The method is usually called quick-select. en.wikipedia.org/wiki/QuickselectOminous
Quickselect gives an average of O(n) regardless of k. However, in this case, because k is 3 Nikhil's method may be quickest.Madra
@deamentiaemundi: It can be done more quickly than that actually. Creating a heap from an array can be done in a single O(n) step, and popping off k values from it costs O(log n) per item. So you can make the cost O(max(n, k log n)) (which means if k == n, you just paid roughly the expected O(n log n) cost of a full sort, where doing k linear scans would scale to O(n**2)). I assume stuff like std::partial_sort uses an algorithm of this sort to avoid poor scaling when k is an appreciable percentage of n.Beep
On second thought, @EdwardJezisek suggestion to quickselect for k, then sort only the values on one side of the resulting partition, which achieves O(max(n, k log k)) performance, is probably what partial_sort really uses. Oops.Beep
Thanks for the suggestions? Could these methods be done with sorting an array of structs?Whipcord
Quick select can. I assume that partial sort would take in a comparator which would do what you need. But, if you only need the top 3 elements it will likely be quicker to use the method proposed by Nikhil as the big O constant will be lower (3) versus I think about 11 or so for quickselect(This is likely off, do not quote me) Per wikipedia: Finer computations of the average time complexity yield a worst case of 3.4n+o(n) which is slightly larger than 3n.Madra
Is partial_sort available in C?Madra
P
3

Simply maintain first, second, third.

   first =  array[0];
   second = array[1];
   third = array[2];

   /* scratch sort for three elements */
   if(first < second)
     swap(first, second);
  if(first < third)
     swap(first, third);
  if(second < third)
     swap(second, third);

  /* now go through, bubbling up if we have a hit */ 
  for(i=3;i<N;i++)
  {
      if(third < array[i])
      {
         third = array[i];
         if(second < third)
         {
            swap(second, third);
            if(first < second)
              swap(first, second);
         }
      }
  }     

I wouldn't try to scale up to k = four. I think three is about the limit for hardcoding it. As k get large you need to move to a formal method.

This doesn't answer the question you actually asked, which was how to partially sort, but it seems to be what you want.

If you wish to partially sort, you can use quicksort, and simply return early when the pivot goes above the bound you are interested it. So our first pivot divides into five, two. Ignore the last two, and only actually do the sub-sorts of the last five. But whilst it will be faster than quicksort, it won't be a game changer. If you can get a conservative upper bound on the k'th item (eg it's always going to be at most 25% between the minimum and the mean) you can quickly eliminate most of the data. If you get it wrong it's just another pass or two.

Using the quicksort method

  int sortfirstk_r(int *array, int N, int k)
  {
     int pivot = 0;
     int j = n -1;
     int i = 1;

     while(i <= j)
     {
        if(array[pivot] < array[i])
          swap(array[i], array[j--])
        else
          i++;

     }
     sortfirstk_r(array, i, k < i ? k : i);
     if(i < k)
       sortfirstk_r(array +i, N -i, k - i); 

  }

(Untested and there might be bugs in the slightly tricky sort logic).

However we've naively used the first element as the pivot. If we're sorting a large data set, and it has a normal distribution, and we want the top 1%, the z-score is 2.326. Take a bit more to allow us some sampling error, and we make a first pass with a pivot set at say 2.3 standard deviations above the mean. Then we split the distribution into two sets, the top 1% plus a bit, and the rest. We don't need to further process the rest, and just sort the top group.

Paez answered 15/10, 2016 at 19:4 Comment(0)
M
3

For your specific problem the quickest method is to do something similar to below since you only want three elements: (It may be quicker to use a priority queue or a different data structure, but the speed will not be very noticeable)

#include"stdio.h"
void moveThreeMaxToFront(double * arr, int length);
void moveMaxToFront(double*arr, int length);
int main() {
  int i;
  double meh[]={ 5,3,1,7,2,9,11};
  moveThreeMaxToFront(meh, 7);
  for(i=0; i<7; i++)
    printf("%f \n", meh[i]);
}
void moveThreeMaxToFront(double * arr, int length) {
  for(int i=0; i<3; i++)
    moveMaxToFront(arr++, length-i);
}
void moveMaxToFront(double* arr, int length) {
  int i;
  for(i=1; i<length; i++) {
    if(arr[i]>arr[0]) {
      double tmp=arr[i];
      arr[i]=arr[0];
      arr[0]=tmp;
    }
  }
}

However, it is potentially faster if k becomes substantially larger to either implement Quickselect or use the partial_sort method which I believe implements quickselect. However, the quickselect algorithm for the given case has an average constant of approximately 3.4-4.4 which is slightly larger than the constant above(3). Please also note that quickselect has an average run time of O(n). This run time can be guaranteed using median of 3, but this is not advised as it significantly increases the average constant. Intro-select properly handles this to prevent the worst case of quickselect while retaining its average case.

Madra answered 15/10, 2016 at 18:19 Comment(0)
P
3

Simply maintain first, second, third.

   first =  array[0];
   second = array[1];
   third = array[2];

   /* scratch sort for three elements */
   if(first < second)
     swap(first, second);
  if(first < third)
     swap(first, third);
  if(second < third)
     swap(second, third);

  /* now go through, bubbling up if we have a hit */ 
  for(i=3;i<N;i++)
  {
      if(third < array[i])
      {
         third = array[i];
         if(second < third)
         {
            swap(second, third);
            if(first < second)
              swap(first, second);
         }
      }
  }     

I wouldn't try to scale up to k = four. I think three is about the limit for hardcoding it. As k get large you need to move to a formal method.

This doesn't answer the question you actually asked, which was how to partially sort, but it seems to be what you want.

If you wish to partially sort, you can use quicksort, and simply return early when the pivot goes above the bound you are interested it. So our first pivot divides into five, two. Ignore the last two, and only actually do the sub-sorts of the last five. But whilst it will be faster than quicksort, it won't be a game changer. If you can get a conservative upper bound on the k'th item (eg it's always going to be at most 25% between the minimum and the mean) you can quickly eliminate most of the data. If you get it wrong it's just another pass or two.

Using the quicksort method

  int sortfirstk_r(int *array, int N, int k)
  {
     int pivot = 0;
     int j = n -1;
     int i = 1;

     while(i <= j)
     {
        if(array[pivot] < array[i])
          swap(array[i], array[j--])
        else
          i++;

     }
     sortfirstk_r(array, i, k < i ? k : i);
     if(i < k)
       sortfirstk_r(array +i, N -i, k - i); 

  }

(Untested and there might be bugs in the slightly tricky sort logic).

However we've naively used the first element as the pivot. If we're sorting a large data set, and it has a normal distribution, and we want the top 1%, the z-score is 2.326. Take a bit more to allow us some sampling error, and we make a first pass with a pivot set at say 2.3 standard deviations above the mean. Then we split the distribution into two sets, the top 1% plus a bit, and the rest. We don't need to further process the rest, and just sort the top group.

Paez answered 15/10, 2016 at 19:4 Comment(0)
H
0

If we are supposed to find out the three largest number then we can run findMax method three times and once a maximum is found replace appropriate index (1, 2 or 3) with maximum in array. This way we leave you with array will 3 largest elements at start of array in c * O(n) time-complexity.

Note: I used fact that you have to find first three maximum doubles

double findMax(double arr[i], double prevMax){
    double maximum = -100000000000;
    for(int i = 0; i < arr.length; i++){
        if(arr[i] < prevMax)
        maximum = max(arr[i], maximum);
    }
    return maximum;
 }
Hamish answered 15/10, 2016 at 17:23 Comment(10)
With an array such as {-10, -9, -8, -7}, replacing with -1 would not produce the correct result. We don't know if we can assume that numbers are all positive.Arnold
There is no Math.max in C. Also: use isgreater() or one of the other functions offered in >=C99 for comparing floating point values.Wire
@NikhilPathania max() does not work for all floating point values, its name clashes with the variable max in findMax and float.h has the necessary macros to find the smallest value a double can represent, please use it.Wire
Why is double[] the returned type if you are returning just a number?Adamantine
Some errors might be there, just a basic idea of implementationHamish
Your fix is "-100000000000 should be enough", but there are other ways to solve the problem correctly for all cases.Arnold
@NikhilPathania Is it so much work to include float.h and set the start-value of maximum to -DBL_MAX? Also mmax() is still wrong. You can use the function isgreater() (assumes >=c99) for comparing two floats or check for NaN and the other exceptions manually.Wire
I dont know much about C, i generally code in java, you can provide me with edits @WireHamish
@NikhilPathania if you don't know how to code in C than why do put C-code in your answer?Wire
Answer was to suggest an idea @WireHamish
S
0

I would suggest radix sort it is the most efficient sorting method for such cases and has complexity O(n). You could even change it alittle bit to stop when find three max numbers. You can find-understand radix short: https://www.cs.usfca.edu/~galles/visualization/RadixSort.html

Suffolk answered 15/10, 2016 at 17:35 Comment(4)
There is one problem that in radix-sort your space-complexity comes into play and therefore it will work for values less than 10^7Hamish
Why 10^7 ,you say that we can't have arrays bigger than 10^7??Suffolk
Maximum value of any number in array can be 10^7 to do radix sortHamish
I may agree with the a0^7 for any number though I've never read that anywhere, but then your previous posts don't make sense because n is the length of array and not of any element.Suffolk

© 2022 - 2024 — McMap. All rights reserved.