An efficient sorting algorithm for an almost-sorted list containing time data?

The name says it all really. I suspect that insertion sort is best, since it's the best sort for mostly-sorted data in general. However, since I know more about the data, there is a chance other sorts are worth looking at. The other relevant pieces of information are:

1) This is time data, which means I presumably could create an effective hash for ordering the data.

2) The data won't all exist at one time. Instead I'll be reading in records which may contain a single vector, or dozens or hundreds of vectors. I want to output all times within a 5-second window, so it's possible that a sort that does the sorting as I insert the data would be a better option.

3) Memory is not a big issue, but CPU speed is, as this may be a bottleneck of the system.

Given these conditions, can anyone suggest an algorithm that may be worth considering in addition to insertion sort? Also, how does one define 'mostly sorted' when deciding what is a good sort option? What I mean is: how do I look at my data and decide 'this isn't as sorted as I thought it was, maybe insertion sort is no longer the best option'? Any link to an article that analyzes sort complexity relative to the degree to which the data is sorted would be appreciated.

Thanks

Edit: thank you everyone for your information. I will be going with an easy insertion or merge sort (whichever I have already pre-written) for now. However, I'll be trying some of the other methods once we're closer to the optimization phase (since they take more effort to implement). I appreciate the help.

Sparkle answered 13/6, 2012 at 14:5 Comment(4)
I suppose that you're looking for a sorting algorithm?Fang
Like you said....insertion sort. sorting-algorithms.com/nearly-sorted-initial-orderSenter
What are the range and granularity of your time data?Thorman
Range and granularity vary. We have to read in from multiple sources, and the range, granularity, and even level of 'sortedness' can vary depending on the source.Sparkle

You could adopt option (2) you suggested - sort the data while you insert elements.

Use a skip list, sorted by time in ascending order, to maintain your data.

  • Once a new entry arrives, check if it is larger than the last element (easy and quick). If it is, simply append it (easy to do in a skip list); the skip list will need to add 2 nodes on average, so this case is O(1) on average.
  • If the element is not larger than the last element, add it to the skip list as a standard insert op, which will be O(log n).

This approach will yield an O(n + k log n) algorithm, where k is the number of elements inserted out of order.
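A minimal sketch of this insert-as-you-go idea, assuming C++ and using std::multiset as a stand-in for the skip list (the names here are illustrative, not from the answer). A hinted insert at end() is amortized O(1) when the element belongs at the end, mirroring the cheap-append case above:

```cpp
#include <cstdio>
#include <set>

int main() {
    std::multiset<double> times;  // sorted container of timestamps
    const double incoming[] = {10.0, 11.5, 13.0, 12.2, 14.8};

    for (double t : incoming) {
        if (times.empty() || t >= *times.rbegin()) {
            times.insert(times.end(), t);  // in order: amortized O(1) append
        } else {
            times.insert(t);               // out of order: O(log n) insert
        }
    }

    for (double t : times) std::printf("%.1f ", t);  // 10.0 11.5 12.2 13.0 14.8
    std::printf("\n");
}
```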

Gaynor answered 13/6, 2012 at 14:17 Comment(3)
You could also do this with a balanced BST as long as you track the maximum element. I think that the BST approach would likely be better from a memory perspective, especially if you used something like a splay tree or scapegoat tree with exactly two pointers per node.Totemism
@templatetypedef: Though I believe it can be done, I find the skip list much more intuitive than a BST. If the BST is not self-balancing, it is likely to decay into a tree with big height for the described input, and searching for elements that came unordered will be expensive. On the other hand, re-balancing the tree after you add a new maximum is less intuitive than appending an element to a skip list, in my opinion at least.Gaynor
@Gaynor Instead of using a data structure to sort the out-of-place items alongside the sorted items, you can sort them separately and then merge them in later. See my answer for more details. The result is an O(n + k lg k) algorithm.Thievish

I would throw in merge sort: if you implement the natural (run-detecting) version you get a best case of O(N), with a typical and worst case of O(N log N) if you have any problems. With insertion sort you get a worst case of O(N^2) and a best case of O(N).
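For illustration, a minimal sketch of a natural merge sort, assuming C++ (std::inplace_merge does the merging; the function name is mine, not from the answer):

```cpp
#include <algorithm>
#include <vector>

// Natural merge sort: find the ascending runs already present in the input
// and merge adjacent runs until one run remains. Fully sorted input is a
// single run, giving the O(N) best case.
void natural_merge_sort(std::vector<double>& a) {
    bool merged = true;
    while (merged) {
        merged = false;
        std::size_t start = 0;
        while (start < a.size()) {
            std::size_t mid = start + 1;  // find end of the first run
            while (mid < a.size() && a[mid - 1] <= a[mid]) ++mid;
            if (mid == a.size()) break;   // only one run left this pass
            std::size_t end = mid + 1;    // find end of the second run
            while (end < a.size() && a[end - 1] <= a[end]) ++end;
            std::inplace_merge(a.begin() + start, a.begin() + mid,
                               a.begin() + end);  // merge the adjacent runs
            merged = true;
            start = end;
        }
    }
}
```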

Oral answered 13/6, 2012 at 14:12 Comment(0)

You can sort a list of size n with k elements out of place in O(n + k lg k) time.

See: http://www.quora.com/How-can-I-quickly-sort-an-array-of-elements-that-is-already-sorted-except-for-a-small-number-of-elements-say-up-to-1-4-of-the-total-whose-positions-are-known/answer/Mark-Gordon-6?share=1

The basic idea is this:

  • Iterate over the elements of the array, building an increasing subsequence: if the current element is greater than or equal to the last element of the subsequence, append it to the end of the subsequence; otherwise, discard both the current element and the last element of the subsequence. This takes O(n) time.
  • You will have discarded no more than 2k elements since k elements are out of place.
  • Sort the 2k elements that were discarded using an O(k lg k) sorting algorithm like merge sort or heapsort.
  • You now have two sorted lists. Merge the lists in O(n) time like you would in the merge step of merge sort.

Overall time complexity = O(n + k lg k)

Overall space complexity = O(n)

(this can be modified to run in O(1) space if you can merge in O(1) space, but it's by no means trivial)
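A minimal sketch of this approach, assuming C++ (the function and variable names are illustrative):

```cpp
#include <algorithm>
#include <vector>

std::vector<double> sort_nearly_sorted(const std::vector<double>& a) {
    std::vector<double> inc;        // the increasing subsequence
    std::vector<double> discarded;  // at most ~2k out-of-place elements

    for (double x : a) {
        if (inc.empty() || x >= inc.back()) {
            inc.push_back(x);                 // extends the subsequence
        } else {
            discarded.push_back(x);           // discard the current element...
            discarded.push_back(inc.back());  // ...and the subsequence's tail
            inc.pop_back();
        }
    }

    std::sort(discarded.begin(), discarded.end());  // the O(k lg k) step

    std::vector<double> out(a.size());
    std::merge(inc.begin(), inc.end(),
               discarded.begin(), discarded.end(), out.begin());  // O(n) merge
    return out;
}
```

Since the discarded list holds no more than about 2k elements, the std::sort call is the O(k lg k) term in the total.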

Thievish answered 31/10, 2014 at 0:27 Comment(0)

I don't fully understand the problem, but Timsort may fit the bill, as you're saying that your data is mostly sorted already.

Thorman answered 13/6, 2012 at 22:1 Comment(0)

There are many adaptive sorting algorithms out there that are specifically designed to sort mostly-sorted data. Ignoring the fact that you're storing dates, you might want to look at smoothsort or Cartesian tree sort as algorithms that can sort reasonably sorted data in worst-case O(n log n) time and best-case O(n) time. Smoothsort also has the advantage of requiring only O(1) space, like insertion sort.

Using the fact that everything is a date and therefore can be converted into an integer, you might want to look at binary quicksort (MSD radix sort) using a median-of-three pivot selection. This algorithm has best-case O(n log n) performance, but has a very low constant factor that makes it pretty competitive. Its worst case is O(n log U), where log U is the number of bits in each date (probably 64), which isn't too bad.
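A minimal sketch of binary quicksort on 64-bit timestamps, assuming C++ (this plain version partitions on one bit at a time and omits the median-of-three refinement; the names are illustrative):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Binary quicksort (MSD radix sort): partition [lo, hi) on the given bit,
// most significant first, then recurse on each half with the next bit.
void binary_quicksort(std::vector<std::uint64_t>& a,
                      std::size_t lo, std::size_t hi, int bit) {
    if (bit < 0 || hi - lo < 2) return;
    const std::uint64_t mask = std::uint64_t{1} << bit;
    std::size_t i = lo, j = hi;
    while (i < j) {
        if (a[i] & mask) std::swap(a[i], a[--j]);  // bit set: move right
        else ++i;                                  // bit clear: stays left
    }
    binary_quicksort(a, lo, j, bit - 1);  // recurse on the 0-bit half
    binary_quicksort(a, j, hi, bit - 1);  // recurse on the 1-bit half
}

// usage: binary_quicksort(times, 0, times.size(), 63);
```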

Hope this helps!

Totemism answered 13/6, 2012 at 16:43 Comment(0)

If your OS or C library provides a mergesort function, it is very likely that it already handles the case where the given data is partially ordered (in either direction), running in O(N) time.

Otherwise, you can just copy the mergesort available from your favorite BSD operating system.
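A minimal usage sketch, assuming a BSD or macOS system where mergesort(3) is declared in stdlib.h (it takes the same arguments as qsort(3), but returns -1 and sets errno on failure):

```cpp
#include <cstdlib>  // on BSD/macOS this declares mergesort(3)

// qsort-style comparator for timestamps stored as doubles.
static int cmp_time(const void* a, const void* b) {
    const double x = *static_cast<const double*>(a);
    const double y = *static_cast<const double*>(b);
    return (x > y) - (x < y);
}

int main() {
    double times[] = {10.0, 11.5, 13.0, 12.2, 14.8};
    mergesort(times, sizeof times / sizeof times[0],
              sizeof times[0], cmp_time);
}
```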

Stefaniestefano answered 13/6, 2012 at 16:50 Comment(0)
