Is Java 7 using Tim Sort for the Method Arrays.Sort?
Asked Answered
T

3

55

I can't find the documentation of Java 7, I can only find about the Java 6, which is still quick or merge. Does anyone know how to find the documentation of the method Arrays.sort in Java 7?

Thatch answered 25/10, 2010 at 19:57 Comment(2)
they are better then Quick and MergeThatch
Possible duplicate of What is the sorting algorithm for JavaWhitewing
L
90

Java 7 uses Dual-Pivot Quicksort for primitives and TimSort for objects.

According to the Java 7 API doc for primitives:

Implementation note: The sorting algorithm is a Dual-Pivot Quicksort by Vladimir Yaroslavskiy, Jon Bentley, and Joshua Bloch. This algorithm offers O(n log(n)) performance on many data sets that cause other quicksorts to degrade to quadratic performance, and is typically faster than traditional (one-pivot) Quicksort implementations.

According to the Java 7 API doc for objects:

The implementation was adapted from Tim Peters's list sort for Python ( TimSort). It uses techiques from Peter McIlroy's "Optimistic Sorting and Information Theoretic Complexity", in Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp 467-474, January 1993.

Timsort is a hybrid "merge sort and insertion sort."

Not sure if this is much different from what it was in Java 6, for Arrays.sort JDK6:

a tuned quicksort, adapted from Jon L. Bentley and M. Douglas McIlroy's "Engineering a Sort Function", Software-Practice and Experience, Vol. 23(11) P. 1249-1265 (November 1993)

For Object[] or collections (Collections.sort()) merge sort is used.

Leighleigha answered 25/10, 2010 at 20:1 Comment(11)
ah great. Thanks. For primitives, it use dual pivot quick sort, and for object, it uses Tim SortThatch
Any idea why the distinction is made between timsort for objects and dual pivot quicksort for primitives? Memory considerations maybe?Forcier
@mR_fr0g: see my answer in this thread: #3707690Leighleigha
Would also like to mention that timsort was also designed for lists in which comparisons were expensive. This is another great reason to use it for objects.Poachy
Bentley and McIlroy (1993) describes a highly optimized classic (single-pivot) quicksort. The dual-pivot version is new; it performs a strictly lower expected number of comparisons. IIRC it splits the input in three and performs three recursive calls each time.Pavla
Java 6 uses merge sort.Contraindicate
@michael-borgwardt: What about Java 8?Spree
@Przemek: according to the documentation, it's the sameLeighleigha
Side note: Guava claims to have a better performance than Timsort used in Collections#sort. From com.google.common.collect.Ordering#sortedCopy, "According to our benchmarking on Open JDK 7, immutableSortedCopy(java.lang.Iterable<E>) generally performs better (in both time and space) than this method, and this method in turn generally performs better than copying the list and calling Collections.sort(List)."Boater
Confusingly, the class that implements the primitive sort is called DualPivotQuicksort - but it actually implements at least 4 different sorting algorithms (only one of which is quicksort) and uses heuristics to choose between them. For example, byte arrays are never sorted using quicksort, and "structured" arrays (which already have a large degree of sortedness) are sorted using a mergesort (which has a lot in common with timsort).Armor
Timsort is apparently quite different: reddit.com/r/programming/comments/9jj5z/…Whitewing
A
29

Yes! ... and also no.

Summary

At the time of writing (2016), in current Open JDK 0 implementations Tim Sort is generally used for sorting arrays of objects (i.e., Arrays.sort(Object[]) and friends) - but for primitive arrays (the remainder of the Arrays.sort methods) a variety of other methods are used.

For primitives, the heuristics choose among a variety of sorting methods such as quicksort, merge sort, counting sort3. depending on the data being sorted. Most of these decisions are simply made up-front based on the type and size of the array being sorted, but for int and long elements the decision is actually adaptive based on the measured sortedness of the array. So you have adaptation/introspection (the heuristics to pick an algorithm) on top of adaptation/introspection (TimSort or similar merge sort) in many cases!

Details

Tim Sort is used for most sorts of objects, such as Arrays.sort(Object[] a), unless the user has specifically requested the legacy behavior by setting system property java.util.Arrays.useLegacyMergeSort to true.

For primitives, the situation is more complex. At least as of JDK 8 (version 1.8.0_111) a variety of heurstics are used depending on the size of the arrays being sorted, the primitive type and the measured "sortedness" of the array. Here's an overview:

  • For all primitive types other than bytes1, arrays of less than 47 elements are simply sorted using insertion sort (see DualPivotQuicksort.INSERTION_SORT_THRESHOLD). This threshold is also used when sorting sub-arrays that arise when merge or quicksort are used and the size of the subarray falls below the threshold. So some form of insertion sort will be used in all sorts, and for small arrays it is the only algorithm used.
  • For primitive types byte, short and char, a counting sort is used for largish arrays. This is a simple sort that takes O(n + range) time, where range is the total number of byte (256) or short/char (65536) values. The sort involves allocating an underlying array of range values, so it is only used when the number of elements to sort is a significant fraction of the total range. In particular, it is used for byte arrays greater than 29 elements (i.e. ~11% of the range), and short/char arrays greater than 3200 elements (~5% of the range).
  • For byte arrays, one of the two approaches above is always used.
  • For int and long arrays above the insertion sort threshold and for short/char arrays both above the insertion sort threshold and below the counting sort threshold, one of two algorithms may be used: dual pivot quicksort, or merge sort. Which one is used depends on a measure of the sortedness of the array: the input is divided into runs of ascending or descending elements. If the number of such runs is greater than 66, then the array is considered mostly unsorted and is sorted with dual-pivot quicksort. Otherwise, the array is considered mostly sorted, and mergesort is used (using the already enumerated runs as a starting point).

The idea of finding runs and then using mergesort to sort them is in fact very similar to TimSort, although there are some differences. So at least for some parameters, the JDK is using a run-aware mergesort, but for many other combinations of parameters it is using a different algorithm, and at least 5 distinct algorithms are used in total!

Rationale

The reasoning behind the different sort behavior of Object[] versus primitive is probably at least two-fold:

1) Sorts of Object[] are required to be stable: objects which sort equally will appear in the same order as the input. For primitive arrays, no such concept exists: primitives are fully defined by their value, so there is no distinction between a stable and an unstable sort. This allows primitive sorts to dispense with the need for stable algorithms in favor of speed.

2) Sorts of Object[] need to involve the Object.compare() method, which may be arbitrarily complex and expensive. Even if the compare() method is simple, there will generally be method call overhead unless the entire sort method can be inlined2. So sorts of Object[] will generally be biased towards minimizing total comparisons, even at the cost of some additional algorithmic complexity.

Sorts of primitive arrays, on the other hand, just directly compare primitive values which typically takes on the order of a cycle or two. In this case, the algorithm should be optimized considering both the cost of comparisons and the surrounding algorithm, since the are likely to be of the same magnitude.


0 At least for versions between Java 7 and Java 9, and of course this also includes Oracle's JDK as it is based on Open JDK. It is likely that other implementations use a similar approach, but I haven't checked.

1 For byte arrays, the insertion sort threshold is effectively 29 elements since that's the lower cutoff above which counting sort is used.

2 This seems unlikely, since it is quite large.

3 Counting sort is only used for values with relatively limited range of 16-bits or less: byte, short or char.

Armor answered 13/12, 2016 at 19:39 Comment(3)
I don't see immediately where Timsort might end up being used for primitive arrays in the source...?Whitewing
@Whitewing - you are right, I either got this wrong or looked at a different version of the source compared to what I'm looking at now. In JDK8 at least, the primitive sort delegate to DualPivotQuicksort, which use a variety of strategies: usually plain merge sort for "structured" (i.e., long runs) data, or quicksort for unstructured data, and other strategies like counting sort for byte values. I will update the answer.Armor
@Whitewing - I updated the question. It seems like at some point (in the "details" section, I already understood that only a merge sort was used for primitives, not a true timsort, but that the algorithms were actually somewhat similar: both being "run aware merge sorts". Anyways, the intro should be more accurate now.Armor
C
22

Yes, Java 7 will use Timsort for Arrays.sort. Here is the commit: http://hg.openjdk.java.net/jdk7/jdk7/jdk/rev/bfd7abda8f79

Cant answered 25/10, 2010 at 20:51 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.