Faster than binary search for ordered list

Is there an algorithm that is faster than binary search for searching in a sorted array of values?

In my case, I have sorted values (they could be of any type) in an array A, and I need to return n if the value I am looking for is in the range of A[n] and A[n+1].

Basion answered 30/10, 2010 at 4:33 Comment(3)
If you have a quantum computer you can try en.wikipedia.org/wiki/Grover%27s_algorithm :)Gravy
@David: The list is sorted though, so Grover's algorithm is worse than bisection search. O(sqrt N) > O(lg N)Toole
a state machine worked an order of magnitude faster for me on large data, but the complexity/memory needed to build the states is much larger than sorting.Frore

You can do better than O(log n) if the values are integers, in which case the best worst-case running time you can achieve, in terms of n, is O(sqrt(log n)). Otherwise, there is no way to beat O(log n) unless there are patterns in the input sequence. There are two approaches used to beat O(log n) in the case of integers.

First, you can use y-fast trees, which work by storing in a hash table all prefixes for which you are storing at least one integer with that prefix. This enables you to perform a binary search to find the length of the longest matching prefix, which in turn lets you find the successor of the element you are searching for in time O(log w), where w is the number of bits in a word. There are some details to work through to make this work and use only linear space, but they aren't too bad (see the link below).

Second, you can use fusion trees, which use bit tricks to enable you to perform w^O(1) comparisons in just a constant number of instructions, yielding a running time of O(log n / log w).

The optimum tradeoff between these two data structures occurs when log w = sqrt(log n), giving a running time of O(sqrt(log n)).

For details on the above, see lectures 12 and 13 of Erik Demaine's course: http://courses.csail.mit.edu/6.851/spring07/lec.html

Witkin answered 30/10, 2010 at 5:23 Comment(2)
I'd like to know more about fusion trees. Maybe you'd be willing to explain them: https://mcmap.net/q/410731/-understanding-fusion-treesHarlequinade
@xcott I'm not sure you're not over-optimizing unless you are writing a professional numerics library.Wallet

What about the following algorithm? It is called exponential search and is one of the variations of binary search: http://en.m.wikipedia.org/wiki/Exponential_search

Searching for element k in sorted array A of size n: look up A[2^i] for i = 0, 1, 2, ... until you go beyond k's position in A, then do a binary search on the part of the array to the left of (smaller than) that index.

// binary_search(A, key, lo, hi) is assumed to search the half-open range
// [lo, hi); a sketch of it is given below
int binary_search(int A[], int key, int lo, int hi);

int exponential_search(int A[], int n, int key)
{
  // lower and upper bound for binary search
  int lower_bound = 0;
  int upper_bound = 1;

  // double the upper bound until it passes the key or the end of the array
  while (upper_bound < n && A[upper_bound] < key) {
    lower_bound = upper_bound;
    upper_bound = upper_bound * 2;
  }
  if (upper_bound >= n)
    upper_bound = n - 1;

  return binary_search(A, key, lower_bound, upper_bound + 1);
}
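
The binary_search helper is only assumed by the answer, not shown; a minimal sketch, searching the half-open range [lo, hi) and returning the key's index or -1 if it is absent:

int binary_search(int A[], int key, int lo, int hi)
{
  while (lo < hi) {
    int mid = lo + (hi - lo) / 2;  // written this way to avoid overflow
    if (A[mid] == key)
      return mid;
    else if (A[mid] < key)
      lo = mid + 1;
    else
      hi = mid;
  }
  return -1;  // key not present in A[lo..hi)
}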

This algorithm runs in O(log idx), where idx is the index of k in A (both steps are O(log idx)). In the worst case, when k is among the largest elements of A or larger than every element, it is O(log n). The multiplicative constant is larger than for binary search, but the algorithm runs faster for very large arrays and when looking for data that's towards the beginning of the array.

I'd like to have some idea of the minimal size n at which this algorithm becomes preferable to binary search, but I don't know it.

Alysa answered 4/9, 2013 at 14:56 Comment(2)
Note that the multiplication here can be replaced with a simple binary shift; it's really cheap.Terri
The compiler is likely to make that optimization for you.Lemon

One possibility is to treat it like finding the roots of a function. Basically, for a sought value x, finding the index i such that:

a[i] <= x <= a[i + 1]

is equivalent to:

a[i] - x <= 0 <= a[i + 1] - x

i.e. finding where the function f(i) = a[i] - x crosses zero.

Then you could try something like Newton's method and so on. These kinds of algorithms frequently converge faster than a binary search when they work, but I don't know of one that is guaranteed to converge for all input.

http://en.wikipedia.org/wiki/Root-finding_algorithm

Harlequinade answered 30/10, 2010 at 4:53 Comment(5)
Newton's method requires a differentiable function, so one would have to fit an interpolating spline first. If the values are unimodal it's quite well behaved; otherwise it might diverge and act totally bizarre.Soneson
Yes. You can use a linear spline, and the derivative at any point is: f'(i) = a[i+1] - a[i]Harlequinade
Linear splines are piecewise linear, so the derivative won't be continuous. One has to go for quadratic at least, which is no biggie. This will turn out to be similar to [en.wikipedia.org/wiki/Interpolation_search]Soneson
I don't believe the derivative needs to be continuous in Newton's method. Thanks for the link on interpolation search.Harlequinade
Just to clarify: by "this" I meant your suggestion of using linear interpolation.Soneson

Yes and no. Yes there are searches that are faster, on average, than a bisection search. But I believe that they are still O(lg N), just with a lower constant.

You want to minimize the time taken to find your element. Generally it is desirable to use fewer steps, and one way to approach this is to maximize the expected number of elements that will be eliminated at each step. With bisection, always exactly half the elements are eliminated. You can do better than this, IF you know something about the distribution of the elements. But, the algorithm for choosing the partition element is generally more complicated than choosing the midpoint, and this extra complexity may overwhelm any time savings you expected to get from using fewer steps.

Really, in a problem like this it's better to attack second-order effects, like cache locality, than the search algorithm itself. For example, when doing a repeated binary search, the same few elements (first, second, and third quartiles) are used VERY frequently, so putting them in a single cache line could be far superior to random access into the list.

Dividing each level into say 4 or 8 equal sections (instead of 2) and doing a linear search through those could also be quicker than the bisection search, because a linear search doesn't require calculating the partition and also has fewer data dependencies that can cause cache stalls.
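
A minimal sketch of that idea, assuming int values and a 4-way split (neither is specified in the answer): probe the three interior boundaries with a short straight-line scan, then finish linearly.

int four_way_search(const int A[], int n, int key)
{
  int lo = 0, hi = n;  // current half-open range [lo, hi)
  while (hi - lo > 4) {
    int step = (hi - lo) / 4;
    int b1 = lo + step, b2 = lo + 2 * step, b3 = lo + 3 * step;
    // a straight-line scan of the three interior boundaries
    // picks the quarter that must contain the key
    if (key >= A[b3])      { lo = b3; }
    else if (key >= A[b2]) { lo = b2; hi = b3; }
    else if (key >= A[b1]) { lo = b1; hi = b2; }
    else                   { hi = b1; }
  }
  for (int i = lo; i < hi; i++)  // short linear tail
    if (A[i] == key)
      return i;
  return -1;  // not found
}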

But all of these are still O(lg N).

Toole answered 30/10, 2010 at 4:46 Comment(3)
On a single ordered list, no. But there are much faster searches; you just need a different data structure than an ordered list. A hash would be virtually constant in lookup time, at a cost of a great deal more memory. A hybrid approach could work the way a dictionary does.Theriot
@tchrist: The problem requires finding the pair of elements that tightly bound a sought entry that is not in the list at all. Hashing only finds exact matches.Toole
Whoops, you're right. Somehow I only read the first sentence, not the second one.Theriot

If the values in the list are evenly distributed then you could try a weighted split instead of a binary split, e.g. if the desired value is a third of the way from the current lower limit to the current upper limit, then you could try the element that is also a third of the way along. This could suffer badly on lists where values are bunched up, though.
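
A minimal sketch of such a weighted split for int keys (this is interpolation search, as a comment below notes; the 64-bit cast and the flat-range guard are implementation details assumed here):

int interpolation_search(const int A[], int n, int key)
{
  int lo = 0, hi = n - 1;
  while (lo <= hi && key >= A[lo] && key <= A[hi]) {
    if (A[hi] == A[lo])  // flat range: avoid dividing by zero
      return (A[lo] == key) ? lo : -1;
    // probe proportionally to where the key lies between A[lo] and A[hi]
    int pos = lo + (int)((long long)(key - A[lo]) * (hi - lo) / (A[hi] - A[lo]));
    if (A[pos] == key)
      return pos;
    if (A[pos] < key)
      lo = pos + 1;
    else
      hi = pos - 1;
  }
  return -1;  // not found
}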

Farce answered 30/10, 2010 at 4:46 Comment(2)
Some more optimization is necessary. You don't want to choose the element closest to where you guess the answer to be, you want to test a point between the guessed location and the center of list, so that with p > .5 you eliminate more than half the list. The exact optimal partition point depends on the distribution of values in the list.Toole
What you describe is exactly interpolation search. @Ben an efficient way to implement your suggestion is through a Huffman treeSoneson

First of all, measure before doing optimization.

Do you really need to optimize that search?

If so, then secondly, think about algorithmic complexity first. E.g., can you use a tree (like std::map, say) instead of an array? If so, then it depends on the relative frequency of insertions/deletions versus searches, but the premise of having a sorted array at hand indicates that searches are frequent compared to data-set changes, so it would make sense to do a little additional work for insertions/deletions, making each search much faster -- namely logarithmic time.

If you find that indeed the search times are a bottleneck that needs addressing, and no change of data representation is possible, and the list is short, then a linear search will generally be faster because it does less work per comparison.

Otherwise, if the list is longer, and no particular distribution of values is known or assumed, and the values can't be treated as numerical, and memory consumption should be constant (ruling out constructing a hash table, say), then binary search produces 1 bit of information per comparison and is probably the best you can do for the first search.

Cheers & hth.

Offal answered 30/10, 2010 at 6:04 Comment(0)

You can always put them in a hash table, then search will be O(1). It will be memory intensive though and if you keep adding items, the hash table might need to be re-bucketed. Re-bucketing is O(n) but it will get amortized to O(1). It essentially depends on whether you can afford that space and the potential cache misses.

Soneson answered 30/10, 2010 at 4:46 Comment(2)
It's possible that his array does not contain the value n, but does contain two values that bracket n. It's not obvious that hashing is applicable here.Harlequinade
Oh I missed that. But you could still hash first and fall back to binary search if the value is not in the key set. But this is an added complexity. In general you cannot do better than the entropy of the distribution of the values. If you knew the distribution you can use a Huffman tree to decide where you partition.Soneson

Although in the general case you cannot do better than O(log N), you can at least optimize that, thus significantly reducing the constant of proportionality in front of O(log N).
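
As one concrete way to cut that constant (a common trick, not the method of the paper cited below): a branchless lower-bound search in which the conditional typically compiles to a conditional move, so the loop contains no hard-to-predict branch.

int branchless_lower_bound(const int A[], int n, int key)
{
  const int *base = A;
  int len = n;
  while (len > 1) {
    int half = len / 2;
    // usually a conditional move, not a branch
    base = (base[half - 1] < key) ? base + half : base;
    len -= half;
  }
  // index of the first element >= key (n if every element is smaller)
  return (int)(base - A) + (len > 0 && *base < key);
}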

If you have to perform multiple searches on the same array, these can be vectorized using SIMD extensions, thus further cutting down on computation cost.

In particular, if you are dealing with arrays of floating-point numbers which satisfy certain properties, then there are ways to construct a special index which allows you to search the array in O(1).

All of the above aspects are discussed, with test results, in: Cannizzo, 2015, Fast and Vectorizable Alternative to Binary Search in O(1) Applicable to a Wide Domain of Sorted Arrays of Floating Point Numbers. The paper comes with source code on GitHub.

Pi answered 20/5, 2017 at 2:24 Comment(0)

It's been mentioned in misc comments, but I would think a natural & simple answer to this particular question ("any-type values in an array") would be an interpolation search:

Instead of calculating the midpoint, interpolation search estimates the position of the target value, taking into account the lowest and highest elements in the array as well as length of the array. It works on the basis that the midpoint is not the best guess in many cases. For example, if the target value is close to the highest element in the array, it is likely to be located near the end of the array.

Quote from: https://en.wikipedia.org/wiki/Binary_search_algorithm

Main page: https://en.wikipedia.org/wiki/Interpolation_search

Under the assumption of a uniform distribution, it can approach O(log log N).

Since CPUs are so fast compared to memory access these days (program for RAM like you once did for disk), the index/comparison calculations are likely cheap compared to each data fetch. It might also be possible to eke out a little more performance with a linear search once the search is sufficiently narrowed (exploiting memory/cache locality).
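
A minimal sketch of that binary-then-linear hybrid; the cutoff of 16 elements is an arbitrary guess and should be tuned by measurement:

int hybrid_search(const int A[], int n, int key)
{
  int lo = 0, hi = n;  // half-open range [lo, hi)
  while (hi - lo > 16) {  // binary phase
    int mid = lo + (hi - lo) / 2;
    if (A[mid] <= key)
      lo = mid;
    else
      hi = mid;
  }
  for (int i = lo; i < hi; i++)  // cache-friendly linear finish
    if (A[i] == key)
      return i;
  return -1;
}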

Streeter answered 2/11, 2020 at 12:23 Comment(0)

In binary search you split the list into two "sublists" and you only search the sublist that may contain the value. Depending on how large your array is, you could see a speedup if you split the array into more than two slices.

You can determine which region of the array you have to search by keeping an index that you search first, like in a telephone book of a large city, where you can see from the outside where you have to start searching. (I have trouble expressing my idea in text, and I have not yet found an English link that explains it better.)

Douty answered 30/10, 2010 at 4:57 Comment(0)

If you have a huge amount of numbers to find, and by some fluke they are ALSO sorted, you could do it in O(n + m) where m is the number of numbers to find. Basically just your typical merge algorithm, with a slight modification to record, for each number you check, the value it would be inserted before if it were inserted into the array.
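
A minimal sketch of that merge-style pass, assuming int values and both arrays sorted ascending; out[j] receives the index of the first element of A that is >= queries[j], i.e. where queries[j] would be inserted:

void batch_bracket(const int A[], int n, const int queries[], int m, int out[])
{
  int i = 0;  // merge cursor into A; never moves backwards
  for (int j = 0; j < m; j++) {
    while (i < n && A[i] < queries[j])
      i++;
    out[j] = i;  // == n when every element of A is smaller
  }
}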

You can always trade off space... And time of other operations. Assuming all your elements are constant size p bits, you can make a massive array which stores, for each possible value you could look up, the index of the next bigger value currently stored. This array needs to be 2^p*lg(n) bits, where n is the number values stored. Each insertion or deletion is O(2^p) but typically around 2^p/n, because you have to go through updating all those indices.

But your lookup is now O(1)!
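
For concreteness, a toy version of this with p = 8 (a universe of 256 possible values); the names and the rebuild-everything strategy are just illustrative:

#define P_BITS 8
#define UNIVERSE (1 << P_BITS)

// table[v] = index of the first stored element >= v (n if there is none),
// so a lookup is a single array read
void build_table(const int A[], int n, int table[UNIVERSE])
{
  int i = 0;
  for (int v = 0; v < UNIVERSE; v++) {
    while (i < n && A[i] < v)
      i++;
    table[v] = i;
  }
}

int lookup(const int table[UNIVERSE], int key)  // key must be in 0..255
{
  return table[key];
}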

OK, OK, it's not really practical. But dividing the input into blocks in a similar fashion could possibly reduce the constant in front of your log. Possibly.

Gigantean answered 30/10, 2010 at 13:33 Comment(0)

As someone mentioned, you could try an interpolation search. But usually interpolation searches are pretty simple/dumb, with a simple linear approach (which works well if you have an even distribution of values in array A, but very poorly if the distribution is heavily skewed in some way).

The idea is to think of array A as a mathematical function (because it's sorted, a one-to-one function), and then approximate it. Say you have an array A with 100 values, where A[x]=2*x. Now you want to insert 9 into your array, and replace whatever value is closest to it.

With a binary search, you are going to hit A[50]=100, then A[25]=50, then A[12]=24, then A[6]=12, then A[3]=6, then A[4]=8, then finally A[5]=10. Set A[5]=9, and we are done.

With a linear interpolation search, taking the first and last values A[0]=0 and A[99]=198, you can calculate a linear function between the two, f(x)=2*x. The inverse would be g(x)=x/2. So plug in 9: g(9)=4.5, and A[5]=10, which is more than 9, so check the previous value A[4]=8, and you are done. That's only 2 lookups and compares, vs 7 for binary search. With a really large array, you can see that this could significantly cut down on your lookups and compares.

Now, in the real world, you generally aren't going to have an array with a simple linear range of values like that. They are going to be skewed to one side or the other, and you are going to have to do the interpolation search multiple times recursively, or switch to a linear or binary search after the first or second interpolation, or something like that.

If you know your values in array A are heavily skewed (for instance, you have an array A of 100 values, where the first 90 values are 1 and the last 10 values are in the range 1 to 10), then you know interpolation is probably the wrong way to go. Binary search is going to get you there in about the same time, or faster.

You could get fancier and try to build some other array B which approximates the inverse function, and then seek into that, or even do some statistical analysis to create some mathematical function that approximates the inverse, but that's beyond the scope of this answer.

Applaud answered 13/10, 2021 at 19:40 Comment(0)

Yes, you can do better (www.agdresearch.com), in fact up to 8 times faster than binary search's O(log n). The trick is to split the keys/entries into significant ASCII characters interspaced with junk DNA. The same trick works for quicksort and B-trees.

The algorithm is called NoChop (it doesn't need to chop into binary or higher partitions; the above approach provides up to 256 branches per node). The data structure that enables the algorithm is called STree (a sparse tree, after the central sparse matrix that holds the keys).

Conferva answered 5/4, 2022 at 9:20 Comment(1)
Thanks for your contributions. A few things to note: (1) Stack Overflow is often referred to by many users, so do include all referenced documents/pages in your answers. (2) Prefer not to share your personal information (email) in answers. You don't want to attract spam and the like, right? Thank you once again & good day to you :)Caldron
