Binary Search: how to determine half of the array
Asked Answered
S

5

6

What's the difference between this two formulas

mid = low + (high - low) / 2;


mid = (high + low) / 2;
Sagacity answered 11/9, 2020 at 12:33 Comment(1)
You can accept one of the answers by clicking on the grey checkmark below its score.Contuse
C
12

In the 2nd version, if high + low is greater than the maximum value of an int (assuming high is an int) then it can overflow, invoking undefined behavior. This particular bug is solved with the 1st version.

There are still issues with the 1st version, e.g. if low is a very large negative number, the difference can still overflow.

From c++20, you should use std::midpoint for this, which handles a whole bunch of corner cases, and does the right thing for all of them.

This seemingly simple function is actually surprisingly difficult to implement, and in fact, there's an hour long talk given by Marshall Clow at cppcon 2019, that covers the implementation of just this function.

Coastwise answered 11/9, 2020 at 12:36 Comment(0)
S
2

The first one is superior (although still not perfect, see Binary Search: how to determine half of the array):

  1. It works in cases where addition is not defined for high and low but is defined for adding an interval to low. Pointers are one such example, an object of a date type can be another.

  2. high + low can overflow the type. For a signed integral type, the behaviour is undefined.

Syriac answered 11/9, 2020 at 12:35 Comment(0)
C
1

Both suffer from potential overflow. Signed integer overflow is undefined behavior (UB).

With unsigned math (often used in array indexing), then when low <= high, low + (high - low) / 2; does not overflow unlike potentially (high + low) / 2.

Same with signed math when low <= high and 0 <= low.

To avoid any overflow with signed math (or unsigned math with low > high) and still use only int/unsigned math, I thought the below would work.

mid = high/2 + low/2 + (high%2 + low%2)/2;

Yet that can fail when the sign of high/2 + low/2 differs from sign of (high%2 + low%2).

A more robust and tested version is below. Perhaps I'll simplify later.

#include <limits.h>
#include <stdio.h>

int midpoint(int a, int b) {
  int avg = a/2 + b/2;
  int small_sum = a%2 + b%2;
  avg += small_sum/2;
  small_sum %= 2;
  if (avg < 0) {
    if (small_sum > 0) avg++;
  } else if (avg > 0) {
    if (small_sum < 0) avg--;
  }
  return avg;
}

int midpoint_test(int a, int b) {
  intmax_t lavg = ((intmax_t)a + (intmax_t)b)/2;
  int avg = midpoint(a,b);
  printf("a:%12d b:%12d avg_wide_math:%12jd avg_midpoint:%12d\n", a,b,lavg,avg);
  return lavg == avg;
}

int main(void) {
  int a[] = {INT_MIN, INT_MIN+1, -100, -99, -2, -1, 0, 1, 2, 99, 100, INT_MAX-1, INT_MAX};
  int n = sizeof a/ sizeof a[0];
  for (int i=0; i<n; i++) {
    for (int j=0; j<n; j++) {
      if (midpoint_test(a[i], a[j]) == 0) {
        puts("Oops");
        return 1;
      }
    }
  }
  puts("Success");
  return 0;
}
Conflagrant answered 11/9, 2020 at 13:54 Comment(4)
If we can assume that mid, low` and high have the same type, high >= low and low >= 0, how can mid = low + (high - low) / 2; cause an overflow? It should be fully defined for both signed and unsigned types. In binary search, these variables are index values, hence >= 0`.Contuse
@Contuse "... assume ... cause an overflow?" ---> It cannot with those assumptions. With ordered unsigned values that is covered in the answer already. Code amended for other common restricted cases. The general case is covered beginning at "To avoid any overflow ....". Yes it is common that high >= low >= 0, yet p[some_negative] is allowed when p as a pointer, points to some place in the middle of an array.Conflagrant
@Contuse IAC, it seem interesting to see what code it took to perform a robust midpoint()-like function as there are various corner cases. Mine differs from std::midpoint(a,b) as it rounds toward 0 rather than towards a. Hmmm, maybe I should call mine average() rather than midpoint()?Conflagrant
An interesting and tricky problem indeed. I found alternative implementations with 2 or 3 tests but fewer and simpler divisions. The division by 2 is usually compiled to efficient shifting code, but truncation toward 0 requires a few extra instructions.Contuse
C
0

The two formulae are different:

  • both may overflow depending on the values of low and high.
  • even when there is no overflow, they do not necessarily produce the same result: the first computes the midpoint and the second computes the average of 2 numbers.

For the rest of the discussion, we shall assume that low, mid and high have the same type. We are looking for a safe way to find the midpoint or average between low and high, which is always in the range of the type.

The first formula, mid = low + (high - low) / 2; rounds toward low if the type is signed and may overflow if the type is signed and high and low are too far appart.

The second formula, mid = (high + low) / 2; rounds toward 0, but may overflow for large values of high and/or low for both signed and unsigned types.

In your particular application, computing the index of the middle element of a sorted array to perform binary search, the index values low and high are non-negative and low <= high. With this constraint, both formulas compute the same result, but the second can overflow whereas the first cannot.

Hence for your case, you should use mid = low + (high - low) / 2; as a safe replacement for mid = (high + low) / 2;.

In the general case, computing the average (second formula) without overflow is a tricky problem. Below is a set of solutions for the average formula, along with a test program inspired from chux' answer. They can be adapted for any signed integer type:

#include <limits.h>
#include <stdio.h>
#include <stdint.h>

int average_chqrlie(int a, int b) {
    if (a <= b) {
        if (a >= 0)
            return a + ((b - a) >> 1);
        if (b < 0)
            return b - ((b - a) >> 1);
    } else {
        if (b >= 0)
            return b + ((a - b) >> 1);
        if (a < 0)
            return a - ((a - b) >> 1);
    }
    return (a + b) / 2;
}

int average_chqrlie2(int a, int b) {
    if (a > b) {
        int tmp = a;
        a = b;
        b = tmp;
    }
    if (a >= 0)
        return a + ((b - a) >> 1);
    if (b < 0)
        return b - ((b - a) >> 1);
    return (a + b) / 2;
}

int average_chqrlie3(int a, int b) {
    int half, mid;
    if (a < b) {
        half = (int)(((unsigned)b - (unsigned)a) / 2);
        mid = a + half;
        if (mid < 0)
            mid = b - half;
    } else {
        half = (int)(((unsigned)a - (unsigned)b) / 2);
        mid = b + half;
        if (mid < 0)
            mid = a - half;
    }
    return mid;
}

int average_chux(int a, int b) {
    int avg = a / 2 + b / 2;
    int small_sum = a % 2 + b % 2;
    avg += small_sum / 2;
    small_sum %= 2;
    if (avg < 0) {
        if (small_sum > 0)
            avg++;
    } else if (avg > 0) {
        if (small_sum < 0)
            avg--;
    }
    return avg;
}

int run_tests(const char *name, int (*fun)(int a, int b)) {
    int array[] = { INT_MIN, INT_MIN+1, -100, -99, -2, -1, 0, 1, 2, 99, 100, INT_MAX-1, INT_MAX };
    int n = sizeof(array) / sizeof(array[0]);
    int status = 0;
    printf("Testing %s:", name);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int a = array[i], b = array[j];
            intmax_t lavg = ((intmax_t)a + (intmax_t)b) / 2;  // assuming sizeof(intmax_t) > size(int)
            int avg = fun(a, b);
            if (lavg != avg) {
                printf("\na:%12d  b:%12d  average_wide:%12jd  average:%12d", a, b, lavg, avg);
                status = 1;
            }
        }
    }
    puts(status ? "\nFailed" : " Success");
    return status;
}

int main() {
    run_tests("average_chqrlie", average_chqrlie);
    run_tests("average_chqrlie2", average_chqrlie2);
    run_tests("average_chqrlie3", average_chqrlie3);
    run_tests("average_chux", average_chux);
    return 0;
}
Contuse answered 13/9, 2020 at 11:33 Comment(0)
D
0

The first one will not result in overflow for a large value of low/high unlike second one. It's always preferred to use mid = low + (high - low) / 2.

Dunlop answered 13/9, 2020 at 15:8 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.