Can I deterministically sum a vector of arbitrarily arranged floating-point numbers?

Say I have a (possibly large) vector of floating-point numbers resulting from some black box process. Is it possible to calculate a bitwise reproducible summation of these numbers?

If the black box process always produces the numbers in the same order, then a bitwise reproducible summation is easy: just sum them from left to right.

But if the numbers are generated in a random order, perhaps because they are returned and collected from asynchronous processes, then it is harder: I'd have to sort the numbers first.

But what if there are even more numbers, perhaps distributed across different machines, so that it's undesirable to move them?

Is there still a way to deterministically sum them?

Electrophysiology answered 8/5, 2021 at 5:1 Comment(8)
Interesting question and answer, but it would help to explicitly say "floating-point numbers". The numbers I sum are usually integers, which are not so high-maintenance.Jestinejesting
You could set up a fixed-point variable with 2**256 digits for float32 (plus some spare depending on the maximal number of inputs), or 2**2048 digits for float64.Trusting
"bitwise reproducible" sounds like you want something you can do an equality check on. Sometimes it's useful to hash the values together and do an equality check on that. Not sure if that's relevant here, interesting question though!Mickey
@SamMason: That's correct. The goal is to pass the values through some process and then check whether the outputs are invariant between different versions of the code. However, for this to work the process must be deterministic, which is challenging if certain forms of parallelism are involved.Electrophysiology
@Electrophysiology why not just interpret the floats as bits then? e.g. just view a double as a uint64_t and just sum ints together? associativity would come for free, and parallel sums trivial. you might also want to sum the doubles for a human to see, but the test would just be looking at the bitsMickey
If I were to do such thing without reading any papers, I would group together numbers that have the same exponent (there are 11 bits so 2048 different exponents), sum their mantissas exactly (as bignums, not dropping any bits), then convert back to floating point, then finally sort and sum the partial sums. Only the last step cannot be distributed/parallelized.Delmore
@SamMason: it's a good idea, but one in which you (very likely) throw out any possibility of having an accurate answer, which is often desirable, especially if you intend to use your summed value in some downstream calculation.Electrophysiology
@n. This seems like a decent idea, though you'll have to account for special values like infinity, NaN, and subnormal, as well as find a nice way to handle mantissa overflow (which would otherwise lead to exponent increases in a standard float).Electrophysiology

Overview: A Single-pass, Data-Preserving Method

Yes!

A method to do just this is described in the paper "Algorithms for Efficient Reproducible Floating Point Summation" by Ahrens, Demmel, and Nguyen (2020). I've included an implementation of it below.

Here we'll compare three algorithms:

  1. Conventional left-to-right summation: this is fast, inaccurate, and non-reproducible with respect to permutations in the input order.
  2. Kahan summation: this is slower, very accurate (essentially O(1) in the input size), and, while non-reproducible with respect to permutations in the input order, closer to reproducibility in a narrow sense.
  3. Reproducible summation: this is somewhat slower than Kahan summation, can be quite accurate, and is bitwise reproducible.

Conventional summation is inaccurate because, as the sum grows, the least-significant digits of the numbers we add to it get silently dropped. Kahan summation overcomes this by keeping a running "compensation" sum which holds onto the least-significant digits. The reproducible summation uses a similar strategy to maintain accuracy, but maintains several bins corresponding to different levels of compensation. These bins also appropriately truncate the significance of the input data to obtain an accurate, reproducible sum. A tiny demonstration of the silent dropping that motivates compensation follows.
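
To see the dropping concretely, here is a small self-contained demonstration (my own, not from the paper): adding 1.0f to 1e8f does nothing in single precision, because 1.0 is below the spacing of floats near 1e8, while Kahan's compensation term captures exactly what was lost.

#include <iostream>

int main(){
  const float big = 1e8f, small = 1.0f;

  float simple = big;
  simple += small;                        //rounds back to 1e8f; the 1.0f vanishes
  std::cout << (simple == big) << "\n";   //prints 1

  //One step of Kahan summation: c captures the dropped low-order part
  float sum = big, c = 0.0f;
  const float y = small - c;
  const float t = sum + y;                //t == 1e8f: the 1.0f is dropped here...
  c = (t - sum) - y;                      //...but recovered: c == -1.0f
  sum = t;
  std::cout << c << "\n";                 //prints -1, re-added on the next step
  return 0;
}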

First Test

To study the algorithms, we'll run a test on 1 million numbers each drawn uniformly from the range [-1000, 1000]. To demonstrate both the range of answers of the algorithms as well as their reproducibility, the input will be shuffled and summed 100 times.

A result I'll assert, which the code below demonstrates rigorously, is that for each data type studied the deterministic algorithm yields the same result for each of the 100 data orderings.

My implementation includes a class which reproducibly accumulates values passed to it, but there are several ways to pass the input data. We'll experiment with three of them (a usage sketch follows the list):

  • 1ata tests a mode in which numbers are passed to the accumulator one at a time without prior knowledge of the maximum magnitude of the inputs
  • many tests a mode in which many numbers are passed to the accumulator, but without prior knowledge of the maximum magnitude of the input
  • manyc tests a mode in which many numbers are passed to the accumulator and in which we have prior knowledge of the maximum magnitude of the inputs
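
To make the three calling conventions concrete, here is a hedged sketch. The class and method names are illustrative stand-ins, and the toy below does a naive sum purely to show the interface; it is not the reproducible algorithm, whose real API is in the code linked below.

//Toy stand-in that only illustrates the three calling conventions.
//Usage: ToyAccumulator<float> a; a += 1.0f; a.add(v.begin(), v.end());
template<class FloatType>
struct ToyAccumulator {
  FloatType sum = 0;

  //1ata: one value at a time, no knowledge of magnitudes
  void operator+=(const FloatType x){ sum += x; }

  //many: a whole range at once; magnitudes discovered internally
  template<class It>
  void add(It begin, It end){
    for(auto it = begin; it != end; ++it) sum += *it;
  }

  //manyc: a whole range plus a known bound on |x|, enabling the fastest path
  template<class It>
  void add(It begin, It end, const FloatType max_abs){
    (void)max_abs;  //a real implementation uses the bound to pre-configure its bins
    add(begin, end);
  }
};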

The results of the various algorithms are as follows:

Single-Precision Floats
==================================================================
Average Reproducible sum 1ata time   = 0.0115s (11.5x simple; 2.6x Kahan)
Average Reproducible sum many time   = 0.0047s (4.7x simple; 1.1x Kahan)
Average Reproducible sum manyc time  = 0.0036s (3.6x simple; 0.8x Kahan)
Average Kahan summation time         = 0.0044s
Average simple summation time        = 0.0010s
Reproducible Value                   = 767376.500000000 (0% diff vs Kahan)
Kahan long double accumulator value  = 767376.500000000
Error bound on RV                    = 15.542840004
Distinct Kahan (single accum) values = 1
Distinct Simple values               = 92


Double-Precision Floats
==================================================================
Average Reproducible sum 1ata time   = 0.0112s (10.5x simple; 2.6x Kahan)
Average Reproducible sum many time   = 0.0051s (4.7x simple; 1.2x Kahan)
Average Reproducible sum manyc time  = 0.0037s (3.4x simple; 0.8x Kahan)
Average simple summation time        = 0.0011s
Average Kahan summation time         = 0.0044s
Reproducible Value                   = 310844.70069914369378239 (0% diff vs Kahan)
Kahan long double accumulator value  = 310844.70069914369378239
Error bound on RV                    = 0.00000000048315059
Distinct Kahan (double accum) values = 2
Distinct Simple values               = 96

Second Test

Let's try a more challenging test than uniform random data! Instead, let's generate 1 million points in the range [0, 2π) and use them to generate the y values of a sine wave, which we then sum. As before, we shuffle these data points 100 times and see what kinds of outputs we get. The timing values are very similar to the above, so let's leave them out.
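
For reference, a minimal sketch of how such a dataset might be generated (the function name and seed handling are my own choices, not from the original test harness):

#include <cmath>
#include <random>
#include <vector>

std::vector<float> make_sine_data(const int n, const unsigned seed){
  std::mt19937 gen(seed);
  //Arguments drawn uniformly from [0, 2*pi)
  std::uniform_real_distribution<float> distr(0.0f, 6.283185307f);
  std::vector<float> y;
  y.reserve(n);
  for(int i = 0; i < n; i++){
    y.push_back(std::sin(distr(gen)));  //the values to be summed
  }
  return y;
}

This gives us: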

Single-Precision Floats
==================================================================
Reproducible Value                   = 0.000000000
Kahan long double accumulator value  = -0.000000000
Error bound                          = 14.901161194
Distinct Kahan (single) values       = 96
Distinct Simple values               = 100
Kahan lower value                    = -0.000024229
Kahan upper value                    = 0.000025988
Simple lower value                   = -0.017674208
Simple upper value                   = 0.018800795
Kahan range                          = 0.000050217
Simple range                         = 0.036475003
Simple range / Kahan range           = 726

Double-Precision Floats
==================================================================
Reproducible Value                   = 0.00000000000002185
Kahan long double accumulator value  = 0.00000000000002185
Error bound                          = 0.00000000000000083
Distinct Kahan (double) values       = 93
Distinct Simple values               = 100
Kahan lower value                    = 0.00000000000000017
Kahan upper value                    = 0.00000000000006306
Simple lower value                   = -0.00000000004814216
Simple upper value                   = 0.00000000002462563
Kahan range                          = 0.00000000000006289
Simple range                         = 0.00000000007276779
Simple range / Kahan range           = 1157

Here we see that the reproducible value matches exactly the long-double Kahan summation value we use as a source of truth.

In contrast to our results on the previous, simpler test, there are many distinct Kahan summation values (where the Kahan summation uses the same floating-point type as the data). Using a more difficult dataset has destroyed the "partial determinism" we observed in Kahan summation on the simpler test, even though the range of values produced by Kahan summation is orders of magnitude smaller than that obtained via simple summation.

So in this instance reproducible summation gives better accuracy than Kahan summation and yields a single bitwise-reproducible result rather than ~100 different results.

Costs

Time. The reproducible summation takes ~3.5x longer than the simple summation and about the same amount of time as Kahan summation.

Memory. Using 3-fold binning requires 6*sizeof(floating_type) space for the accumulator. That is, 24 bytes for single-precision and 48 bytes for double-precision.

Additionally, a static lookup table is needed which can be shared between all accumulators of a given data type and folding. For 3-fold binning this table is 41 floats (164 bytes) or 103 doubles (824 bytes).

Discussion

So, what have we learned?

Standard summations give a wide range of results. We ran 100 tests and got 96 unique results for the double-precision types. This is clearly not reproducible and demonstrates well the problem we're trying to solve.

Kahan summation on the other hand produces very few distinct results (thus it is "more" deterministic) and takes approximately 4x as long as conventional summation.

The reproducible algorithm takes ~10x longer than simple summation if we know nothing about the numbers we'll be adding; however, if we know the maximum magnitude of our summands then the reproducible algorithm only takes ~3.5x longer than simple summing.

In comparison to Kahan summation, the reproducible algorithm sometimes takes less time (!) while ensuring that we get answers which are bitwise identical. Further, unlike the multi-pass method detailed in my other answer below, the reproducible result matches the Kahan summation result for both single and double precision in our simple test. We also get strong guarantees that the reproducible and Kahan results have a very low relative error.

What folding level is best? 3, per the graph below.

[Figure: Time and Error vs Folding]

Fold-1 is very bad. For this test, fold-2 gives good accuracy for double but a 10% relative error for float. Fold-3 gives good accuracy for both double and float with a small time increase, so fold-2 is only worthwhile for one-at-a-time reproducible summation on doubles. Going higher than fold-3 gives only marginal benefits.

Code

The full code is too long to include here; however, you can find the C++ version here, and ReproBLAS here.

Electrophysiology answered 27/5, 2021 at 1:29 Comment(0)

Overview: A Multi-pass, Data-Mutating Method

(See here for a better answer.)

Yes, there is a way.

A method to do just this is described in the paper "Fast Reproducible Floating-Point Summation" by Demmel and Nguyen (2013). I've included both a serial and parallel implementation of it below.

Here we'll compare three algorithms:

  1. Conventional left-to-right summation: this is fast, inaccurate, and non-reproducible with respect to permutations in the input order.
  2. Kahan summation: this is slower, very accurate (essentially O(1) in the input size), and, while non-reproducible with respect to permutations in the input order, closer to reproducibility in a narrow sense.
  3. Deterministic summation: this is somewhat slower than Kahan summation, can be quite accurate, and is bitwise reproducible.

Conventional summation is inaccurate because as the sum grows the least-significant digits of the numbers we add to it get silently dropped. Kahan summation overcomes this by keeping a running "compensation" sum which holds onto the least-significant digits. The deterministic summation uses a similar strategy to maintain accuracy but, before performing the summation, "pre-rounds" the input numbers to a common base so that their sum can be computed accurately without any rounding error.

Tests

To study the algorithms, we'll run a test on 1 million numbers each drawn uniformly from the range [-1000, 1000]. To demonstrate both the range of answers of the algorithms as well as their determinism, the input will be shuffled and summed 100 times. The parallel algorithms are run using 12 threads.

The "correct" summation (as determined by using Kahan summation in long double mode and choosing the most frequently occuring value) is

310844.700699143717685046795

A result I'll assert, which the code below demonstrates rigorously, is that for each data type studied the deterministic algorithm yields the same result for each of the 100 data orderings and that these results are the same for both the serial and parallel variants of the algorithm.

The results of the various algorithms are as follows:

Serial Float
=========================================================
Simple summation accumulation type   = float
Average deterministic summation time = 0.00462s
Average simple summation time        = 0.00144s (3.20x faster than deterministic)
Average Kahan summation time         = 0.00290s (1.59x faster than deterministic)
Deterministic value                  = 256430592.000000000 (very different from actual)
Distinct Kahan values                = 3  (with single-precision accumulator)
Distinct Simple values               = 93 (with single-precision accumulator)
Distinct Simple values               = 1  (with double-precision accumulator)

Parallel Float
=========================================================
Simple summation accumulation type   = float
Average deterministic summation time = 0.00576s
Average simple summation time        = 0.00206s (2.79x faster than deterministic)
Average Kahan summation time         = 0.00599s (0.96x faster than deterministic)
Deterministic value                  = 256430592.000000000 (very different from actual)
Distinct Kahan values                = 3  (with single-precision accumulator)
Distinct Simple values               = 89 (with single-precision accumulator)
Distinct Simple values               = 1  (with double-precision accumulator)

Serial Double
=========================================================
Average deterministic summation time = 0.00600s
Average simple summation time        = 0.00171s (3.49x faster than deterministic)
Average Kahan summation time         = 0.00378s (1.58x faster than deterministic)
Deterministic value                  = 310844.70069914375199005 (epsilon difference from actual value)
Distinct Kahan values                = 4
Distinct Simple values               = 88

Parallel Double
=========================================================
Average deterministic summation time = 0.01074s
Average simple summation time        = 0.00289s (3.71x faster than deterministic)
Average Kahan summation time         = 0.00648s (1.65x faster than deterministic)
Deterministic value                  = 310844.70069914375199005 (epsilon difference from actual value)
Distinct Kahan values                = 2
Distinct Simple values               = 83

Serial Long Double
=========================================================
Average deterministic summation time = 0.01072s
Average simple summation time        = 0.00215s (4.96x faster than deterministic)
Average Kahan summation time         = 0.00448s (2.39x faster than deterministic)
Deterministic value                  = 310844.700699143717685046795 (no discernible difference from actual)
Distinct Kahan values                = 3
Distinct Simple values               = 94

Parallel Long Double
=========================================================
Average deterministic summation time = 0.01854s
Average simple summation time        = 0.00466s (3.97x faster than deterministic)
Average Kahan summation time         = 0.00635s (2.91x faster than deterministic)
Deterministic value                  = 310844.700699143717685046795 (no discernible difference from actual)
Distinct Kahan values                = 1
Distinct Simple values               = 82

Discussion

So, what have we learned? From the single-precision results we see that using a double-length accumulator makes the conventional summation algorithm reproducible for this data set, though that almost certainly wouldn't be the case for arbitrary datasets. Still, this can be a cheap way to get improved accuracy when it works.

When the accumulator used by conventional summation is of the same size as the numbers being added, it works very poorly: we ran 100 tests and got nearly as many distinct results from conventional summation.

Kahan summation on the other hand produces very few distinct results (thus it is "more" deterministic) and takes approximately twice as long as conventional summation.

The deterministic algorithm takes ~4x longer than simple summation and about 1.5-3x longer than Kahan summation, but gets approximately the same answers for both the double and long double types, with the added benefit of bitwise determinism (it always gets the same answer no matter how the input data is sorted).

However, the single-precision deterministic summation gets a very bad answer, unlike the single-precision conventional and Kahan summations, which turn out to be quite close to the actual answer. Why is this?

The paper we've been working from determines that if there are n numbers in the input and we use k folding rounds (the paper recommends 2, which is what we use here), then the deterministic and conventional sums will have similar error bounds provided

n^k * e^(k-1) << 1

where e is the floating-point epsilon of the data type. These epsilon values are:

Single-precision epsilon      = 0.000000119
Double-precision epsilon      = 0.00000000000000022
Long double-precision epsilon = 0.000000000000000000108

Plugging these into the equation along with n=1M we get:

Single condition      = 119000
Double condition      = 0.00022
Long double condition = 0.000000108

So we see that it's expected that single-precision would perform poorly for an input of this size.

Another point to note is that although long double takes 16 bytes on my machine, this is only for alignment purposes: only 80 bits are actually used for computation. Therefore, two long doubles can compare as numerically equal while their unused padding bits differ arbitrarily. For true bitwise reproducibility the unused bits would need to be set to a known value; a sketch of one way to do so follows.
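
For instance, a minimal sketch (my own, assuming the x86 layout in which the 80-bit payload occupies the low 10 bytes) that produces canonical bytes suitable for hashing or memcmp:

#include <cstring>

//Copy an x87 80-bit long double into a zero-filled buffer so that the padding
//bytes are deterministic. Assumes x86: 10 payload bytes, little-endian.
void canonical_bytes(const long double x, unsigned char out[16]){
  std::memset(out, 0, 16);   //clear the 6 padding bytes
  std::memcpy(out, &x, 10);  //copy the 80 bits that actually matter
}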

Code

//Compile with:
//g++ -g -O3 repro_vector.cpp -fopenmp

//NOTE: Always compile with `-g`. It doesn't slow down your code, but does make
//it debuggable and improves ease of profiling

#include <algorithm>
#include <bitset>               //Used for showing bitwise representations
#include <cfenv>                //Used for setting floating-point rounding modes
#include <chrono>               //Used for timing algorithms
#include <climits>              //Used for showing bitwise representations
#include <cmath>                //Used for std::abs, std::pow, std::log2, std::ceil
#include <cstdint>              //Used for uint32_t
#include <functional>           //Used for std::plus
#include <iostream>
#include <limits>               //Used for std::numeric_limits
#include <omp.h>                //OpenMP
#include <random>
#include <stdexcept>
#include <string>
#include <typeinfo>
#include <unordered_map>
#include <vector>

constexpr int ROUNDING_MODE = FE_UPWARD;
constexpr int N = 1'000'000;
constexpr int TESTS = 100;

// Simple timer class for tracking cumulative run time of the different
// algorithms
struct Timer {
  double total = 0;
  std::chrono::high_resolution_clock::time_point start_time;

  Timer() = default;

  void start() {
    start_time = std::chrono::high_resolution_clock::now();
  }

  void stop() {
    const auto now = std::chrono::high_resolution_clock::now();
    const auto time_span = std::chrono::duration_cast<std::chrono::duration<double>>(now - start_time);
    total += time_span.count();
  }
};

//Simple class to enable directed rounding in floating-point math and to reset
//the rounding mode afterwards, when it goes out of scope
struct SetRoundingMode {
  const int old_rounding_mode;

  SetRoundingMode(const int mode) : old_rounding_mode(fegetround()) {
    if(std::fesetround(mode)!=0){
      throw std::runtime_error("Failed to set directed rounding mode!");
    }
  }

  ~SetRoundingMode(){
    if(std::fesetround(old_rounding_mode)!=0){
      throw std::runtime_error("Failed to reset rounding mode to original value!");
    }
  }

  static std::string get_rounding_mode_string() {
    switch (fegetround()) {
      case FE_DOWNWARD:   return "downward";
      case FE_TONEAREST:  return "to-nearest";
      case FE_TOWARDZERO: return "toward-zero";
      case FE_UPWARD:     return "upward";
      default:            return "unknown";
    }
  }
};

// Used to make showing bitwise representations somewhat more intuitive
template<class T>
struct binrep {
  const T val;
  binrep(const T val0) : val(val0) {}
};

// Display the bitwise representation
template<class T>
std::ostream& operator<<(std::ostream& out, const binrep<T> a){
  const char* beg = reinterpret_cast<const char*>(&a.val);
  const char *const end = beg + sizeof(a.val);
  while(beg != end){
    out << std::bitset<CHAR_BIT>(*beg++);
    if(beg < end)
      out << ' ';
  }
  return out;
}



//Simple serial summation algorithm with an accumulation type we can specify
//to more fully explore its behaviour
template<class FloatType, class SimpleAccumType>
FloatType serial_simple_summation(const std::vector<FloatType> &vec){
  SimpleAccumType sum = 0;
  for(const auto &x: vec){
    sum += x;
  }
  return sum;
}

//Parallel variant of the simple summation algorithm above
template<class FloatType, class SimpleAccumType>
FloatType parallel_simple_summation(const std::vector<FloatType> &vec){
  SimpleAccumType sum = 0;
  #pragma omp parallel for default(none) reduction(+:sum) shared(vec)
  for(size_t i=0;i<vec.size();i++){
    sum += vec[i];
  }
  return sum;
}



//Kahan's compensated summation algorithm for accurately calculating sums of
//many numbers with O(1) error
template<class FloatType>
FloatType serial_kahan_summation(const std::vector<FloatType> &vec){
  FloatType sum = 0.0f;
  FloatType c = 0.0f;
  for (const auto &num: vec) {
    const auto y = num - c;
    const auto t = sum + y;
    c = (t - sum) - y;
    sum = t;
  }
  return sum;
}

//Parallel version of Kahan's compensated summation algorithm (could be improved
//by better accounting for the compensation during the reduction phase)
template<class FloatType>
FloatType parallel_kahan_summation(const std::vector<FloatType> &vec){
  //Parallel phase
  std::vector<FloatType> sum(omp_get_max_threads(), 0);
  FloatType c = 0;
  #pragma omp parallel for default(none) firstprivate(c) shared(sum,vec)
  for (size_t i=0;i<vec.size();i++) {
    const auto tid = omp_get_thread_num();
    const auto y = vec[i] - c;
    const auto t = sum.at(tid) + y;
    c = (t - sum[tid]) - y;
    sum[tid] = t;
  }

  //Serial reduction phase

  //This could be more accurate if it took the remaining compensation values
  //from above into account
  FloatType total_sum = 0.0f;
  FloatType total_c = 0.0f;
  for(const auto &num: sum){
    const auto y = num - total_c;
    const auto t = total_sum + y;
    total_c = (t - total_sum) - y;
    total_sum = t;
  }

  return total_sum;
}



// Error-free vector transformation. Algorithm 4 from Demmel and Nguyen (2013)
template<class FloatType>
FloatType ExtractVectorNew2(
  const FloatType M,
  const typename std::vector<FloatType>::iterator &begin,
  const typename std::vector<FloatType>::iterator &end
){
  // Should use the directed rounding mode of the parent thread

  auto Mold = M;
  for(auto v=begin;v!=end;v++){
    auto Mnew = Mold + (*v);
    auto q = Mnew - Mold;
    (*v) -= q;
    Mold = Mnew;
  }

  //This is the exact sum of high order parts q_i
  //v is now the vector of low order parts r_i
  return Mold - M;
}

template<class FloatType>
FloatType mf_from_deltaf(const FloatType delta_f){
  const int power = std::ceil(std::log2(delta_f));
  return static_cast<FloatType>(3.0) * std::pow(2, power);
}

//Implements the error bound discussed near Equation 6 of
//Demmel and Nguyen (2013).
template<class FloatType>
bool is_error_bound_appropriate(const size_t N, const int k){
  const auto eps = std::numeric_limits<FloatType>::epsilon();
  const auto ratio = std::pow(N, k) * std::pow(eps, k-1);
  //If ratio << 1, then the conventional non-reproducible sum and the
  //deterministic sum will have error bounds of the same order. We arbitrarily
  //choose 1e-4 to represent this
  return ratio < 1e-3;
}

//Serial bitwise deterministic summation.
//Algorithm 8 from Demmel and Nguyen (2013).
template<class FloatType>
FloatType serial_bitwise_deterministic_summation(
  std::vector<FloatType> vec,  // Note that we're making a copy!
  const int k
){
  constexpr FloatType eps = std::numeric_limits<FloatType>::epsilon();
  const auto n = vec.size();
  const auto adr = SetRoundingMode(ROUNDING_MODE);

  if(n==0){
    return 0;
  }

  if(!is_error_bound_appropriate<FloatType>(vec.size(), k)){
    std::cout<<"WARNING! Error bounds of deterministic sum are large relative to conventional summation!"<<std::endl;
  }

  FloatType m = std::abs(vec.front());
  for(const auto &x: vec){
    m = std::max(m, std::abs(x));
  }

  FloatType delta_f = n * m / (1 - 4 * (n + 1) * eps);
  FloatType Mf = mf_from_deltaf(delta_f);

  std::vector<FloatType> Tf(k);
  for(int f=0;f<k-1;f++){
    Tf[f] = ExtractVectorNew2<FloatType>(Mf, vec.begin(), vec.end());
    delta_f = n * (4 * eps * Mf / 3) / (1 - 4 * (n + 1) * eps);
    Mf = mf_from_deltaf(delta_f);
  }

  FloatType M = Mf;
  for(const FloatType &v: vec){
    M += v;
  }
  Tf[k-1] = M - Mf;

  FloatType T = 0;
  for(const FloatType &tf: Tf){
    T += tf;
  }

  return T;
}

//Parallel bitwise deterministic summation.
//Algorithm 9 from Demmel and Nguyen (2013).
template<class FloatType>
FloatType parallel_bitwise_deterministic_summation(
  std::vector<FloatType> vec,  // Note that we're making a copy!
  const int k
){
  constexpr FloatType eps = std::numeric_limits<FloatType>::epsilon();
  const auto n = vec.size();
  const auto adr = SetRoundingMode(ROUNDING_MODE);

  if(n==0){
    return 0;
  }

  if(!is_error_bound_appropriate<FloatType>(vec.size(), k)){
    std::cout<<"WARNING! Error bounds of deterministic sum are large relative to conventional summation!"<<std::endl;
  }

  std::vector<FloatType> Tf(k);

  // Note that this reduction would require communication if we had
  // implemented this to work on multiple nodes
  FloatType m = std::abs(vec.front());
  #pragma omp parallel for default(none) reduction(max:m) shared(vec)
  for(size_t i=0;i<vec.size();i++){
    m = std::max(m, std::abs(vec[i]));
  }

  // Note that this reduction would require communication if we had
  // implemented this to work on multiple nodes
  #pragma omp declare reduction(vec_plus : std::vector<FloatType> : \
    std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), std::plus<FloatType>())) \
    initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))

  #pragma omp parallel default(none) reduction(vec_plus:Tf) shared(k,m,n,vec,std::cout)
  {
    const auto adr = SetRoundingMode(ROUNDING_MODE);
    const auto threads = omp_get_num_threads();
    const auto tid = omp_get_thread_num();
    const auto values_per_thread = n / threads;
    const auto nlow = tid * values_per_thread;
    const auto nhigh = (tid<threads-1) ? ((tid+1) * values_per_thread) : n;

    FloatType delta_f = n * m / (1 - 4 * (n + 1) * eps);
    FloatType Mf = mf_from_deltaf(delta_f);

    for(int f=0;f<k-1;f++){
      Tf[f] = ExtractVectorNew2<FloatType>(Mf, vec.begin() + nlow, vec.begin() + nhigh);
      delta_f = n * (4 * eps * Mf / 3) / (1 - 4 * (n + 1) * eps);
      Mf = mf_from_deltaf(delta_f);
    }

    FloatType M = Mf;
    for(size_t i=nlow;i<nhigh;i++){
      M += vec[i];
    }
    Tf[k-1] = M - Mf;
  }

  FloatType T = 0;
  for(const FloatType &tf: Tf){
    T += tf;
  }

  return T;
}



//Convenience wrappers
template<bool Parallel, class FloatType>
FloatType bitwise_deterministic_summation(
  const std::vector<FloatType> &vec,  // Passed by reference; the functions called below make their own copies
  const int k
){
  if(Parallel){
    return parallel_bitwise_deterministic_summation<FloatType>(vec, k);
  } else {
    return serial_bitwise_deterministic_summation<FloatType>(vec, k);
  }
}

template<bool Parallel, class FloatType, class SimpleAccumType>
FloatType simple_summation(const std::vector<FloatType> &vec){
  if(Parallel){
    return parallel_simple_summation<FloatType, SimpleAccumType>(vec);
  } else {
    return serial_simple_summation<FloatType, SimpleAccumType>(vec);
  }
}

template<bool Parallel, class FloatType>
FloatType kahan_summation(const std::vector<FloatType> &vec){
  if(Parallel){
    return parallel_kahan_summation<FloatType>(vec);
  } else {
    return serial_kahan_summation<FloatType>(vec);
  }
}



// Timing tests for the summation algorithms
template<bool Parallel, class FloatType, class SimpleAccumType>
FloatType PerformTestsOnData(
  const int TESTS,
  std::vector<FloatType> floats, //Make a copy so we use the same data for each test
  std::mt19937 gen               //Make a copy so we use the same data for each test
){
  Timer time_deterministic;
  Timer time_kahan;
  Timer time_simple;

  //Very precise output
  std::cout.precision(std::numeric_limits<FloatType>::max_digits10);
  std::cout<<std::fixed;

  std::cout<<"Parallel? "<<Parallel<<std::endl;
  if(Parallel){
    std::cout<<"Max threads = "<<omp_get_max_threads()<<std::endl;
  }
  std::cout<<"Floating type                        = "<<typeid(FloatType).name()<<std::endl;
  std::cout<<"Floating type epsilon                = "<<std::numeric_limits<FloatType>::epsilon()<<std::endl;
  std::cout<<"Simple summation accumulation type   = "<<typeid(SimpleAccumType).name()<<std::endl;
  std::cout<<"Number of tests                      = "<<TESTS<<std::endl;
  std::cout<<"Input sample = "<<std::endl;
  for(size_t i=0;i<10;i++){
    std::cout<<"\t"<<floats[i]<<std::endl;
  }

  //Get a reference value
  std::unordered_map<FloatType, uint32_t> simple_sums;
  std::unordered_map<FloatType, uint32_t> kahan_sums;
  const auto ref_val = bitwise_deterministic_summation<Parallel, FloatType>(floats, 2);
  for(int test=0;test<TESTS;test++){
    std::shuffle(floats.begin(), floats.end(), gen);

    time_deterministic.start();
    const auto my_val = bitwise_deterministic_summation<Parallel, FloatType>(floats, 2);
    time_deterministic.stop();
    if(ref_val!=my_val){
      std::cout<<"ERROR: UNEQUAL VALUES ON TEST #"<<test<<"!"<<std::endl;
      std::cout<<"Reference      = "<<ref_val                   <<std::endl;
      std::cout<<"Current        = "<<my_val                    <<std::endl;
      std::cout<<"Reference bits = "<<binrep<FloatType>(ref_val)<<std::endl;
      std::cout<<"Current   bits = "<<binrep<FloatType>(my_val) <<std::endl;
      throw std::runtime_error("Values were not equal!");
    }

    time_kahan.start();
    const auto kahan_sum = kahan_summation<Parallel, FloatType>(floats);
    kahan_sums[kahan_sum]++;
    time_kahan.stop();

    time_simple.start();
    const auto simple_sum = simple_summation<Parallel, FloatType, SimpleAccumType>(floats);
    simple_sums[simple_sum]++;
    time_simple.stop();
  }

  std::cout<<"Average deterministic summation time = "<<(time_deterministic.total/TESTS)<<std::endl;
  std::cout<<"Average simple summation time        = "<<(time_simple.total/TESTS)<<std::endl;
  std::cout<<"Average Kahan summation time         = "<<(time_kahan.total/TESTS)<<std::endl;
  std::cout<<"Ratio Deterministic to Simple        = "<<(time_deterministic.total/time_simple.total)<<std::endl;
  std::cout<<"Ratio Deterministic to Kahan         = "<<(time_deterministic.total/time_kahan.total)<<std::endl;

  std::cout<<"Reference value                      = "<<std::fixed<<ref_val<<std::endl;
  std::cout<<"Reference bits                       = "<<binrep<FloatType>(ref_val)<<std::endl;

  std::cout<<"Distinct Kahan values                = "<<kahan_sums.size()<<std::endl;
  std::cout<<"Distinct Simple values               = "<<simple_sums.size()<<std::endl;

  int count = 0;
  for(const auto &kv: kahan_sums){
    std::cout<<"\tKahan sum values (N="<<std::fixed<<kv.second<<") "<<kv.first<<" ("<<binrep<FloatType>(kv.first)<<")"<<std::endl;
    if(count++==10){
      break;
    }
  }

  count = 0;
  for(const auto &kv: simple_sums){
    std::cout<<"\tSimple sum values (N="<<std::fixed<<kv.second<<") "<<kv.first<<" ("<<binrep<FloatType>(kv.first)<<")"<<std::endl;
    if(count++==10){
      break;
    }
  }

  std::cout<<std::endl;

  return ref_val;
}

// Use this to make sure the tests are reproducible
template<class FloatType, class SimpleAccumType>
void PerformTests(
  const int TESTS,
  const std::vector<long double> &long_floats,
  std::mt19937 &gen
){
  std::vector<FloatType> floats(long_floats.begin(), long_floats.end());

  const auto serial_val = PerformTestsOnData<false, FloatType, SimpleAccumType>(TESTS, floats, gen);
  const auto parallel_val = PerformTestsOnData<true, FloatType, SimpleAccumType>(TESTS, floats, gen);

  //Note that the `long double` type may occupy 12-16 bytes (to maintain
  //alignment) while only 80 bits participate in computation, which leaves
  //bitwise indeterminism in the unused bits; the floating-point values
  //themselves, however, will be equal.
  std::cout<<"########################################"<<std::endl;
  std::cout<<"### Serial and Parallel values match for "
           <<typeid(FloatType).name()
           <<"? "
           <<(serial_val==parallel_val)
           <<std::endl;
  std::cout<<"########################################\n"<<std::endl;
}



int main(){
  std::random_device rd;
  // std::mt19937 gen(rd());   //Enable for randomness
  std::mt19937 gen(123456789); //Enable for reproducibility
  std::uniform_real_distribution<long double> distr(-1000, 1000);
  std::vector<long double> long_floats;
  for(int i=0;i<N;i++){
    long_floats.push_back(distr(gen));
  }

  PerformTests<double, double>(TESTS, long_floats, gen);
  PerformTests<long double, long double>(TESTS, long_floats, gen);
  PerformTests<float, float>(TESTS, long_floats, gen);
  PerformTests<float, double>(TESTS, long_floats, gen);

  return 0;
}

Edit for Eric Postpischil

Eric says:

A generator such as I described will produce numbers with largely the same quantum—all near multiples of the divisor. This may not include a wide variety of exponents in the samples, so it may not well model a distribution where the numbers are being rounded in the interior of the significand when being added. E.g., if we add many numbers of the form 123.45, they will add fine for a while, although rounding errors will grow as the partial sums accumulate. But if we add numbers of forms 12345, 123.45, and 1.2345, different errors occur earlier.

Adding 1M values of 123.45 gives a single long double Kahan value of 123450000.000000002837623469532. The deterministic long double value differs from this by -0.00000000000727595761 while the deterministic double value differs by -0.00000001206353772047 while the deterministic float differs by -3.68% (as expected given its large epsilon).

Choosing 1M values randomly from the set {1.2345, 12.345, 123.45, 1234.5, 12345} gives two long double Kahan values: A=2749592287.563000000780448317528 (N=54) and B=2749592287.563000000547617673874 (N=46). The deterministic long double value matches (A) exactly; the deterministic double value differs from (A) by -0.00000020139850676247; the deterministic float value differs from (A) by -257% (again, expected).

Electrophysiology answered 8/5, 2021 at 5:1 Comment(8)
Do I understand correctly that if n > 1/e then we're screwed, because increasing k won't help?Jestinejesting
The distribution of numbers used here is unclear and may not model real applications; std::uniform_real_distribution is underspecified. The standard specifies it as uniform on the real numbers (probability density is constant over an interval) but neglects the fact that floating-point numbers are not uniformly dense over that interval. A common method of generating “uniform” distributions—generate an integer and scale to the interval—neglects finer (lower exponent) floating-point numbers and so may not model realistic results.Smear
@j_random_hacker: That's the case, yes. Theorem 7 of the paper suggests n*e<0.1 as a good limit.Electrophysiology
@EricPostpischil: "may not model real applications". Since real applications can include anything, any model will fall short of encompassing their totality, so this doesn't really seem like an interesting criticism. We do not expect the wall-time to depend on the values, just how many values there are, so that generalizes. Further, the error bounds are rigorous, so although I hand-wave at them here for demonstration, they in fact should apply to any input. Since this is the case, I don't expect the exclusion of subnormals to be problematic. Of course, you can try with your own inputs :-)Electrophysiology
Subnormals are not specifically the issue. A generator such as I described will produce numbers with largely the same quantum—all near multiples of the divisor. This may not include a wide variety of exponents in the samples, so it may not well model a distribution where the numbers are being rounded in the interior of the significand when being added. E.g., if we add many numbers of the form 123.45, they will add fine for a while, although rounding errors will grow as the partial sums accumulate. But if we add numbers of forms 12345, 123.45, and 1.2345, different errors occur earlier.Smear
@EricPostpischil: I ran the test you suggested (see edited answer) and don't see anything special happening. Where the calculations match the bound n*e<0.1 we get accurate summations.Electrophysiology
This can divide by zero FloatType delta_f = n * m / (1 - 4 * (n + 1) * eps); if you have 2**21-1 float32 values. And it can overflow if m is very large. Also, I'm not sure how copying the vector complies with "But what if there are even more numbers, perhaps distributed across different machines, so that it's undesirable to move them?" Why is copying ok, if moving is not?Trusting
@chtz, as discussed in several places above, if you had that many (2**21-1) float32 values you would be pushing the limits on the error bounds anyway. The code makes the copy to better demonstrate numerical reliability; the vector could be modified in place. Moreover, note that the part of the vector handled by each individual thread is always the same and the data communicated between threads is on the order of a single number. If the computation were distributed across machines each machine would hold part of the vector and the copy would be per-node not between nodes.Electrophysiology

TL/DR: There is a simple and systematic way, even though it is not an efficient one: convert everything to fixed point and compute the exact value. Since only the final conversion back to floating point involves rounding, the whole process is independent of the order of the summands.

The idea is taken from The Exact Dot Product as Basic Tool for Long Interval Arithmetic.

The observation is that every floating-point number can also be represented as a fixed-point number of appropriate width. With fixed-point numbers, multiplication and addition can be performed exactly, provided that a few extra integer bits are added to account for overflow.

For instance, a 64-bit binary IEEE 754 float can be represented as a 2131-bit fixed-point number with 1056 bits before and 1075 bits after the decimal point (see Table 1 in the link for other FP formats, but note that the quoted numbers are a factor of 2 too high because they are for a product of two floats).

Adding 64 bits for overflow (assuming that a maximum of 2^64 numbers can be added at once), we arrive at a 2195-bit, or 275-byte, fixed-point width. In terms of 64-bit unsigned integers this is 18 limbs before the decimal point and 17 after.

For float32 it is only 3 limbs before and 3 after the decimal point, or, if one allows the decimal point to fall within a limb, only 5 limbs in total.

Such arithmetic is straightforward to implement in many languages. Typically no dynamic memory management is needed, and when unsigned integer overflow can be reliably detected (such as in C and C++) no assembly language is needed for an efficient implementation.

The principle can be applied to more complex floating-point problems where correct rounding is to be obtained. It is not very efficient, but it is simple to implement and rather robust (in the sense of testability). A sketch for the float32 case follows.
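
Here is a minimal illustration (my own, not code from the linked paper) of the float32 case using the five 64-bit limbs suggested in the comments. The accumulator is a 320-bit two's-complement fixed-point number in units of 2^-149 (the smallest positive subnormal), which covers the entire finite float32 range with headroom for roughly 2^40 additions. Every add() is exact and therefore order-independent; rounding can only happen in the final conversion. Finite inputs are assumed (infinities and NaNs would need separate handling).

#include <cmath>
#include <cstdint>

//320-bit two's-complement fixed-point accumulator for float32, in units of
//2^-149. Additions are exact and therefore independent of summand order.
struct FixedPointSum {
  uint64_t limb[5] = {0, 0, 0, 0, 0};  //little-endian

  void add(const float x){
    int e;
    const float fm = std::frexp(std::fabs(x), &e);  //|x| = fm * 2^e, fm in [0.5, 1)
    uint64_t mag = (uint64_t)std::ldexp(fm, 24);    //24-bit integer significand
    int shift = e + 125;                            //bit position in units of 2^-149
    if(shift < 0){ mag >>= -shift; shift = 0; }     //subnormal range: mag has trailing zeros
    const int q = shift / 64, r = shift % 64;
    const uint64_t lo = mag << r;
    const uint64_t hi = r ? (mag >> (64 - r)) : 0;  //part spilling into the next limb
    if(x < 0) sub_at(q, lo, hi); else add_at(q, lo, hi);
  }

  //Convert back to double; this is the only step that rounds
  double to_double() const {
    const bool neg = limb[4] >> 63;                 //two's-complement sign bit
    uint64_t t[5];
    uint64_t c = neg ? 1 : 0;
    for(int i = 0; i < 5; i++){                     //take the magnitude |value|
      t[i] = (neg ? ~limb[i] : limb[i]) + c;
      c = (t[i] < c);
    }
    long double acc = 0;
    for(int i = 4; i >= 0; i--){                    //assemble the big integer
      acc = acc * 18446744073709551616.0L + t[i];   //acc = acc * 2^64 + t[i]
    }
    const double mag = (double)std::ldexp(acc, -149);
    return neg ? -mag : mag;
  }

 private:
  void add_at(const int q, const uint64_t lo, const uint64_t hi){
    limb[q] += lo;
    uint64_t c = (limb[q] < lo);                    //wraparound means carry
    limb[q+1] += hi;
    uint64_t c2 = (limb[q+1] < hi);
    limb[q+1] += c;
    c2 += (limb[q+1] < c);
    for(int i = q+2; c2 && i < 5; i++){ limb[i] += c2; c2 = (limb[i] < c2); }
  }

  void sub_at(const int q, const uint64_t lo, const uint64_t hi){
    uint64_t b = (limb[q] < lo);                    //wraparound means borrow
    limb[q] -= lo;
    uint64_t b2 = (limb[q+1] < hi);
    limb[q+1] -= hi;
    b2 += (limb[q+1] < b);
    limb[q+1] -= b;
    for(int i = q+2; b2 && i < 5; i++){ const uint64_t nb = (limb[i] < b2); limb[i] -= b2; b2 = nb; }
  }
};

Usage is simply FixedPointSum s; s.add(x) for each value; then s.to_double(). Per-node accumulators could also be merged exactly with the same multi-limb carry logic, so a distributed sum can keep all data in place.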

EDIT: Corrected the numbers, as they were a factor of 2 too high. Also added the float32 case (suggested by njuffa).

Digestive answered 9/5, 2021 at 19:27 Comment(2)
The required number of bits in the fixed-point accumulator specified by this answer seems too large by about a factor of two for simple accumulation, which is what OP is asking about. Instead, the number of bits appears to be based on accumulation of products of IEEE-754 binary64 operands. On modern 64-bit platforms, use of fixed-point accumulation for IEEE-754 binary32 operands seems feasible with reasonable efficiency using five 64-bit limbs.Reconsider
@Reconsider You are absolutely right. I should have read the paper more carefully. I will update the numbers in my answer.Digestive