# Accurate summation¶

Accurate summation of two or more addends in floating-point arithmetic is not a trivial task. It has become an active field of research since the introduction of computers using floating-point arithmetic and resulted in many different approaches, that should only be sketched in Chapter Previous work. The accurate computation of sums is not only the basis for the chapter about accurate inner products, it is also important for residual iterations, geometrical predicates, computer algebra, linear programming or multiple precision arithmetic in general, as it is motivated in [Rump2009].

## Previous work¶

### Presorting the addends¶

One of the easiest summation algorithms is the Recursive summation (Algorithm Recursive summation). Higham describes in [Higham2002] (Chapter 4) that for this simple algorithm the order of the addends has a huge impact on the accuracy of the resulting sum. The problem with presorting the data is, that even fast sorting algorithms have a complexity of and in many applications there is no information about the addends in advance available. But the analyzing techniques and results from [Higham2002] (Chapter 4) can be applied later for other algorithms. Additional tools for error analysis are given in [Brisebarre2010] (Chapter 6.1.2). For the purpose of error analysis Higham introduces a condition number for the addends:

(1)¶

1 2 3 4 5 6 | function [s] = RecursiveSummation(x, N) s = x(1); for i = 2:N s = s + x(i); end end |

If all addends have the same sign, then . In
this case an increasing ordering is suggested. When the numbers of largest
magnitude are summed up first, all smaller addends will not influence the final
result, even if all small addends together would have a sufficient large
magnitude to contribute to the final result as well. If the signs of the largest
addends differ, then and heavy
cancellation happens by computing . Such kind of
summation data is called *ill-conditioned* analogue to [Higham2002] (Chapter 4)
and has often the exact result zero. For this kind of addends Higham suggest a
decreasing pre ordering.

### Error-free transformation and distillation¶

The most basic error-free summation algorithms transform two input addends, e.g.
*a* and *b*, into a floating-point sum *x* and a floating-point approximation
error *y*. These algorithms are called *error-free transformations*
[Ogita2005], if the following is satisfied:

It is easy to see, that if is given, is valid too. The two most distinct instances fulfilling the
error-free transformation property are *FastTwoSum* (Algorithm
Error-free transformation FastTwoSum) by Dekker [Ogita2005] and *TwoSum* (Algorithm
Error-free transformation TwoSum) by Knuth [Ogita2005]. *FastTwoSum* requires three and
*TwoSum* six FLOP s. *FastTwoSum* additionally requires and thus unavoidably an expensive branch. So one has
to check for the individual use case whether three additional FLOP s
exceed the cost of a branch, see [Ogita2005] and [Brisebarre2010] (Chapter
4.3) . There exists another algorithm by Priest, which will not be taken into
account. It requires , thus a branch, and
more FLOP s than the two already presented algorithms [Brisebarre2010]
(Chapter 4.3.3).

1 2 3 4 | function [x, y] = FastTwoSum (a, b) x = fl(a + b); y = fl(fl(a - x) + b); end |

1 2 3 4 5 | function [x, y] = TwoSum (a, b) x = fl(a + b); z = fl(x - a); y = fl(fl(a - fl(x - z)) + fl(b - z)); end |

The extension of error-free transformations from 2 to *N* addends is called
*distillation* [Ogita2005].

Distillation means, that in each distillation step *k* the values of the *N*
individual addends may change, but not their sum. The goal of distillation
algorithms is, that after a finite number of steps, assume *k* steps,
approximates
[Higham2002] (Chapter 4.4). The Error-free transformation and distillation
properties are preliminaries for cascaded and compensated summation.

### Cascaded and compensated summation¶

If *FastTwoSum* (Algorithm Error-free transformation FastTwoSum) or *TwoSum* (Algorithm
Error-free transformation TwoSum) is successively applied to all elements of a vector of
addends, it is called *cascaded summation*. When each new addend gets corrected
by the previously computed error of *FastTwoSum* (like in line 5 Algorithm
Kahan’s cascaded and compensated summation), it is called *compensated
summation*. The notation with explicit usage of *FastTwoSum* has been introduced
in [Brisebarre2010] (Algorithm 6.7). Algorithm Kahan’s cascaded and compensated summation relies on sorted input data , because of the internal usage of *FastTwoSum*.

1 2 3 4 5 6 7 8 9 10 11 | % x: array of sorted addends % N: number of addends in x % s: computed sum function [s] = KahansCompensatedSummation (x, N) s = x(1); e = 0; for i = 2:N y = fl(x(i) + e); % compensation step [s, e] = FastTwoSum (s, y); end end |

Rump, Ogita and Oishi present in [Ogita2005] another interesting algorithm,
namely *SumK*, which repeats the distillation *k - 1* times, followed by a final
recursive summation. The authors have shown that after the *(k - 1)*-th
repetition of the cascaded summation, the result has improved, as if it has been
computed with *k*-fold working precision.

### Long and cascaded accumulators¶

A completely different approach is not to look for ways to cope with the errors
of floating-point arithmetic, instead to change the precision on hardware level.
Therefore Kulisch and Miranker proposed the usage of a long high-precision
accumulator on hardware level [Kulisch1986]. This approach has not been
realized so far by common hardware vendors. In his book Kulisch describes
comprehensibly the realization of *Scalar product computation units (SPU)* for
common 32 and 64 bit PC architectures or as coprocessors [Kulisch2013] (Chapter
8). Kulisch reports about two more or less successful attempts of coprocessor
realizations, the most recent one with a Field Programmable Gate Array (FPGA)
[Kulisch2013] (Chapter 8.9.3). The major issue is the time penalty of the much
slower FPGA clock rates. But as there is much improvement on that field of
research and with intelligent usage of parallelism, it might be possible to
create a SPU, that is comparable to nowadays CPU floating-point units.
Nevertheless the idea of the long accumulator resulted in a C++ toolbox called
C-XSC [1], that is currently maintained by the University of Wuppertal. The
C-XSC toolbox has been developed for several years and is thoroughly tested,
therefore its version 2.5.3 will be used as reference for checking the
correctness of the computed results later in Chapter Benchmark.

Another interesting approach came up in a paper by Malcom [Malcolm1971], who caught up Wolfes idea of cascaded accumulators. Malcom modified this idea by splitting each addend in order to add each split part to an appropriate accumulator. Finally all accumulators are summed up in decreasing order of magnitude using ordinary recursive summation. This case was treated in Chapter Presorting the addends and results in a small relative error [Higham2002] (Chapter 4.4).

### Hybrid techniques¶

Zhu and Hayes published the accurate summation algorithm *OnlineExactSum*
[Hayes2010]. This algorithm claims to be independent of the number of addends
and the condition number (see Equation (1)) of
the input. Furthermore the results of *OnlineExactSum* are correctly rounded.
*OnlineExactSum* has a constant memory footprint, as it uses a fixed number of
cascaded accumulators. Each addends exponent is examined for the choice of an
appropriate accumulator and the accumulation is done by Dekkers error-free
transformation algorithm *FastTwoSum*. In contrast to Malcoms approach, the
final sum up of the cascaded accumulators is done by *iFastSum* [Hayes2009], a
distillation algorithm like *SumK*. In their paper [Hayes2010] Zhu and Hayes
proof the correctness of their algorithm by showing, that no accumulator looses
any significant digits, and by the correctly rounded result of *iFastSum* for
the final sum up. With various run time test for several types of input data
they verified the stable and predictable behavior of *OnlineExactSum*. With this
survey on previous work, a new algorithm will be proposed in the following
chapter. Many ideas for the proposed algorithm accrued from this previous work
and are extended by this new approach.

## BucketSum¶

### Generic description¶

The proposed algorithm *BucketSum* performs basically two steps, which will be
explained comprehensively in this chapter:

**Sort and accumulate:***N*addends are sorted by magnitude and stored into*M*buckets, . All significant bits are preserved.**Summing up buckets:**compute an accurate sum of the*M*buckets.

This approach is already known from Zhu and Hayes algorithm *HybridSum*
[Hayes2009] and from Malcolm [Malcolm1971]. Instead of increasing the
precision of the accumulators, the input data is split and stored in several
shorter accumulators. So each binary64 number can be seen as an
extended-precision accumulator for the reduced input data precision. Like in
*HybridSum* [Hayes2009] an array of binary64 numbers is created, each
for accumulating a certain part of the full binary64 exponent range.
Each element of that array will be called “bucket” in this chapter. For getting
an overall picture, the algorithms for the steps 1 and 2 are presented first.
The algorithm for step 2 is a slight modification of Kahan’s cascaded and
compensated summation (Algorithm Kahan’s cascaded and compensated summation). The compensation step has been taken out of the for-loop to reduce
the data dependency. In this modified version (Algorithm Modified Kahan’s cascaded and compensated summation) all summation errors are accumulated
inside the for-loop for the final compensation step. Additionally an initial
value for the resulting sum has been introduced.

1 2 3 4 5 6 7 8 | function [s] = ModifiedKahanSum (s, x, N) err = 0; for i = 1:N [s, e] = FastTwoSum (s, x(i)); err = fl(err + e); end s = fl(s + err) % compensation step end |

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 | function [s] = BucketSum (x, N) % Create appropriate masks mask = CreateMasks (M); mask(1) = 0; mask(M) = NaN; % Create array of M buckets, initialized with their mask. % a(1:2) are underflow and a((M - 1):M) are overflow buckets % a(3:(M - 2)) cover SHIFT exponents a = mask; sum = 0; for i = 1:N pos = ceil (exp(x(i)) / SHIFT) + 2; % exp(): extracts biased exponent [a(pos), e] = FastTwoSum (a(pos), x(i)); a(pos - 2) = fl(a(pos - 2) + e); if (mod (i, C1) == 0) % C1: capacity of normal buckets for j = 1:(M - 2) % Tidy up normal buckets r = fl(fl(mask(j + 1) + fl(a(j) - mask(j))) - mask(j + 1)); a(j + 1) = fl(a(j + 1) + r); a(j) = fl(a(j) - r); end end if (mod (i, C2) == 0) % C2: capacity of overflow buckets sum = fl(sum + fl(a(M - 1) - mask(M - 1))); % Tidy up overflow a(M - 1) = mask(M - 1); end end for i = 1:(M - 1) % Remove masks a(i) = a(i) - mask(i); end s = ModifiedKahanSum (sum, a_{M-1 \text{ downto } 1}, M-1); end |

*BucketSum* is responsible for step 1 and presented in Algorithm
BucketSum. What distinguishes *BucketSum* from *OnlineExactSum* is
the ternary partitioning of each floating-point accumulator (bucket). This
partitioning is done in order to archive a certain cascaded, overlapping pattern
for the accurate summation, as Figure Error storage scenario of the smallest allowed addend into bucket i - 2. shows. The
“distance” between two neighboring buckets is called *shift* and identical to
the length in the partitioning. From Figure Generic significant partition. one can see, that the whole generic partitioning pattern
consists of the following parts, that are determined in Theorem 2:

**Two set bits**in the beginning.**Accumulation reserve**(): This length decides about the number of addends, that can be accumulated into a bucket, without loosing any significant bits.**Variable extension**(*guard*):*guard*is a variable extension of the following .**Accumulated exponent range**(): Each bucket is assigned to accumulate addends of a certain exponent sub range. is the length of this range. Therefore, is at least one, otherwise no addend may be added to this bucket.**Residual**(): is only used to preserve significant bits of an addend.

Another characteristic of *BucketSum* is, that there is no fixed splitting of
the input addends like in *HybridSum* [Hayes2009]. The splitting is performed
dynamically by *FastTwoSum* as one can see in Algorithm BucketSum
line 9. After giving an overview of *BucketSum*, there follows a more detailed
description of the algorithm, which starts with a formal analysis of the bucket
partitioning.

**Axiom 1.**- The “unit in the first place” (see Section Rounding) of a bucket is immutable except for the underflow or overflow range.

Axiom 1 means that during the summation process the significance of the most
significant bit of each bucket may not change. Otherwise it is not possible to
rely on a fixed exponent range for each bucket. To archive that, the leading bit
pattern “11” has been introduced. Under the assumption, that the most
significant bit of bucket *i* is , each number less than
may be added or subtracted without changing the significance
of the first bit of the bucket. This property is well known from integer
arithmetic.

**Assumption 1.**- The exponent range of floating-point numbers is unlimited.

Assumption 1 allows to ignore the under- and overflow-range for now. These two ranges will be treated in Section Realization for binary64 for the special case of binary64 values.

**Theorem 1.**- The summation error of bucket
*i*has to be added at least to bucket*i - 2*. - Proof.
- The proof for Theorem 1 will be done graphically in Figure
Four possible examples for partitioning and storing the error of the smallest
allowed addend into the neighbouring bucket.. In that Figure it is obvious, that
independent of the bit lengths , , and
the full bit precision
*p*of the addend cannot be preserved. Therefore the summation error of bucket*i*has to be added at least to bucket*i - 2*, like shown in Figure Error storage scenario of the smallest allowed addend into bucket i - 2.. In Algorithm BucketSum, line 10, this action is performed. ∎

**Assumption 2.**- The summation error of bucket
*i*is added to bucket*i - 2*.

The choice of the error bucket is dependent on the size *shift*. The “further
away” the error bucket is, the smaller *shift* has to be, as one might deduce
from Figure Error storage scenario of the smallest allowed addend into bucket i - 2.. And the smaller *shift* is, the
more buckets are required. Assumption 2 takes the first possible error bucket
according to Theorem 1, in order to reduce the number of required buckets.

**Theorem 2.**For Axiom 1 , Assumption 1, and Assumption 2, the following rules have to apply to the lengths , , and , in order to get a ternary bucket partition, that maximizes the lengths and :

- Proof.
From Figure Error storage scenario of the smallest allowed addend into bucket i - 2. three useful equations can be derived:

(2)¶

(3)¶

(4)¶

By reformulating Equation (2) to , one can derive two equivalent objective functions . For the latter one, a constrained optimization problem is given in (5).

(5)¶

The optimization problem (5) can be relaxed to the problem (6) with additionally combining the first two constraints.

(6)¶

As has the larger factor in the first constraint of (6), an optimum can be obtained for in Equation (7).

(7)¶

By respecting

*guard*and to be integers, an upward rounding of this optimum, to fulfill the optimization constraints, yields the first equation of Theorem 2. With the first equation of Theorem 2 and equation (4) an upper bound for is found:(8)¶

A lower bound is obtained by combining and equation (3):

(9)¶

Respecting the integer property of and by combining the Equations (8) and (9), the second equation of Theorem 2 is derived. Inserting the first equation of Theorem 2 into Equation (2) yields the third equation of Theorem 2. ∎

**Assumption 3.**- In the first equation of Theorem 2,
*guard*is maximized.

With Theorem 2 only an equation for the sum of *guard* and
was derived. As earlier described, *guard* is an extension , at
the cost of , which is of minor importance. Therefore it is more
desirable to maximize *guard* (Assumption 3). This allows to define *guard*
more precisely in Theorem 3.

**Theorem 3.**- Under the Assumption 3 and with Theorem 2 it holds .
- Proof.
According to Theorem 2 is constant. This means maximizing

*guard*is equivalent to minimizing . A lower bound for is given in Equation (3). Combined with the upper bound of*shift*from the second equation of Theorem 2, yields:(10)¶

Due to the minimization of , (10) becomes an equation. This inserted into the first equation of Theorem 2, proofs Theorem 3. ∎

All possible relations between *shift* and *p* can be seen in Figure
All possible ternary partitions for a given shift. Note that p = 3
\cdot shift - 2 violates the upper bound of shift in Theorem 2.. In the
following, the number of addends, that can be accumulated without loosing any
significant bits, are described by Theorem 4.

**Theorem 4.**- Given the bucket partition of Theorem 2, up to additions to a bucket can be performed without violating Axiom 1.
- Proof.
- Without loss of generality, the by magnitude largest allowed number to be added to a bucket with a “unit in the first place” is . The bucket is initialized with , thus it will not overflow for . ∎

Finally a complexity analysis of *BucketSum* (Algorithm BucketSum)
similar to that one in [Hayes2010] should be done. For each of the *N* addends
the following operations have to be performed:

- Bucket determination (line 8): 3 FLOP s [2]
*FastTwoSum*(line 9): 3 FLOP s- Error summation (line 10): 1 FLOP

After *C2* steps, the overflow bucket has to be tidied up, that requires two
additional FLOP s (lines 18-21). After *C1* steps, all *M - 2* buckets
need to be tidied up. This requires five additional FLOP s (lines 11-17)
per bucket as well. Once in the end an unmasking has to happen with *M - 1*
FLOP s (lines 23-25) and for the final sum up, Algorithm
Modified Kahan’s cascaded and compensated summation (line 26) requires
more FLOP s. All in all

are required. Assume a large number of addends . Then the final
sum up part has a small static contribution to the complexity, thus it can be
neglected. If the buckets don’t have to be tidied up during the summation for
small , an overall complexity of *7N* remains. Even if
holds, the effort for tiding up is small compared to
the seven FLOP s, that always have to be performed. Thus the complexity
of *BucketSum* is considered to be *7N*.

### Realization for binary64¶

The binary64 type has the precision *p = 53*. Therefore we get
and a significant
partitioning by Theorem 2 and Theorem 3, as shown in Figure
Significant partition for binary64..

Also one can no longer assume an infinite exponent range. Assumption 1 has to
be replaced by a concrete bucket alignment. This alignment consists of three
parts, the under- and overflow and the normal bucket part. The anchor for the
alignment is, that the least significant bit of the *shift* part of the first
normal bucket *a[0]* has the significance of the biased exponent *0*. This
anchor has been chosen, because no binary64, even the subnormal numbers
with a biased exponent of *0*, can be accumulated into a bucket smaller than
*a[0]*. All in all to cover the full exponent range of binary64, one
needs buckets. Beginning
with the first normal bucket *a[0]*, each following bucket is aligned with a
unit in the first place of *shift* bigger than its predecessor. The maximal
multiple of *shift* that fits in this pattern is . Therefore we define the topmost bucket to be an
overflow bucket. This bucket is responsible for values with a unit in the first
place of greater than , but these values are ignored in this
work. With an unreasonable effort, this overflow situation can be handled
differently. The second overflow bucket needs an exceptional alignment as well.
Its is smaller due to upper limit of the binary64
exponent range . Because of the alignment of *a[0]* and
Assumption 2, two additional error buckets for the underflow range are
required. For the underflow range , bucket
*a[-1]* follows the alignment scheme of the normal buckets and bucket *a[-2]* is
responsible for the remaining bit positions. The exponent range partition is
illustrated in Equation (11). Graphical
visualizations of the bucket alignment in the under- and overflow range are
given in the Figures Visualization of the bucket alignment in the underflow range. and
Visualization of the bucket alignment in the overflow range..

(11)¶

Finally the accumulation reserve for the normal and underflow buckets is according to Theorem 4 smaller than . For the first overflow bucket one obtains analogue to Theorem 4 an accumulation reserve of less than .

### Implementation¶

One essential element of this project is the efficient implementation of
*BucketSum*. This chapter deals with all implementation details and changes to
the pseudo-code from Algorithm BucketSum. Some potential improvements
to a floating-point using software are described in Chapters Software and compiler support and Performance. The in Chapter
Performance presented technique of partial loop unrolling can be used
to obtain an elegant side effect for the tidy up and sum up steps. In Algorithm
BucketSum all buckets are initialized with an appropriate mask. This
mask has to be considered in the tidy up process (lines 13 and 19-20) and it has
to be removed before the final sum up (lines 23-25). If two different bucket
arrays *a1* and *a2* are used, *a1* uses the masks as described in Algorithm
BucketSum and *a2* uses the negative masks. In that way the exact sum
of the unmasked values of the buckets *i* can be computed by *a1[i] + a2[i]*.
This way the number of floating-point operations dealing with masking and
unmaking are reduced a lot. Additionally the partial loop unrolling increases
the instruction-level parallelism and finally increases the tidy up values by a
factor of two. This means that less tidy up “interruptions” for the *N* addends
are required.

Another considered optimization is the avoidance of the division by the *shift*
in Algorithm BucketSum line 8. An integer division is an expensive
operation compared to multiplication and bit shifting. In [Fog2014] (p. 54-58)
one can find latencies for several instructions. For the AMD “Piledriver” the
latency for a signed or unsigned division ((I)DIV [AMD2013b] (Chapter 3))
ranges from 12-71 clock cycles. Compared to this the sum of the latencies of a
left or right bit shift (SHL/SHR [AMD2013b] (Chapter 3)) with one clock cycle
and a signed or unsigned multiplication ((I)MUL [AMD2013b] (Chapter 3)) with
4-6 clock cycles is by far smaller. As this division by the *shift* has to be
done for each addends exponent, a small speed up could be archived by replacing
the division by a multiplication followed by a bit shift, as shown in Listing
Division by 18 replacement. The idea behind the values of Listing
Division by 18 replacement is an integer optimization problem.

(12)¶

For normal and subnormal binary64 the exponents range from 0 to 2046 and
the desired division should be a cheap bit shift, thus a power of two.
Therefore the task is to find for the smallest possible power of two *y* some
minimal . This *x* was
found with the program of Listing Program to solve the integer optimization problem (Equation eq-Division by 18 optimization problem)..

1 2 3 4 5 6 | double d = 1.0; // Exponent extraction unsigned position = ((ieee754_double *)(&d))->exponent; // Perform equivalent operations unsigned pos1 = position / 18; unsigned pos2 = (position * 1821) >> 15; |

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | #include <cmath> #include <iostream> int main () { unsigned div = 1; // Try some powers of two (div = 2^y) for (unsigned y = 1; y < 32; y++) { div *= 2; unsigned x = (unsigned) std::ceil ((double) div / 18.0); // Test all exponents of the IEEE 754 - 2008 binary64 normal and // subnormal range int is_valid = 1; for (unsigned i = 0; i < 2047; i++) { is_valid &= ((i / 18) == ((i * x) / div)); } if (is_valid) { std::cout << "Found: " << x << " / 2^" << y << std::endl; } } return 0; } |

## Benchmark¶

The benchmark program compares the five summation algorithms of Table Comparison of summation algorithms for input data length N with their source of implementation mentioned in brackets. The accurate summation results of the C-XSC toolbox will be used as reference values for the five types of test data.

Algorithm | FLOP s | Run-time | Space |
---|---|---|---|

Ordinary Recursive Summation (Algorithm Recursive summation) |
1 | ||

SumK (K = 2, [Lathus2012]) |
2-3 | ||

iFastSum ([Hayes2010]) |
3-5 | ||

OnlineExactSum ([Hayes2010]) |
4-6* | ||

BucketSum (Algorithm BucketSum) |
1-2* |

An asterisk “*” in Comparison of summation algorithms for input data length N indicates the
use of instruction-level parallelism, a dagger “”, that the
results for Data 3 were omitted, and a double dagger “”, that
this applies only for large dimensions. The test data for the summation
benchmark program is chosen similar to [Hayes2010]. **Data 1** are *N* random,
positive floating-point numbers, all with an exponent of . Thus
Data 1 is pretty well-conditioned . **Data 2**
is ill-conditioned. The exponents are distributed uniformly and randomly between
and , the signs are assigned randomly and the
significant is filled randomly as well. **Data 3** is similar to Data 2, but its
sum is exactly zero. **Data 4** is Anderson’s ill-conditioned data
[Anderson1999]. And finally **Data 5** is designed to especially stress the
accumulation reserve of *BucketSum*. A visualization of that test case is given
in Figure Visualization of the stress test case for roundToNearest..

For time measurement the *clock()* function [ISO-IEC-9899-2011] (Chapter
7.27.2.1) [ISO-IEC-14882-2011] (Chapter 20.11.8) is used. To keep the time
measurement as accurate as possible, all memory operations like array creation
and destruction should be kept outside of time measuring code blocks. On the
other hand, if the size of the input data *N* was chosen too small, the measured
time is too inaccurate. This requires a certain number of repeated operations
*R*, to obtain detectable results. But some algorithms like *SumK* and
*iFastSum* operate inline on the input data. Thus providing a single copy of the
data will not suffice to get identical initial conditions for each repetition.
To meet all these constraints, a large copy of elements
for summation is created, each of the repeated *N* elements with a leading zero,
as *iFastSum* and *BucketSum* imitate Fortran indexing. The systems available
main memory creates another constraint on the maximum test case size . This product should not exceed the test systems 8 GB of main
memory, otherwise the timings will become inaccurate due to swapping to hard
disk. This means for *R = 1* repetitions the theoretical maximum test case size
can be

Experimental test runs revealed, that about elements are necessary in order to obtain detectable results. Therefore the following data lengths and repetitions are defined:

- Middle dimension: elements with repetitions
- Large dimension: elements with repetitions

The benchmarks (see Figures above) show, that *BucketSum* performs best for all
given kinds of data. *BucketSum* is by factor 2-3 slower than the Ordinary
Recursive Summation and is slightly faster than *SumK* (with *K = 2*). For
middle and large data lengths *BucketSum* scales linear in contrast to
*OnlineExactSum*, which starts to scale linear at a data length of about
elements. Another interesting observation is, that
*OnlineExactSum* is dependent on the condition of the input data
for small data lengths. For *iFastSum*,
*OnlineExactSum* and *BucketSum* the results have been compared to that one of
the C-XSC toolbox using an *assert()* [ISO-IEC-14882-2011] (Chapter 19.3)
statement, thus any inaccurate result would have interrupted the benchmark. As
no interruptions occurred, all three algorithms are assumed to deliver correctly
rounded sums. The most important properties of the algorithms under test are
summarized in Table Comparison of summation algorithms for input data length N, which is a
modified extension of [Hayes2010].

Footnotes

[1] | http://www2.math.uni-wuppertal.de/wrswt/xsc/cxsc_new.html |

[2] | For simplicity integer operations are counted as FLOP s. |