Accurate summation¶
Accurate summation of two or more addends in floating-point arithmetic is not a trivial task. It has been an active field of research since the introduction of computers using floating-point arithmetic and has resulted in many different approaches, which are only sketched in Chapter Previous work. The accurate computation of sums is not only the basis for the chapter about accurate inner products, it is also important for residual iterations, geometrical predicates, computer algebra, linear programming and multiple precision arithmetic in general, as motivated in [Rump2009].
Previous work¶
Presorting the addends¶
One of the easiest summation algorithms is recursive summation (Algorithm Recursive summation). Higham describes in [Higham2002] (Chapter 4) that for this simple algorithm the order of the addends has a huge impact on the accuracy of the resulting sum. The problem with presorting the data is that even fast sorting algorithms have a complexity of O(N log N), and in many applications no information about the addends is available in advance. But the analysis techniques and results from [Higham2002] (Chapter 4) can be applied later to other algorithms. Additional tools for error analysis are given in [Brisebarre2010] (Chapter 6.1.2). For the purpose of error analysis Higham introduces a condition number for the addends:
(1)¶
\mathrm{cond}\left(\sum_{i=1}^{N} x_i\right) = \frac{\sum_{i=1}^{N} |x_i|}{\left|\sum_{i=1}^{N} x_i\right|}
function [s] = RecursiveSummation(x, N)
    s = x(1);
    for i = 2:N
        s = s + x(i);
    end
end
If all addends have the same sign, then the condition number (1) equals one. In this case an increasing ordering is suggested. When the numbers of largest magnitude are summed up first, all smaller addends will not influence the final result, even if all small addends together would have a sufficiently large magnitude to contribute to the final result as well. If the signs of the largest addends differ, then the condition number (1) becomes large and heavy cancellation happens when computing the sum. Such summation data is called ill-conditioned, following [Higham2002] (Chapter 4), and often has an exact result of zero. For this kind of addends Higham suggests a decreasing ordering.
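The effect of such cancellation can be seen in a tiny example (the values are chosen here for illustration and do not appear in the cited works): the exact sum of the three addends below is 1, yet recursive summation in binary64 returns 0, because 10^16 + 1 is rounded back to 10^16.

#include <cstdio>

int main ()
{
  double x[] = {1.0e16, 1.0, -1.0e16};   // exact sum: 1.0, condition number (1) is about 2e16
  double s = x[0];
  for (int i = 1; i < 3; i++)
    s += x[i];                           // fl(1e16 + 1) == 1e16, so the 1 is lost
  std::printf ("%g\n", s);               // prints 0
  return 0;
}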
Error-free transformation and distillation¶
The most basic error-free summation algorithms transform two input addends, e.g. a and b, into a floating-point sum x and a floating-point approximation error y. These algorithms are called error-free transformations [Ogita2005] if the following is satisfied: x = fl(a + b) and x + y = a + b.
It is easy to see that if a and b are floating-point numbers, the approximation error y is a floating-point number as well. The two most prominent instances fulfilling the error-free transformation property are FastTwoSum (Algorithm Error-free transformation FastTwoSum) by Dekker [Ogita2005] and TwoSum (Algorithm Error-free transformation TwoSum) by Knuth [Ogita2005]. FastTwoSum requires three and TwoSum six FLOPs. FastTwoSum additionally requires |a| >= |b| and thus, unavoidably, an expensive branch. So one has to check for the individual use case whether three additional FLOPs exceed the cost of a branch, see [Ogita2005] and [Brisebarre2010] (Chapter 4.3). There exists another algorithm by Priest, which will not be taken into account. It also requires |a| >= |b|, thus a branch, and more FLOPs than the two already presented algorithms [Brisebarre2010] (Chapter 4.3.3).
function [x, y] = FastTwoSum (a, b)
    x = fl(a + b);
    y = fl(fl(a - x) + b);
end
function [x, y] = TwoSum (a, b)
    x = fl(a + b);
    z = fl(x - a);
    y = fl(fl(a - fl(x - z)) + fl(b - z));
end
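For illustration, both transformations can be written down directly in C++. This is only a minimal sketch: it assumes binary64 arithmetic with rounding to nearest and a compiler that does not reassociate floating-point expressions (e.g., no -ffast-math); the function names are chosen here and do not stem from the cited references.

#include <cassert>
#include <cmath>

// Requires |a| >= |b|; then x + y == a + b holds exactly.
inline void fast_two_sum (double a, double b, double &x, double &y)
{
  assert (std::abs (a) >= std::abs (b));
  x = a + b;
  y = (a - x) + b;               // exact rounding error of the addition
}

// No precondition on a and b; x + y == a + b holds exactly.
inline void two_sum (double a, double b, double &x, double &y)
{
  x = a + b;
  double z = x - a;
  y = (a - (x - z)) + (b - z);   // exact rounding error of the addition
}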
The extension of error-free transformations from 2 to N addends is called distillation [Ogita2005].
Distillation means that in each distillation step k the values of the N individual addends may change, but not their sum. The goal of distillation algorithms is that after a finite number of steps, say k steps, one of the transformed addends approximates the exact sum of all N addends [Higham2002] (Chapter 4.4). The error-free transformation and distillation properties are preliminaries for cascaded and compensated summation.
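As an illustration, a single distillation step can be realized by sweeping TwoSum once over the vector of addends. The following minimal C++ sketch reuses the two_sum helper from the sketch above (so the same assumptions apply); after the sweep, the exact sum of the vector is unchanged and its last element holds the recursively computed floating-point approximation of that sum.

#include <vector>

// One distillation step: transforms x in place, preserving the exact sum.
void distill_once (std::vector<double> &x)
{
  for (std::size_t i = 1; i < x.size (); i++)
    two_sum (x[i - 1], x[i], x[i], x[i - 1]);   // x[i]: running sum, x[i - 1]: its error
}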
Cascaded and compensated summation¶
If FastTwoSum (Algorithm Error-free transformation FastTwoSum) or TwoSum (Algorithm Error-free transformation TwoSum) is successively applied to all elements of a vector of addends, this is called cascaded summation. When each new addend gets corrected by the previously computed error of FastTwoSum (like in line 5 of Algorithm Kahan’s cascaded and compensated summation), it is called compensated summation. The notation with explicit usage of FastTwoSum has been introduced in [Brisebarre2010] (Algorithm 6.7). Algorithm Kahan’s cascaded and compensated summation relies on input data sorted by decreasing magnitude, because of the internal usage of FastTwoSum.
% x: array of sorted addends
% N: number of addends in x
% s: computed sum
1   function [s] = KahansCompensatedSummation (x, N)
2       s = x(1);
3       e = 0;
4       for i = 2:N
5           y = fl(x(i) + e);             % compensation step
6           [s, e] = FastTwoSum (s, y);
7       end
8   end
Rump, Ogita and Oishi present in [Ogita2005] another interesting algorithm, namely SumK, which repeats the distillation k - 1 times, followed by a final recursive summation. The authors have shown that after the (k - 1)-th repetition of the cascaded summation, the result is as accurate as if it had been computed in k-fold working precision.
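A minimal C++ sketch of this scheme for k = 2 is given below. It builds on the distill_once helper from the previous sketch and is only meant to illustrate the structure of SumK (k - 1 distillation sweeps followed by ordinary recursive summation); it is not the reference implementation of [Ogita2005].

#include <vector>

// Simplified SumK: k - 1 distillation steps, then recursive summation.
double sum_k (std::vector<double> x, int k = 2)
{
  for (int step = 1; step < k; step++)
    distill_once (x);
  double s = 0.0;
  for (double v : x)
    s += v;                        // final ordinary recursive summation
  return s;
}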
Long and cascaded accumulators¶
A completely different approach is not to look for ways to cope with the errors of floating-point arithmetic, but to change the precision at the hardware level. To this end Kulisch and Miranker proposed the usage of a long high-precision accumulator in hardware [Kulisch1986]. This approach has not been realized so far by common hardware vendors. In his book Kulisch comprehensively describes the realization of scalar product computation units (SPU) for common 32 and 64 bit PC architectures or as coprocessors [Kulisch2013] (Chapter 8). Kulisch reports on two more or less successful attempts at coprocessor realizations, the most recent one with a Field Programmable Gate Array (FPGA) [Kulisch2013] (Chapter 8.9.3). The major issue is the time penalty of the much slower FPGA clock rates. But as there is much progress in that field of research, and with intelligent usage of parallelism, it might be possible to create an SPU that is comparable to today's CPU floating-point units. Nevertheless the idea of the long accumulator resulted in a C++ toolbox called C-XSC [1], which is currently maintained by the University of Wuppertal. The C-XSC toolbox has been developed for several years and is thoroughly tested; therefore its version 2.5.3 will be used as reference for checking the correctness of the computed results later in Chapter Benchmark.
Another interesting approach came up in a paper by Malcolm [Malcolm1971], who took up Wolfe's idea of cascaded accumulators. Malcolm modified this idea by splitting each addend in order to add each split part to an appropriate accumulator. Finally all accumulators are summed up in decreasing order of magnitude using ordinary recursive summation. This case was treated in Chapter Presorting the addends and results in a small relative error [Higham2002] (Chapter 4.4).
Hybrid techniques¶
Zhu and Hayes published the accurate summation algorithm OnlineExactSum [Hayes2010]. The behavior of this algorithm is claimed to be independent of the number of addends and of the condition number (see Equation (1)) of the input. Furthermore the results of OnlineExactSum are correctly rounded. OnlineExactSum has a constant memory footprint, as it uses a fixed number of cascaded accumulators. Each addend's exponent is examined to choose an appropriate accumulator, and the accumulation is done by Dekker's error-free transformation algorithm FastTwoSum. In contrast to Malcolm's approach, the final sum up of the cascaded accumulators is done by iFastSum [Hayes2009], a distillation algorithm like SumK. In their paper [Hayes2010] Zhu and Hayes prove the correctness of their algorithm by showing that no accumulator loses any significant digits, and by the correctly rounded result of iFastSum for the final sum up. With various run time tests for several types of input data they verified the stable and predictable behavior of OnlineExactSum. With this survey of previous work completed, a new algorithm will be proposed in the following chapter. Many ideas for the proposed algorithm accrued from this previous work and are extended by the new approach.
BucketSum¶
Generic description¶
The proposed algorithm BucketSum performs basically two steps, which will be explained comprehensively in this chapter:
- Sort and accumulate: the N addends are sorted by magnitude and accumulated into M buckets. All significant bits are preserved.
- Summing up buckets: compute an accurate sum of the M buckets.
This approach is already known from Zhu and Hayes' algorithm HybridSum [Hayes2009] and from Malcolm [Malcolm1971]. Instead of increasing the precision of the accumulators, the input data is split and stored in several shorter accumulators. So each binary64 number can be seen as an extended-precision accumulator for the reduced input data precision. Like in HybridSum [Hayes2009], an array of binary64 numbers is created, each element accumulating a certain part of the full binary64 exponent range. Each element of that array will be called “bucket” in this chapter. To give an overall picture, the algorithms for steps 1 and 2 are presented first. The algorithm for step 2 is a slight modification of Kahan’s cascaded and compensated summation (Algorithm Kahan’s cascaded and compensated summation). The compensation step has been taken out of the for-loop to reduce the data dependency. In this modified version (Algorithm Modified Kahan’s cascaded and compensated summation) all summation errors are accumulated inside the for-loop for the final compensation step. Additionally an initial value for the resulting sum has been introduced.
function [s] = ModifiedKahanSum (s, x, N)
    err = 0;
    for i = 1:N
        [s, e] = FastTwoSum (s, x(i));
        err = fl(err + e);
    end
    s = fl(s + err);    % compensation step
end
 1  function [s] = BucketSum (x, N)
        % Create appropriate masks
 2      mask = CreateMasks (M);
 3      mask(1) = 0;
 4      mask(M) = NaN;
        % Create array of M buckets, initialized with their mask.
        % a(1:2) are underflow and a((M - 1):M) are overflow buckets,
        % a(3:(M - 2)) cover SHIFT exponents each.
 5      a = mask;
 6      sum = 0;
 7      for i = 1:N
 8          pos = ceil (exp(x(i)) / SHIFT) + 2;   % exp(): extracts the biased exponent
 9          [a(pos), e] = FastTwoSum (a(pos), x(i));
10          a(pos - 2) = fl(a(pos - 2) + e);
11          if (mod (i, C1) == 0)                 % C1: capacity of normal buckets
12              for j = 1:(M - 2)                 % Tidy up normal buckets
13                  r = fl(fl(mask(j + 1) + fl(a(j) - mask(j))) - mask(j + 1));
14                  a(j + 1) = fl(a(j + 1) + r);
15                  a(j) = fl(a(j) - r);
16              end
17          end
18          if (mod (i, C2) == 0)                 % C2: capacity of overflow buckets
19              sum = fl(sum + fl(a(M - 1) - mask(M - 1)));   % Tidy up overflow bucket
20              a(M - 1) = mask(M - 1);
21          end
22      end
23      for i = 1:(M - 1)                         % Remove masks
24          a(i) = a(i) - mask(i);
25      end
26      s = ModifiedKahanSum (sum, a((M - 1):-1:1), M - 1);
27  end
BucketSum is responsible for step 1 and presented in Algorithm BucketSum. What distinguishes BucketSum from OnlineExactSum is the ternary partitioning of each floating-point accumulator (bucket). This partitioning is done in order to achieve a certain cascaded, overlapping pattern for the accurate summation, as Figure Error storage scenario of the smallest allowed addend into bucket i - 2. shows. The “distance” between two neighboring buckets is called shift and is identical to the length of the accumulated exponent range in the partitioning. From Figure Generic significant partition. one can see that the whole generic partitioning pattern consists of the following parts, whose lengths are determined in Theorem 2:
- Two set bits in the beginning.
- Accumulation reserve: this length decides the number of addends that can be accumulated into a bucket without losing any significant bits.
- Variable extension (guard): guard is a variable extension of the following part.
- Accumulated exponent range: each bucket is assigned to accumulate addends of a certain exponent sub range, and this part has the length of that range. Therefore, this length is at least one, otherwise no addend could be added to this bucket.
- Residual: this part is only used to preserve the remaining significant bits of an addend.
Another characteristic of BucketSum is that there is no fixed splitting of the input addends like in HybridSum [Hayes2009]. The splitting is performed dynamically by FastTwoSum, as one can see in Algorithm BucketSum line 9. After this overview of BucketSum, a more detailed description of the algorithm follows, starting with a formal analysis of the bucket partitioning.
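The following small C++ example illustrates this dynamic splitting (the concrete numbers are chosen here for demonstration only). A bucket whose leading bit pattern "11" sits at the weights 2^53 and 2^52 has a unit in the last place of 2; adding the addend 5.5 with FastTwoSum therefore absorbs the part 6 into the bucket and produces the error -0.5, which is passed on to bucket i - 2.

#include <cmath>
#include <cstdio>

int main ()
{
  double bucket = std::ldexp (3.0, 52);   // leading "11" pattern, ulp(bucket) == 2
  double x = 5.5;                         // addend to accumulate
  // FastTwoSum (the precondition |bucket| >= |x| holds here):
  double s = bucket + x;                  // high part of x is absorbed by the bucket
  double e = (bucket - s) + x;            // low part of x, destined for bucket i - 2
  std::printf ("absorbed: %.1f, error: %.1f\n", s - bucket, e);   // 6.0 and -0.5
  return 0;
}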
- Axiom 1.
- The “unit in the first place” (see Section Rounding) of a bucket is immutable except for the underflow or overflow range.
Axiom 1 means that during the summation process the significance of the most significant bit of each bucket may not change. Otherwise it is not possible to rely on a fixed exponent range for each bucket. To achieve that, the leading bit pattern “11” has been introduced. Given the weight of the most significant bit of bucket i, each number of sufficiently smaller magnitude may be added or subtracted without changing the significance of the first bit of the bucket. This property is well known from integer arithmetic.
- Assumption 1.
- The exponent range of floating-point numbers is unlimited.
Assumption 1 allows us to ignore the under- and overflow range for now. These two ranges will be treated in Section Realization for binary64 for the special case of binary64 values.
- Theorem 1.
- The summation error of bucket i has to be added at least to bucket i - 2.
- Proof.
- The proof for Theorem 1 will be done graphically in Figure Four possible examples for partitioning and storing the error of the smallest allowed addend into the neighbouring bucket.. In that figure it is obvious that, independent of the chosen bit lengths, the full bit precision p of the addend cannot be preserved. Therefore the summation error of bucket i has to be added at least to bucket i - 2, as shown in Figure Error storage scenario of the smallest allowed addend into bucket i - 2.. In Algorithm BucketSum, line 10, this action is performed. ∎
- Assumption 2.
- The summation error of bucket i is added to bucket i - 2.
The choice of the error bucket depends on the size of shift. The “further away” the error bucket is, the smaller shift has to be, as one might deduce from Figure Error storage scenario of the smallest allowed addend into bucket i - 2.. And the smaller shift is, the more buckets are required. Assumption 2 takes the first possible error bucket according to Theorem 1, in order to reduce the number of required buckets.
- Theorem 2.
For Axiom 1, Assumption 1, and Assumption 2, the following rules have to apply to the lengths of the partition parts, in order to get a ternary bucket partition that maximizes the lengths of the accumulation reserve and of the accumulated exponent range:
- Proof.
From Figure Error storage scenario of the smallest allowed addend into bucket i - 2. three useful equations can be derived:
(2)¶
(3)¶
(4)¶
By reformulating Equation (2) to , one can derive two equivalent objective functions . For the latter one, a constrained optimization problem is given in (5).
(5)¶
The optimization problem (5) can be relaxed to problem (6) by additionally combining the first two constraints.
(6)¶
As has the larger factor in the first constraint of (6), an optimum can be obtained for in Equation (7).
(7)¶
By requiring guard and the remaining lengths to be integers, an upward rounding of this optimum, to fulfill the optimization constraints, yields the first equation of Theorem 2. With the first equation of Theorem 2 and Equation (4), an upper bound for shift is found:
(8)¶
A lower bound for shift is obtained by additionally using Equation (3):
(9)¶
Respecting the integer property of shift and combining Equations (8) and (9), the second equation of Theorem 2 is derived. Inserting the first equation of Theorem 2 into Equation (2) yields the third equation of Theorem 2. ∎
- Assumption 3.
- In the first equation of Theorem 2, guard is maximized.
With Theorem 2, only an equation for the sum of guard and one further length was derived. As described earlier, guard is an extension that comes at a cost of minor importance. Therefore it is more desirable to maximize guard (Assumption 3). This allows guard to be defined more precisely in Theorem 3.
- Theorem 3.
- Under the Assumption 3 and with Theorem 2 it holds .
- Proof.
According to Theorem 2, the sum in its first equation is constant. This means maximizing guard is equivalent to minimizing the other length, for which a lower bound is given in Equation (3). Combining this with the upper bound of shift from the second equation of Theorem 2 yields:
(10)¶
Due to the minimization, (10) becomes an equality. Inserting this into the first equation of Theorem 2 proves Theorem 3. ∎
All possible relations between shift and p can be seen in Figure All possible ternary partitions for a given shift. Note that p = 3 · shift - 2 violates the upper bound of shift in Theorem 2.. In the following, the number of addends that can be accumulated without losing any significant bits is described by Theorem 4.
- Theorem 4.
- Given the bucket partition of Theorem 2, up to additions to a bucket can be performed without violating Axiom 1.
- Proof.
- Without loss of generality, consider the largest addend (by magnitude) that is allowed to be added to a bucket with a given “unit in the first place”. The bucket is initialized with its mask, thus it will not overflow for the number of additions stated in Theorem 4. ∎
Finally, a complexity analysis of BucketSum (Algorithm BucketSum), similar to the one in [Hayes2010], is carried out. For each of the N addends the following operations have to be performed:
- Bucket determination (line 8): 3 FLOPs [2]
- FastTwoSum (line 9): 3 FLOPs
- Error summation (line 10): 1 FLOP
Every C2 addends, the overflow bucket has to be tidied up, which requires two additional FLOPs (lines 18-21). Every C1 addends, all M - 2 normal buckets need to be tidied up, which requires five additional FLOPs (lines 11-17) per bucket. Once, at the end, the masks have to be removed with M - 1 FLOPs (lines 23-25), and the final sum up with Algorithm Modified Kahan’s cascaded and compensated summation (line 26) requires further FLOPs. All in all, 7N FLOPs plus these tidy up and sum up contributions
are required. Assume a large number of addends, N ≫ M. Then the final sum up part has a small static contribution to the complexity, thus it can be neglected. If the buckets do not have to be tidied up during the summation for small N, an overall complexity of 7N remains. Even if tidying up becomes necessary, its effort is small compared to the seven FLOPs that always have to be performed per addend. Thus the complexity of BucketSum is considered to be 7N.
Realization for binary64¶
The binary64 type has the precision p = 53. Therefore we get shift = 18 and a significand partitioning by Theorem 2 and Theorem 3, as shown in Figure Significant partition for binary64..
Also, one can no longer assume an unlimited exponent range. Assumption 1 has to be replaced by a concrete bucket alignment. This alignment consists of three parts: the underflow, the overflow and the normal bucket part. The anchor for the alignment is that the least significant bit of the shift part of the first normal bucket a[0] has the significance of the biased exponent 0. This anchor has been chosen because no binary64 value, not even the subnormal numbers with a biased exponent of 0, can be accumulated into a bucket smaller than a[0]. All in all, M buckets are needed to cover the full exponent range of binary64. Beginning with the first normal bucket a[0], each following bucket is aligned with a unit in the first place that is shift exponents bigger than that of its predecessor. The topmost bucket, beyond the maximal multiple of shift that fits into this pattern, is defined to be an overflow bucket. This bucket is responsible for values with an even larger unit in the first place, but such values are ignored in this work; handling this overflow situation differently would require an unreasonable effort. The second overflow bucket needs an exceptional alignment as well: its accumulated exponent range is smaller due to the upper limit of the binary64 exponent range. Because of the alignment of a[0] and Assumption 2, two additional error buckets for the underflow range are required. For the underflow range, bucket a[-1] follows the alignment scheme of the normal buckets and bucket a[-2] is responsible for the remaining bit positions. The exponent range partition is illustrated in Equation (11). Graphical visualizations of the bucket alignment in the under- and overflow range are given in the Figures Visualization of the bucket alignment in the underflow range. and Visualization of the bucket alignment in the overflow range..
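As a small worked example (using shift = 18 and the index formula from Algorithm BucketSum line 8; the exponent value is chosen here for illustration): an addend with the biased exponent 100 is assigned to bucket position ceil(100 / 18) + 2 = 6 + 2 = 8, and its accumulation error is passed on to bucket position 8 - 2 = 6.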
(11)¶
Finally, the accumulation reserve for the normal and underflow buckets is bounded according to Theorem 4. For the first overflow bucket, an accumulation reserve is obtained analogously to Theorem 4.
Implementation¶
One essential element of this project is the efficient implementation of BucketSum. This chapter deals with all implementation details and changes to the pseudo-code of Algorithm BucketSum. Some potential improvements for software using floating-point arithmetic are described in the Chapters Software and compiler support and Performance. The technique of partial loop unrolling presented in Chapter Performance can be used to obtain an elegant side effect for the tidy up and sum up steps. In Algorithm BucketSum all buckets are initialized with an appropriate mask. This mask has to be considered in the tidy up process (lines 13 and 19-20) and it has to be removed before the final sum up (lines 23-25). If two different bucket arrays a1 and a2 are used, a1 uses the masks as described in Algorithm BucketSum and a2 uses the negated masks. In that way the exact sum of the unmasked values of the buckets i can be computed by a1[i] + a2[i]. This way the number of floating-point operations dealing with masking and unmasking is reduced considerably. Additionally, the partial loop unrolling increases the instruction-level parallelism and doubles the effective tidy up capacities. This means that fewer tidy up “interruptions” are required for the N addends.
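To see why the masks cancel in a1[i] + a2[i], write v1_i and v2_i for the values accumulated (without the mask) into bucket i of a1 and a2, respectively; this notation is introduced here only for illustration:

a1[i] + a2[i] = (mask_i + v1_i) + (-mask_i + v2_i) = v1_i + v2_i

Since a1[i] and a2[i] carry the mask with opposite signs, the masks drop out of the sum and only the accumulated contributions of the two bucket arrays remain.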
Another considered optimization is the avoidance of the division by the shift in Algorithm BucketSum line 8. An integer division is an expensive operation compared to multiplication and bit shifting. In [Fog2014] (p. 54-58) one can find latencies for several instructions. For the AMD “Piledriver” the latency of a signed or unsigned division ((I)DIV [AMD2013b] (Chapter 3)) ranges from 12-71 clock cycles. Compared to this, the sum of the latencies of a left or right bit shift (SHL/SHR [AMD2013b] (Chapter 3)) with one clock cycle and a signed or unsigned multiplication ((I)MUL [AMD2013b] (Chapter 3)) with 4-6 clock cycles is by far smaller. As this division by the shift has to be done for each addend's exponent, a small speed up could be achieved by replacing the division with a multiplication followed by a bit shift, as shown in Listing Division by 18 replacement. The idea behind the values of Listing Division by 18 replacement is an integer optimization problem.
(12)¶
\left\lfloor \frac{e}{18} \right\rfloor = \left\lfloor \frac{e \cdot x}{2^{y}} \right\rfloor \quad \text{for all } e \in \{0, 1, \ldots, 2046\}
For normal and subnormal binary64 values the biased exponents e range from 0 to 2046, and the desired division should be replaced by a cheap bit shift, thus a division by a power of two. Therefore the task is to find, for the smallest possible power of two 2^y, some minimal multiplier x satisfying (12). This x was found with the program of Listing Program to solve the integer optimization problem (Equation (12)).
double d = 1.0;
// Exponent extraction
unsigned position = ((ieee754_double *)(&d))->exponent;
// Perform equivalent operations
unsigned pos1 = position / 18;
unsigned pos2 = (position * 1821) >> 15;
#include <cmath>
#include <iostream>

int main ()
{
  unsigned div = 1;
  // Try some powers of two (div = 2^y)
  for (unsigned y = 1; y < 32; y++)
    {
      div *= 2;
      unsigned x = (unsigned) std::ceil ((double) div / 18.0);
      // Test all exponents of the IEEE 754-2008 binary64 normal and
      // subnormal range
      int is_valid = 1;
      for (unsigned i = 0; i < 2047; i++)
        {
          is_valid &= ((i / 18) == ((i * x) / div));
        }
      if (is_valid)
        {
          std::cout << "Found: " << x << " / 2^" << y << std::endl;
        }
    }
  return 0;
}
Benchmark¶
The benchmark program compares the five summation algorithms of Table Comparison of summation algorithms for input data length N, with the source of each implementation mentioned in brackets. The accurate summation results of the C-XSC toolbox are used as reference values for the five types of test data.
Algorithm | FLOPs | Run-time | Space
---|---|---|---
Ordinary Recursive Summation (Algorithm Recursive summation) | 1 | |
SumK (K = 2, [Lathus2012]) | 2-3 | |
iFastSum ([Hayes2010]) | 3-5 | |
OnlineExactSum ([Hayes2010]) | 4-6* | |
BucketSum (Algorithm BucketSum) | 1-2* | |
An asterisk “*” in Table Comparison of summation algorithms for input data length N indicates the use of instruction-level parallelism, a dagger “†” that the results for Data 3 were omitted, and a double dagger “‡” that this applies only for large dimensions. The test data for the summation benchmark program is chosen similarly to [Hayes2010]. Data 1 consists of N random, positive floating-point numbers, all with the same fixed exponent. Thus Data 1 is pretty well-conditioned. Data 2 is ill-conditioned: the exponents are distributed uniformly and randomly over a large exponent range, the signs are assigned randomly, and the significands are filled randomly as well. Data 3 is similar to Data 2, but its sum is exactly zero. Data 4 is Anderson's ill-conditioned data [Anderson1999]. And finally Data 5 is designed to especially stress the accumulation reserve of BucketSum. A visualization of that test case is given in Figure Visualization of the stress test case for roundToNearest..
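As an illustration of how such test data can be constructed, the following C++ sketch generates a data set with an exact sum of zero in the spirit of Data 3 (pairwise negated random values, then shuffled). The concrete exponent range is passed as a parameter, since the range used in the benchmark is not restated here; the function name and construction are chosen for this sketch only.

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

std::vector<double> zero_sum_data (std::size_t n, int e_min, int e_max,
                                   std::mt19937_64 &gen)
{
  std::uniform_real_distribution<double> significand (1.0, 2.0);
  std::uniform_int_distribution<int> exponent (e_min, e_max);
  std::vector<double> x;
  x.reserve (2 * n);
  for (std::size_t i = 0; i < n; i++)
    {
      double v = std::ldexp (significand (gen), exponent (gen));
      x.push_back (v);       // random significand and exponent
      x.push_back (-v);      // its negation keeps the exact sum at zero
    }
  std::shuffle (x.begin (), x.end (), gen);   // destroy the pairwise ordering
  return x;
}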
For time measurement the clock() function [ISO-IEC-9899-2011] (Chapter 7.27.2.1) [ISO-IEC-14882-2011] (Chapter 20.11.8) is used. To keep the time measurement as accurate as possible, all memory operations like array creation and destruction should be kept outside of the time measuring code blocks. On the other hand, if the size of the input data N is chosen too small, the measured time becomes too inaccurate. This requires a certain number of repetitions R to obtain detectable results. But some algorithms like SumK and iFastSum operate in place on the input data, thus providing a single copy of the data will not suffice to get identical initial conditions for each repetition. To meet all these constraints, one large array containing R copies of the N elements is created, each copy with a leading zero element, as iFastSum and BucketSum imitate Fortran indexing. The system's available main memory creates another constraint on the maximum test case size. The product of repetitions and data length should not exceed the test system's 8 GB of main memory, otherwise the timings become inaccurate due to swapping to hard disk. This means that for R = 1 repetitions the theoretical maximum test case size is determined by the available memory.
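A minimal sketch of this measurement scheme is shown below. The summation routine is passed as a function pointer with a hypothetical signature (the actual benchmark interfaces are not restated here); the R repetitions operate on disjoint copies of the N addends, each preceded by a leading zero element.

#include <ctime>
#include <vector>

// copies holds R * (N + 1) doubles: R copies of the N addends,
// each copy prefixed with a zero element (Fortran-style indexing).
double measure_seconds (double (*sum_fn) (double *, long),
                        std::vector<double> &copies, long N, long R)
{
  std::clock_t start = std::clock ();
  for (long r = 0; r < R; r++)
    sum_fn (&copies[r * (N + 1)], N);
  std::clock_t stop = std::clock ();
  return (double) (stop - start) / CLOCKS_PER_SEC;
}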
Experimental test runs revealed that a certain minimum number of summed elements is necessary in order to obtain detectable results. Therefore the following data lengths and repetitions are defined:
- Middle dimension: elements with repetitions
- Large dimension: elements with repetitions
The benchmarks (see Figures above) show that, among the accurate algorithms, BucketSum performs best for all given kinds of data. BucketSum is by a factor of 2-3 slower than the Ordinary Recursive Summation and is slightly faster than SumK (with K = 2). For middle and large data lengths BucketSum scales linearly, in contrast to OnlineExactSum, which only starts to scale linearly beyond a certain data length. Another interesting observation is that, for small data lengths, OnlineExactSum is dependent on the condition of the input data. For iFastSum, OnlineExactSum and BucketSum the results have been compared to those of the C-XSC toolbox using an assert() [ISO-IEC-14882-2011] (Chapter 19.3) statement, thus any inaccurate result would have interrupted the benchmark. As no interruptions occurred, all three algorithms are assumed to deliver correctly rounded sums. The most important properties of the algorithms under test are summarized in Table Comparison of summation algorithms for input data length N, which is a modified extension of [Hayes2010].
Footnotes
[1] http://www2.math.uni-wuppertal.de/wrswt/xsc/cxsc_new.html
[2] For simplicity, integer operations are counted as FLOPs.