Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?

2309

I am doing some numerical optimization on a scientific application. One thing I noticed is that GCC will optimize the call pow(a,2) by compiling it into a*a, but the call pow(a,6) is not optimized and will actually call the library function pow, which greatly slows down the performance. (In contrast, Intel C++ Compiler, executable icc, will eliminate the library call for pow(a,6).)

What I am curious about is that when I replaced pow(a,6) with a*a*a*a*a*a using GCC 4.5.1 and options "-O3 -lm -funroll-loops -msse4", it uses 5 mulsd instructions:

movapd  %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13

while if I write (a*a*a)*(a*a*a), it will produce

movapd  %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm13, %xmm13

which reduces the number of multiply instructions to 3. icc has similar behavior.

Why do compilers not recognize this optimization trick?

Stanzel answered 21/6, 2011 at 18:49 Comment(11)
What does "recognizing pow(a,6)" mean?Lessen
I am surprised gcc does not optimize this. The 1970s FORTRAN compiler I used on CDC Cyber did this sort of transformation, even without selecting optimization. I think the Unix V6 (c. 1978) C compiler did it when optimization was enabled, though many of the optimizations it did were to save code space, a precious commodity in those days.Serviette
Um... you know that a*a*a*a*a*a and (a*a*a)*(a*a*a) are not the same with floating point numbers, don't you? You'll have to use -funsafe-math or -ffast-math or something for that.Etheline
I suggest you read "What Every Computer Scientist Should Know About Floating Point Arithmetic" by David Goldberg: download.oracle.com/docs/cd/E19957-01/806-3568/… after which you'll have a more complete understanding of the tar pit that you've just walked into!Bumpy
A perfectly reasonable question. 20 yrs ago I asked the same general question, and by crushing that single bottleneck, reduced the execution time of a Monte Carlo simulation from 21 hours to 7 hours. The code in the inner loop was executed 13 trillion times in the process, but it got the simulation into an over-night window. (see answer below)Unconstitutional
Maybe throw (a*a)*(a*a)*(a*a) into the mix, too. Same number of multiplications, but probably more accurate.Commonplace
To begin with, how this is optimized greatly depends on what type a has...Primaveria
Shameless plug: in addition to Goldberg's paper, I suggest reading mine, hal.archives-ouvertes.fr/file/index/docid/281429/filename/…Meza
@Rok actually, a good optimizer could take that a step further. a*a only needs to be done once. That result could be reused easily enough to reduce it to only 3 multiplication operations.Bluster
@penguin359: Yes, 3, exactly the same as (a*a*a)*(a*a*a), this is what I have suggested is as an alternative. What are you trying to say?Commonplace
@RokKralj extremely good comment; and question creator: extremely valuable question;Wyrick
2920

Because Floating Point Math is not Associative. The way you group the operands in floating point multiplication has an effect on the numerical accuracy of the answer.

As a result, most compilers are very conservative about reordering floating point calculations unless they can be sure that the answer will stay the same, or unless you tell them you don't care about numerical accuracy. For example: the -fassociative-math option of gcc which allows gcc to reassociate floating point operations, or even the -ffast-math option which allows even more aggressive tradeoffs of accuracy against speed.

Maness answered 21/6, 2011 at 18:50 Comment(8)
Yes. With -ffast-math it is doing such optimization. Good idea! But since our code cares more about accuracy than speed, it might be better not to pass it.Stanzel
IIRC C99 allows the compiler to do such "unsafe" FP optimizations, but GCC (on anything other than the x87) makes a reasonable attempt at following IEEE 754 - it's not "error bounds"; there is only one correct answer.Vercelli
The implementation details of pow are neither here nor there; this answer doesn't even reference pow.Sorbose
@Lohoris, the basis of my argument was that any convergence routine, dependent on the same floating point hardware as a series of multiplies, could not possibly be more accurate. This turns out not to be true, because the convergence routines cheat, in a good way. They interpolate between precomputed table values that lie on powers of 2 boundaries. The error in interpolation is therefore very small. I will also tell you that for small powers, like 6, an integer power function does just as well. Still, if you were a compiler writer, I'm sure you'd put those optimizations in pow().Unconstitutional
So my question is.. does that mean that the Intel Compiler is performing the optimization at the expense of accuracy and correctness? Or does it find another way to optimize while ensuring correctness?Geny
@nedR: ICC defaults to allowing re-association. If you want to get standard-conforming behavior, you need to set -fp-model precise with ICC. clang and gcc default to strict conformance w.r.t. reassociation.Sorbose
@xis, it's not really that -fassociative-math would be inaccurate; it's just that a*a*a*a*a*a and (a*a*a)*(a*a*a) are different. It's not about accuracy; it's about standards conformance and strictly repeatable results, e.g. same results on any compiler. Floating point numbers are already not exact. It is seldom inappropriate to compile with -fassociative-math.Moot
If you want accuracy, prefer (a*a*a)*(a*a*a). Possible reasons are more balanced sizes of involved operands, a*a*a*a*a >> a (>> much greater than), and fewer operations, reducing the number of truncations.Roughish
712

Lambdageek correctly points out that because associativity does not hold for floating-point numbers, the "optimization" of a*a*a*a*a*a to (a*a*a)*(a*a*a) may change the value. This is why it is disallowed by C99 (unless specifically allowed by the user, via compiler flag or pragma). Generally, the assumption is that the programmer wrote what she did for a reason, and the compiler should respect that. If you want (a*a*a)*(a*a*a), write that.

That can be a pain to write, though; why can't the compiler just do [what you consider to be] the right thing when you use pow(a,6)? Because it would be the wrong thing to do. On a platform with a good math library, pow(a,6) is significantly more accurate than either a*a*a*a*a*a or (a*a*a)*(a*a*a). Just to provide some data, I ran a small experiment on my Mac Pro, measuring the worst error in evaluating a^6 for all single-precision floating-point numbers in [1,2):

worst relative error using    powf(a, 6.f): 5.96e-08
worst relative error using (a*a*a)*(a*a*a): 2.94e-07
worst relative error using     a*a*a*a*a*a: 2.58e-07

Using pow instead of a multiplication tree reduces the error bound by a factor of 4. Compilers should not (and generally do not) make "optimizations" that increase error unless licensed to do so by the user (e.g. via -ffast-math).

Note that GCC provides __builtin_powi(x,n) as an alternative to pow( ), which should generate an inline multiplication tree. Use that if you want to trade off accuracy for performance, but do not want to enable fast-math.

Sorbose answered 22/6, 2011 at 15:32 Comment(14)
Note also that Visual C++ provides an 'enhanced' version of pow(). By calling _set_SSE2_enable(<flag>) with flag=1, it will use SSE2 if possible. This reduces accuracy by a bit, but improves speeds (in some cases). MSDN: _set_SSE2_enable() and pow()Eluviation
@xis19: With a good math library, the same will hold for double (and actually, for any supported floating-point type).Sorbose
@TkTech: using SSE2 need not necessarily reduce accuracy, even. Most modern math libraries on x86 use SSE when it's available, and many of them deliver very accurate results.Sorbose
@Stephen: The MSDN documentation itself notes that there may be reduced accuracy when SSE2 instructions are used (as they are by default) due to the intermediate registers being 80bit on the FPU and 64bit when using SSE2.Eluviation
@TkTech: Any reduced accuracy is due to Microsoft's implementation, not the size of the registers used. It's possible to deliver a correctly-rounded pow using only 32-bit registers, if the library writer is so motivated. There are SSE-based pow implementations that are more accurate than most x87-based implementations, and there are also implementations that trade off some accuracy for speed.Sorbose
@Stephen: Sure, you can make a more accurate implementation using nothing but 8-bit registers and a little array. However, I was and still am specifically talking about Visual C++'s implementation of pow(). "Could" and "possible" are not "IS" :)Eluviation
@TkTech: Of course, I just wanted to make clear that the reduction in accuracy is due to the choices made by the library writers, not intrinsic to the use of SSE.Sorbose
I'm interested to know what you used as the "gold standard" here for calculating relative errors -- I would normally have expected it would be a*a*a*a*a*a, but that is apparently not the case! :)Haymo
@j_random_hacker: since I was comparing single-precision results, double-precision suffices for a gold standard — the error from a*a*a*a*a*a computed in double is vastly smaller than the error of any of the single-precision computations.Sorbose
@StephenCanon - Three comments. (a) +1. This is a very nice answer. (b) An even better gold standard: std::pow((long double)a,6). (c) There is a third way: use double precision for intermediate calculations, for example calling Szabolcs's power template function via power<6,double>(a). Now you get half an ULP accuracy (as a float result) but with only a smallish performance penalty (1.4 times longer than a*a*a*a*a*a as a float). Compare with the huge performance penalty (32.4 times longer on my machine) that results from calling std::pow(float,float).Swingletree
@DavidHammen Long Double would do nothing on MSVC's implementation, and more than one other; they type pun long double to just double. You'd need to make sure long double was well supported before saying it's a better gold standard.Insectarium
Maybe throw (a*a)*(a*a)*(a*a) into the mix, too. Same number of multiplications, but probably more accurate.Commonplace
"Generally, the assumption is that the programmer wrote what she did for a reason, and the compiler should respect that. If you want (a*a*a)*(a*a*a), write that." On this reasoning, everything could/should be forgotten since the first macro-capable assemblers...Arching
As an exercise I performed the same experiment using the exact result as the reference and of course got the same results. __builtin_powi calculates the number correctly to the nearest value. (a*a)*(a*a)*(a*a) isn't much different from (a*a*a)*(a*a*a).Notional
194

Another similar case: most compilers won't optimize a + b + c + d to (a + b) + (c + d) (this is an optimization since the second expression can be pipelined better) and evaluate it as given (i.e. as (((a + b) + c) + d)). This too is because of corner cases:

float a = 1e35, b = 1e-5, c = -1e35, d = 1e-5;
printf("%e %e\n", a + b + c + d, (a + b) + (c + d));

This outputs 1.000000e-05 0.000000e+00

Brigand answered 22/6, 2011 at 22:39 Comment(6)
This is not exactly the same. Changing the order of multiplications/divisions (excluding division by 0) is safer than changing the order of sums/subtractions. In my humble opinion, the compiler should try to associate mults./divs. because doing that reduces the total number of operations, and besides the performance gain there's also a precision gain.Tectonics
@DarioOO: It is no safer. Multiply and divide are the same as addition and subtraction of the exponent, and changing the order can easily cause temporaries to exceed the possible range of the exponent. (Not exactly the same, because the exponent doesn't suffer loss of precision... but the representation is still quite limited, and reordering can lead to unrepresentable values)Lamoree
I think you are missing some calculus background. Multiplying and dividing 2 numbers introduces the same amount of error, while adding/subtracting 2 numbers may introduce a bigger error, especially when the 2 numbers are orders of magnitude apart; hence it is safer to rearrange mul/div than add/sub because it introduces a smaller change in the final error.Tectonics
@DarioOO: the risk is different with mul/div: Reordering either makes a negligible change in the final result, or the exponent overflows at some point (where it wouldn't have before) and the result is massively different (potentially +inf or 0).Erenow
@GameDeveloper Imposing a precision gain in unpredictable ways is hugely problematic.Intromit
Interesting case! Because of the values chosen we have a + b == a and c + d == c, and of course a + c == 0; depending on the order used to add all 4 variables, the result could be 0, 1.000000e-05 or 2.000000e-05!Minny
90

Fortran (designed for scientific computing) has a built-in power operator, and as far as I know Fortran compilers will commonly optimize raising to integer powers in a similar fashion to what you describe. C/C++ unfortunately don't have a power operator, only the library function pow(). This doesn't prevent smart compilers from treating pow specially and computing it in a faster way for special cases, but it seems they do it less commonly ...

Some years ago I was trying to make it more convenient to calculate integer powers in an optimal way, and came up with the following. It's C++, not C though, and still depends on the compiler being somewhat smart about how to optimize/inline things. Anyway, hope you might find it useful in practice:

template<unsigned N> struct power_impl;

template<unsigned N> struct power_impl {
    template<typename T>
    static T calc(const T &x) {
        if (N%2 == 0)
            return power_impl<N/2>::calc(x*x);
        else if (N%3 == 0)
            return power_impl<N/3>::calc(x*x*x);
        return power_impl<N-1>::calc(x)*x;
    }
};

template<> struct power_impl<0> {
    template<typename T>
    static T calc(const T &) { return 1; }
};

template<unsigned N, typename T>
inline T power(const T &x) {
    return power_impl<N>::calc(x);
}

Clarification for the curious: this does not find the optimal way to compute powers, but since finding the optimal solution is an NP-complete problem and this is only worth doing for small powers anyway (as opposed to using pow), there's no reason to fuss with the detail.

Then just use it as power<6>(a).

This makes it easy to type powers (no need to spell out 6 as with parens), and lets you have this kind of optimization without -ffast-math in case you have something precision dependent such as compensated summation (an example where the order of operations is essential).

You can probably also forget that this is C++ and just use it in the C program (if it compiles with a C++ compiler).

Hope this can be useful.

EDIT:

This is what I get from my compiler:

For a*a*a*a*a*a,

    movapd  %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0

For (a*a*a)*(a*a*a),

    movapd  %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm0, %xmm0

For power<6>(a),

    mulsd   %xmm0, %xmm0
    movapd  %xmm0, %xmm1
    mulsd   %xmm0, %xmm1
    mulsd   %xmm0, %xmm1
Ashleighashlen answered 23/6, 2011 at 11:44 Comment(6)
Finding the optimal power tree might be hard, but since it is only interesting for small powers, the obvious answer is to precompute it once (Knuth provides a table up to 100) and use that hardcoded table (that's what gcc does internally for powi).Hyaloplasm
On modern processors, the speed is limited by latency. For example, the result of a multiplication might be available after five cycles. In that situation, finding the fastest way to create some power might be more tricky.Misbeliever
You could also try finding the power tree that gives the lowest upper bound for the relative rounding error, or the lowest average relative rounding error.Misbeliever
Boost has also support for this, e.g. boost::math::pow<6>(n); I think it even tries to reduce the number of multiplications by extracting common factors.Stithy
Note that the last one is equivalent to (a**2)**3Gratt
It's one of the cases where Fortran made the right choice (compiler can use associativity unless the user uses parentheses, a well known notation to express evaluation order) whereas C made the wrong choice (there's no way to do associative math)Malamud
74

GCC does actually optimize a*a*a*a*a*a to (a*a*a)*(a*a*a) when a is an integer. I tried with this command:

$ echo 'int f(int x) { return x*x*x*x*x*x; }' | gcc -o - -O2 -S -masm=intel -x c -

There are a lot of gcc flags but nothing fancy. They mean: Read from stdin; use O2 optimization level; output assembly language listing instead of a binary; the listing should use Intel assembly language syntax; the input is in C language (usually language is inferred from input file extension, but there is no file extension when reading from stdin); and write to stdout.

Here's the important part of the output. I've annotated it with some comments indicating what's going on in the assembly language:

; x is in edi to begin with.  eax will be used as a temporary register.
mov  eax, edi  ; temp = x
imul eax, edi  ; temp = x * temp
imul eax, edi  ; temp = x * temp
imul eax, eax  ; temp = temp * temp

I'm using system GCC on Linux Mint 16 Petra, an Ubuntu derivative. Here's the gcc version:

$ gcc --version
gcc (Ubuntu/Linaro 4.8.1-10ubuntu9) 4.8.1

As other posters have noted, this optimization is not possible in floating point by default, because floating-point arithmetic is not associative.

Percolator answered 29/3, 2014 at 6:51 Comment(3)
This is legal for integer multiplication because two's complement overflow is undefined behaviour. If there's going to be an overflow, it will happen somewhere, regardless of reordering operations. So, expressions with no overflow evaluate the same, expressions that overflow are undefined behaviour so it's ok for the compiler to change the point at which overflow happens. gcc does this with unsigned int, too.Erenow
@PeterCordes: I think a better reason it's legal is that, unlike floating point multiplication, integer multiplication (mod n) is associative. Of course it's still undefined behavior to have a signed integral type overflow, but pretending it wasn't, you'd always get the same results from a*a*a*a*a*a and (a*a*a)*(a*a*a). (And of course for unsigned types the overflow isn't UB anyway.)Norrisnorrv
@DanielMcLaury: Oh, yes, I left that critical requirement un-stated. :P Apparently back in 2015 I thought everyone knew that already, or was talking about the possible UB that might be a worry after establishing that the actual integer result is the same. (OTOH, I think I recall seeing a case where GCC didn't optimize signed integer math the same as unsigned, because of some overly-conservative "don't introduce UB" logic which doesn't make sense when the final result is the same.)Erenow
51

Because a 32-bit floating-point number - such as 1.024 - is not 1.024. In a computer, 1.024 is an interval: from (1.024-e) to (1.024+e), where "e" represents an error. Some people fail to realize this and also believe that * in a*a stands for multiplication of arbitrary-precision numbers without there being any errors attached to those numbers. The reason why some people fail to realize this is perhaps the math computations they exercised in elementary school: working only with ideal numbers without errors attached, and believing that it is OK to simply ignore "e" while performing multiplication. They do not see the "e" implicit in "float a=1.2", "a*a*a" and similar C code.

Should the majority of programmers recognize (and be able to act on) the idea that the C expression a*a*a*a*a*a is not actually working with ideal numbers, the GCC compiler would then be FREE to optimize "a*a*a*a*a*a" into say "t=(a*a); t*t*t", which requires a smaller number of multiplications. But unfortunately, the GCC compiler does not know whether the programmer writing the code thinks that "a" is a number with or without an error. And so GCC will only do what the source code says - because that is what GCC sees with its "naked eye".

... once you know what kind of programmer you are, you can use the "-ffast-math" switch to tell GCC that "Hey, GCC, I know what I am doing!". This will allow GCC to convert a*a*a*a*a*a into a different piece of text - it looks different from a*a*a*a*a*a - but still computes a number within the error interval of a*a*a*a*a*a. This is OK, since you already know you are working with intervals, not ideal numbers.

Paediatrician answered 23/6, 2011 at 10:7 Comment(8)
Floating point numbers are exact. They're just not necessarily exactly what you expected. Moreover, the technique with epsilon is itself an approximation to how to tackle things in reality, because the true expected error is relative to the scale of the mantissa, i.e., you're normally up to about 1 LSB out, but that can increase with every operation performed if you're not careful so consult a numerical analyst before doing anything non-trivial with floating point. Use a proper library if you possibly can.Tuberosity
@DonalFellows: The IEEE standard requires that floating-point calculations yield the result that most accurately matches what the result would be if the source operands were exact values, but that does not mean they actually represent exact values. It is in many cases more helpful to regard 0.1f as being (1,677,722 +/- 0.5)/16,777,216, which should be displayed with the number of decimal digits implied by that uncertainty, than to regard it as exact quantity (1,677,722 +/- 0.5)/16,777,216 (which should be displayed to 24 decimal digits).Grania
@supercat: IEEE-754 is pretty clear on the point that floating-point data do represent exact values; clauses 3.2 - 3.4 are the relevant sections. You can, of course, choose to interpret them otherwise, just as you can choose to interpret int x = 3 as meaning that x is 3+/-0.5.Sorbose
@StephenCanon: I suppose it depends what you mean by "represent". In most applications, variables are used to model concrete things. In a physics simulation, for example, they may represent the X, Y, and Z components of various objects' positions and velocities, etc. If I say Distance = Math.Sqrt((x2-x1)*(x2-x1)+(y2-y1)*(y2-y1)+(z2-z1)*(z2-z1)), the purpose of Distance is to represent the Euclidean distance between (x1,y1,z1) and (x2,y2,z2). It's unlikely that the precise number stored in Distance will be the precise Euclidean distance between the two points, but...Grania
@supercat: I agree entirely, but that doesn't mean that Distance isn't exactly equal to its numerical value; it means that numerical value is only an approximation to some physical quantity being modeled.Sorbose
...nonetheless conventional usage would be to say that Distance represents that value, or perhaps Distance represents something that's for all practical purposes "close enough" to value, rather than explicitly stating that Distance represents the precise floating-point numerical value which would results from performing the aforementioned sequence of operations. From the point of view of the hardware performing the math primitives (multiplies, adds, sqrt, etc.) the quantities need to be evaluated exactly, but to the consumer they represent approximations.Grania
My point is that if code performs someSingle = 1.0/10.0, the result will be as precise as the consumer is going to expect; if code performs someDouble = 1.0f/10.0f, the result is going to be off by many orders of magnitude more than a consumer which knew the float quantities happened to represent precise values would be apt to expect. If a Double is cast to Float and never cast back, the user will have no surprises with regard to accuracy. Conversions from Float to Double, however, are far more likely to yield "surprises".Grania
For numerical analysis, your brain will thank you if you interpret floating point numbers not as intervals, but as exact values (which happen to be not exactly the values that you wanted). For example, if x is somewhere round 4.5 with an error less than 0.1, and you calculate (x + 1) - x, the "interval" interpretation leaves you with an interval from 0.8 to 1.2, while the "exact value" interpretation tells you the result will be 1 with an error of at most 2^(-50) in double precision.Misbeliever
44

No posters have mentioned the contraction of floating expressions yet (ISO C standard, 6.5p8 and 7.12.2). If the FP_CONTRACT pragma is set to ON, the compiler is allowed to regard an expression such as a*a*a*a*a*a as a single operation, as if evaluated exactly with a single rounding. For instance, a compiler may replace it by an internal power function that is both faster and more accurate. This is particularly interesting as the behavior is partly controlled by the programmer directly in the source code, while compiler options provided by the end user may sometimes be used incorrectly.

The default state of the FP_CONTRACT pragma is implementation-defined, so that a compiler is allowed to do such optimizations by default. Thus portable code that needs to strictly follow the IEEE 754 rules should explicitly set it to OFF.

If a compiler doesn't support this pragma, it must be conservative by avoiding any such optimization, in case the developer has chosen to set it to OFF.

GCC doesn't support this pragma, but with the default options, it assumes it to be ON; thus for targets with a hardware FMA, if one wants to prevent the transformation a*b+c to fma(a,b,c), one needs to provide an option such as -ffp-contract=off (to explicitly set the pragma to OFF) or -std=c99 (to tell GCC to conform to some C standard version, here C99, thus follow the above paragraph). In the past, the latter option was not preventing the transformation, meaning that GCC was not conforming on this point: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37845

Ensor answered 27/6, 2014 at 21:3 Comment(5)
Long-lived popular questions sometimes show their age. This question was asked and answered in 2011, when GCC could be excused for not respecting exactly the then recent C99 standard. Of course now it's 2014, so GCC… ahem.Desdemona
Shouldn't you be answering comparatively recent floating-point questions without an accepted answer instead, though? cough stackoverflow.com/questions/23703408 coughDesdemona
I find it... disturbing that gcc does not implement C99 floating-point pragmas.Meza
@DavidMonniaux pragmas are by definition optional to implement.Allopathy
@TimSeguine But if a pragma is not implemented, its default value needs to be the most restrictive for the implementation. I suppose that's what David was thinking about. With GCC, this is now fixed for FP_CONTRACT if one uses an ISO C mode: it still does not implement the pragma, but in an ISO C mode, it now assumes that the pragma is off.Ensor
33

Library functions like "pow" are usually carefully crafted to yield the minimum possible error (in the generic case). This is usually achieved by approximating functions with splines (according to Pascal's comment, the most common implementation seems to use the Remez algorithm).

Fundamentally, the following operation:

pow(x,y);

has an inherent error of approximately the same magnitude as the error in a single multiplication or division.

While the following operation:

float a=someValue;
float b=a*a*a*a*a*a;

has an inherent error that is more than 5 times the error of a single multiplication or division (because it combines 5 multiplications).

The compiler has to be really careful about the kind of optimization it is doing:

  1. if optimizing pow(a,6) to a*a*a*a*a*a it may improve performance, but drastically reduce the accuracy for floating point numbers.
  2. if optimizing a*a*a*a*a*a to pow(a,6) it may actually reduce the accuracy because "a" was some special value that allows multiplication without error (a power of 2 or some small integer number)
  3. if optimizing pow(a,6) to (a*a*a)*(a*a*a) or (a*a)*(a*a)*(a*a) there still can be a loss of accuracy compared to pow function.

In general you know that for arbitrary floating point values "pow" has better accuracy than any function you could eventually write, but in some special cases multiple multiplications may have better accuracy and performance; it is up to the developer to choose what is more appropriate, eventually commenting the code so that no one else would "optimize" it.

The only optimization that makes sense (personal opinion, and apparently a choice in GCC without any particular optimization or compiler flag) is replacing "pow(a,2)" with "a*a". That would be the only sane thing a compiler vendor should do.

Tectonics answered 3/1, 2015 at 16:40 Comment(4)
downvoters should realize that this answer is perfectly fine. I can quote dozens of sources and pieces of documentation to support my answer, and I'm probably more involved with floating point precision than any downvoter would be. It is perfectly reasonable on StackOverflow to add missing information that other answers do not cover, so be polite and explain your reasons.Tectonics
It seems to me that Stephen Canon's answer covers what you have to say. You seem to insist that libms are implemented with splines: they more typically use argument reduction (depending of the function being implemented) plus a single polynomial the coefficients of which have been obtained by more or less sophisticated variants of the Remez algorithm. Smoothness at junction points is not considered an objective worth pursuing for libm functions (if they end up accurate enough, they are automatically quite smooth anyway regardless of how many pieces the domain was split into).Desdemona
The second half of your answer completely misses the point that compilers are supposed to produce code that implements what the source code says, period. Also you use the word “precision” when you mean “accuracy”.Desdemona
Thanks for your input, I slightly corrected the answer, something new is still present in the last 2 lines ^^Tectonics
32

As Lambdageek pointed out, float multiplication is not associative and you can get less accuracy; but even when you would get better accuracy, you can argue against the optimisation, because you want a deterministic application. For example in game simulation client/server, where every client has to simulate the same world, you want floating point calculations to be deterministic.

Selfsame answered 23/6, 2011 at 12:44 Comment(6)
@Insectarium - only when the compiler doesn't reorder things, possibly in different ways according to the compiler version, target machine, etc.Eldoria
@Eldoria No, it's still deterministic then. No randomness is added in any sense of the word.Insectarium
@Insectarium It seems fairly clear Bjorn here is using 'deterministic' in the sense of the code giving the same result on different platforms and different compiler versions etc (external variables which may be beyond the control of the programmer) -- as opposed to lack of actual numeric randomness at run time. If you are pointing out that this isn't a proper use of the word, I'm not going to argue with that.Eldoria
@Eldoria Except even in your interpretation of what he says, it's still wrong; that's the entire point of IEEE 754, to provide identical characteristics for most (if not all) operations across platforms. Now, he made no mention of platforms or compiler versions, which would be a valid concern if you want every single operation on every remote server/client to be identical....but this isn't obvious from his statement. A better word might be "reliably similar" or something.Insectarium
@Insectarium you're wasting everybody's time, including your own, by arguing semantics. His meaning was clear.Handbook
@Handbook The entire point of standards IS semantics; his meaning was decidedly not clear.Insectarium
27

I would not have expected this case to be optimized at all. It can't be very often that an expression contains subexpressions that can be regrouped to remove entire operations. I would expect compiler writers to invest their time in areas which would be more likely to result in noticeable improvements, rather than covering a rarely encountered edge case.

I was surprised to learn from the other answers that this expression could indeed be optimized with the proper compiler switches. Either the optimization is trivial, or it is an edge case of a much more common optimization, or the compiler writers were extremely thorough.

There's nothing wrong with providing hints to the compiler as you've done here. It's a normal and expected part of the micro-optimization process to rearrange statements and expressions to see what differences they will bring.

While the compiler may be justified in considering the two expressions to deliver inconsistent results (without the proper switches), there's no need for you to be bound by that restriction. The difference will be incredibly tiny - so much so that if the difference matters to you, you should not be using standard floating point arithmetic in the first place.

Strickler answered 21/6, 2011 at 18:52 Comment(1)
As noted by another commenter, this is untrue to the point of being absurd; the difference could be as much as half to 10% of the cost, and if run in a tight loop, that will translate to many instructions wasted to get what could be an insignificant amount of additional precision. Saying you shouldn't be using standard FP when you are doing a monte carlo is sort of like saying you should always use an airplane to get across country; it ignores many externalities. Finally, this is NOT an uncommon optimization; dead code analysis and code reduction/refactor is very common.Insectarium
25

There are already a few good answers to this question, but for the sake of completeness I wanted to point out that the applicable section of the C standard is 5.1.2.2.3/15 (which is the same as section 1.9/9 in the C++11 standard). This section states that operators can only be regrouped if they are really associative or commutative.

Mccary answered 1/10, 2013 at 19:33 Comment(1)
Or you can just say that the compiler must perform these operations from left to right, that is take a, multiply by a five times, but the compiler has the freedom to substitute any code that produces exactly the same result. For example, a+a+a = 3*a (which always gives the same result).Misbeliever
16

gcc actually can do this optimization, even for floating-point numbers. For example,

double foo(double a) {
  return a*a*a*a*a*a;
}

becomes

foo(double):
    mulsd   %xmm0, %xmm0
    movapd  %xmm0, %xmm1
    mulsd   %xmm0, %xmm1
    mulsd   %xmm1, %xmm0
    ret

with -O -funsafe-math-optimizations. This reordering violates IEEE-754, though, so it requires the flag.

Signed integers, as Peter Cordes pointed out in a comment, can do this optimization without -funsafe-math-optimizations since it holds exactly when there is no overflow and if there is overflow you get undefined behavior. So you get

foo(long):
    movq    %rdi, %rax
    imulq   %rdi, %rax
    imulq   %rdi, %rax
    imulq   %rax, %rax
    ret

with just -O. For unsigned integers, it's even easier since they work mod powers of 2 and so can be reordered freely even in the face of overflow.

Tulley answered 16/6, 2016 at 18:44 Comment(1)
Godbolt link with double, int and unsigned. gcc and clang both optimize all three the same way (with -ffast-math)Erenow

© 2022 - 2024 — McMap. All rights reserved.