Can I make my compiler use fast-math on a per-function basis?
Suppose I have

template <bool UsesFastMath> void foo(float* data, size_t length);

and I want to compile one instantiation with -ffast-math (--use_fast_math for nvcc), and the other instantiation without it.

This can be achieved by instantiating each of the variants in a separate translation unit, and compiling each of them with a different command-line - with and without the switch.
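For concreteness, the separate-translation-unit approach might look like this (file and function bodies here are illustrative, not from the question):

```cpp
// foo_impl.inl -- shared implementation, #include-ed from both TUs
#include <cstddef>

template <bool UsesFastMath>
void foo(float* data, std::size_t length) {
    for (std::size_t i = 0; i < length; ++i)
        data[i] = data[i] / 3.0f;  // placeholder for the real work
}

// foo_fast.cpp -- compiled with:    g++ -O2 -ffast-math -c foo_fast.cpp
#include "foo_impl.inl"
template void foo<true>(float*, std::size_t);

// foo_precise.cpp -- compiled with: g++ -O2 -c foo_precise.cpp
#include "foo_impl.inl"
template void foo<false>(float*, std::size_t);
```

Linking both object files gives you both instantiations, each compiled under its own flags.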

My question is whether it's possible to indicate to popular compilers (*) to apply or not apply -ffast-math for individual functions - so that I'll be able to have my instantiations in the same translation unit.

Notes:

  • If the answer is "no", bonus points for explaining why not.
  • This is not the same question as this one, which is about turning fast-math on and off at runtime. I'm much more modest...

(*) by popular compilers I mean any of: gcc, clang, msvc, icc, nvcc (for GPU kernel code) about which you have that information.

Coincident answered 19/11, 2016 at 23:34 Comment(2)
nvcc: No. Compilation flags are applied on a per-compilation-unit basis. No equivalent function attributes exist to apply this on a per-function basis. If you want different flags applied, stick the code into different compilation units (you can include the source code from the same file, if you desire). For tight local control, various CUDA device intrinsics (or worst case, some inline assembly) can provide much of what you need.Kaufman
I have supplied an answer as suggestedKaufman

As of CUDA 7.5 (the latest version I am familiar with, although CUDA 8.0 is currently shipping), nvcc does not support function attributes that allow programmers to apply specific compiler optimizations on a per-function basis.

Since optimization configurations set via command line switches apply to the entire compilation unit, one possible approach is to use as many different compilation units as there are different optimization configurations, as already noted in the question; source code may be shared and #include-ed from a common file.

With nvcc, the command line switch --use_fast_math basically controls three areas of functionality:

  • Flush-to-zero mode is enabled (that is, denormal support is disabled)
  • Single-precision reciprocal, division, and square root are switched to approximate versions
  • Certain standard math functions are replaced by equivalent, lower-precision intrinsics

You can apply some of these changes with per-operation granularity by using appropriate intrinsics, others by using PTX inline assembly.
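As a sketch of the per-operation route (the kernel itself is made up for illustration; __fdividef and __expf are the documented single-precision fast-math intrinsics for division and exponential):

```cuda
__global__ void scale_exp(float* data, int n, float divisor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Fast, reduced-accuracy division -- what --use_fast_math would
        // apply to ordinary '/' across the whole compilation unit.
        float q = __fdividef(data[i], divisor);
        // __expf is the lower-precision intrinsic that --use_fast_math
        // substitutes for expf().
        data[i] = __expf(q);
    }
}
```

This way the rest of the kernel keeps full-precision semantics, and only the marked operations use the fast approximations.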

Kaufman answered 21/11, 2016 at 18:56 Comment(1)
It's the same for CUDA 8.0 AFAICT.Coincident

In GCC you can declare functions like the following:

__attribute__((optimize("-ffast-math")))
double
myfunc(double val)
{
    return val / 2;
}

This is a GCC-only feature.

See a working example here -> https://gcc.gnu.org/ml/gcc/2009-10/msg00385.html

Note that GCC does not appear to verify the optimize() arguments, so a typo like "-ffast-match" is silently ignored.

Selfcongratulation answered 20/11, 2016 at 10:7 Comment(2)
"GCC-only feature" - that just means it doesn't follow from any public standard, right? Or are you also saying that, say, clang doesn't have such a feature?Coincident
I think it implies both at the same timeSorenson

© 2022 - 2024 — McMap. All rights reserved.