GCC PowerPC avoiding .rodata section for floats

Asked 19/8, 2017 at 10:45 Answered 20/8, 2017 at 22:17

Solved c gcc assembly floating-point powerpc

I'm writing C code and compiling it for the PowerPC architecture. The code contains floating-point constants which I want placed in the .text section instead of .rodata, so that the function code is self-contained.

The problem is that on PowerPC the only way to get a floating-point value into a floating-point register is to load it from memory. This is a restriction of the instruction set.

To convince GCC to help me, I tried declaring the floats as static const: no difference. Using pointers gave the same result. Applying __attribute__((section(".text"))) to the function also gave the same result, and applying it to each floating-point variable individually produced this error:

error: myFloatConstant causes a section type conflict with myFunction

I also tried disabling optimizations with #pragma GCC push_options, #pragma GCC optimize("O0") and #pragma GCC pop_options. Together with pretending the constant is an unsigned int, this worked:

unsigned int *myFloatConstant = (unsigned int *) (0x11000018);
*myFloatConstant = 0x4C000000;

Using the float:

float theActualFloat = *(float *) myFloatConstant;

I would still like to keep -O3, but then .rodata is used again. Since this happens from -O1 upwards, a good answer would also say which optimization flag causes the floats to be placed in .rodata.

Ideally, I could use floats "normally" in the code with maximum optimization and they would never be placed in .rodata at all.

What I imagine GCC could do is place the float constant in between the code (mixing data and code), load from that location into a floating-point register, and continue. I believe this can be written by hand, but how do I make GCC do it? Forcing the attribute per variable causes the error above, but technically this should be feasible.

Mislay answered 19/8, 2017 at 10:45 Comment(7)

The POWER ABI is a bit funny; see man gcc and the POWER -msdata option in particular. On the GCC dev mailing list, someone mentioned that adding -G 0 to gcc options "fixes" this; could you try that and report whether that makes gcc do what you prefer? – Catachresis 19/8, 2017 at 11:21

Why do you want the code to be "self contained?" – Silicosis 19/8, 2017 at 11:34

@Silicosis I guess to optimize cache usage, reduce TLB misses, cache faults etc? – Sassafras 19/8, 2017 at 11:43

@Silicosis Maybe the code is used in a non-standard way (e.g. code injected into microcontroller RAM and executed there) which requires it to be fully position-independent. – Korwun 19/8, 2017 at 11:51

@AnttiHaapala: Modern CPUs (including x86 and PowerPC) have split L1 caches, and separate first-level TLBs, for instructions and data. Loading data from the same cache line that's currently executing can hit in L2, though. (Unless your L2 is exclusive with L1D, like on AMD Bulldozer-family). The dTLB can still miss, too. It's common on x86 for the L2 TLB to hold evicted entries from iTLB and dTLB, but the entry for the current page will be in the iTLB, and there's no reason to expect it to be in the L2 TLB, so the dTLB may well trigger a page walk. – Exmoor 21/8, 2017 at 0:14

Wasting L1I cache footprint on data, and wasting L1D cache footprint on code, is usually not a good idea. Other than that, it's not worse than separate data, but it's probably not much better. However, having your data in the actual .text section near your function isn't inherently bad, and doesn't waste anything if they're in separate cache lines. If it's in the cache-line after, maybe L2 prefetch will even bring in the data before it's demand-loaded. – Exmoor 21/8, 2017 at 0:16

(writing near executing code is slow on some CPUs. This may only affect x86, not PPC, because x86 has I-cache coherent with data cache. But read-only data won't cause self-modifying-code pipeline flushes or other nasty effects.) – Exmoor 21/8, 2017 at 0:25

Using GCC 7.1.0 powerpc-eabi (cross compiler under Linux) the following code worked for me:

float test(void)
{
    int x;
    volatile float y;
    float theActualFloat;

    *(float *)&x = 1.2345f;
    *(int *)&y = x;
    theActualFloat = y;

    return theActualFloat;
}

Resulting assembly code:

test:
    stwu 1,-24(1)
    lis 9,0x3f9e
    ori 9,9,0x419
    stw 9,8(1)
    lfs 1,8(1)
    addi 1,1,24
    blr

Explanation:

In the line *(float *)&x = value you write to an integer, and the compiler optimizes this into an integer operation that does not access a floating-point value in .rodata.

The line *(int *)&y = x is a pure integer operation anyway.

The line theActualFloat = y cannot be optimized away because of the volatile, so the compiler has to write the integer to the variable on the stack and read the result back from that variable.
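To see where the two immediates in the assembly above come from, here is a minimal sketch (not part of the original answer) that extracts the bit pattern of 1.2345f and splits it into the halves used by lis and ori. The helper names float_bits, high_half and low_half are made up for illustration, and the sketch assumes a 32-bit IEEE-754 float.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float is a 32-bit IEEE-754 single, as on PowerPC. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Hypothetical helper: return the raw bit pattern of a float.
   memcpy is used so that no aliasing rules are broken. */
static uint32_t float_bits(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* Upper 16 bits: the immediate that "lis" places in the register. */
static uint16_t high_half(uint32_t bits)
{
    return (uint16_t)(bits >> 16);
}

/* Lower 16 bits: the immediate that "ori" merges into the register. */
static uint16_t low_half(uint32_t bits)
{
    return (uint16_t)(bits & 0xFFFFu);
}
```

For 1.2345f the pattern is 0x3F9E0419, which matches lis 9,0x3f9e followed by ori 9,9,0x419 in the listing.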

Korwun answered 19/8, 2017 at 11:50 Comment(3)

This is fine but what about a generic function/macro to return a float instead of duplicating it entirely and changing the *(float *)&x = 1.2345f line? – Mislay 20/8, 2017 at 10:14

Type-punning with pointer-casts violates strict aliasing. Use a union if possible (which is guaranteed to work in C99 and later, as well as GNU89 and GNU C++). I guess you could make the union volatile to force the compiler to store to it with data from immediates, instead of optimizing away the compile-time constant. It would be nice if there was a way to get that without forcing the compiler to redo it every time after inlining this function into a loop, though. I guess that's not really relevant since the OP wants a stand-alone function. – Exmoor 21/8, 2017 at 0:28
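A rough sketch of the volatile-union idea from this comment, which could also serve as the generic helper asked about above. The name float_from_bits is hypothetical, the caller has to supply the IEEE-754 bit pattern by hand, and whether a given GCC version keeps the result out of .rodata is an assumption that should be checked in the generated assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: float is a 32-bit IEEE-754 single, as on PowerPC. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Hypothetical helper: build a float from its bit pattern.
   The union is volatile, so the compiler must perform the integer
   store and the float load instead of folding them into a
   compile-time float constant. */
static inline float float_from_bits(uint32_t bits)
{
    volatile union {
        uint32_t u;
        float f;
    } pun;

    pun.u = bits;  /* integer store, value can come from immediates */
    return pun.f;  /* float load from the same stack slot */
}
```

It would be used like float value = float_from_bits(0x3F9E0419u); to get 1.2345f. Note that the store/reload cost discussed in the next comment still applies.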

@Mislay and Martin: IIRC, PowerPC really doesn't like store/reload, especially between FPU and integer. I tried to google up something about this, but mostly found stuff like alex-simon.blogspot.ca/2010/04/load-hit-store.html which has some useful C++-programming suggestions but seems pretty fuzzy on the microarchitectural details. (Apparently at least some PowerPC uarches can do store-hit-load forwarding like x86 does, for integer store/reload, when the load isn't wider than the previous store. gcc.gnu.org/bugzilla/show_bug.cgi?id=71310) – Exmoor 21/8, 2017 at 0:45

I found another solution which avoids stack frame creation and .rodata usage but requires an absolute memory address to store the float in:

static inline volatile float *getFloatPointer(int address, float value) {
    float *pointer = (float *) address;
    *pointer = value;

    return pointer;
}

It is used like this:

volatile float *myFloat = getFloatPointer(0x12345678, 30.f);
printf("%f", *myFloat);

It is important not to create a local float variable and to use only volatile pointers; otherwise .rodata is used again.

Mislay answered 20/8, 2017 at 22:17 Comment(1)

You definitely want to do float foo = *myFloat; outside of a loop, because the compiler has to actually emit a load instruction every time you use *myFloat, because it's a pointer-to-volatile. You just need to stop the compiler from doing constant-propagation all the way to a compile-time constant float which it will put in .rodata. You don't need or want to stop it from keeping the constant in a register for your whole function. – Exmoor 21/8, 2017 at 0:52
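A small sketch of what this comment suggests: read through the pointer-to-volatile once, then work with an ordinary local copy. The function scale_all and its parameters are invented for illustration and do not appear in the answer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: multiply every element by a factor that is
   only reachable through a pointer-to-volatile. */
static void scale_all(float *data, size_t count,
                      const volatile float *factor_ptr)
{
    assert(data != NULL && factor_ptr != NULL);

    /* Exactly one volatile load. After this the compiler is free to
       keep the value in a floating-point register. */
    const float factor = *factor_ptr;

    for (size_t i = 0; i < count; ++i) {
        data[i] *= factor;  /* no reload of *factor_ptr per iteration */
    }
}
```

Writing data[i] *= *factor_ptr; inside the loop instead would force a load on every iteration, which is the cost the comment warns about.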
