ARM prefetch workaround
I have a situation where some of the address space is sensitive: if you read from it, you crash, because there is nothing at that address to respond. The problem sequence is:

pop {r3,pc}
bx r0

   0:   e8bd8008    pop {r3, pc}
   4:   e12fff10    bx  r0

   8:   bd08        pop {r3, pc}
   a:   4700        bx  r0

The bx was not created by the compiler as an instruction. Instead it is the result of a 32-bit constant that didn't fit as an immediate in a single instruction, so a pc-relative load is set up. This is basically the literal pool, and the constant happens to have bits that resemble a bx.

A test program easily reproduces the issue:

unsigned int more_fun ( unsigned int );
unsigned int fun ( void )
{
    return(more_fun(0x12344700)+1);
}

00000000 <fun>:
   0:   b510        push    {r4, lr}
   2:   4802        ldr r0, [pc, #8]    ; (c <fun+0xc>)
   4:   f7ff fffe   bl  0 <more_fun>
   8:   3001        adds    r0, #1
   a:   bd10        pop {r4, pc}
   c:   12344700    eorsne  r4, r4, #0, 14

What appears to be happening is that while the processor is waiting on data coming back from the pop (ldm), it moves on to the next instruction, bx r0 in this case, and starts a prefetch at the address in r0. Which hangs the ARM.

As humans we see the pop as an unconditional branch, but the processor does not; it keeps going through the pipe.

Prefetching and branch prediction are nothing new (we have the branch predictor off in this case); they are decades old and not limited to ARM. But the number of instruction sets that have the PC as a general-purpose register, with instructions that to some extent treat it as non-special, is small.

I am looking for a gcc command-line option to prevent this. I can't imagine we are the first ones to see this.

I can of course do this

-march=armv4t


00000000 <fun>:
   0:   b510        push    {r4, lr}
   2:   4803        ldr r0, [pc, #12]   ; (10 <fun+0x10>)
   4:   f7ff fffe   bl  0 <more_fun>
   8:   3001        adds    r0, #1
   a:   bc10        pop {r4}
   c:   bc02        pop {r1}
   e:   4708        bx  r1
  10:   12344700    eorsne  r4, r4, #0, 14

which prevents the problem.

Note this is not limited to Thumb mode; gcc can produce ARM code with the literal pool after the pop as well, for something like this:

unsigned int more_fun ( unsigned int );
unsigned int fun ( void )
{
    return(more_fun(0xe12fff10)+1);
}

00000000 <fun>:
   0:   e92d4010    push    {r4, lr}
   4:   e59f0008    ldr r0, [pc, #8]    ; 14 <fun+0x14>
   8:   ebfffffe    bl  0 <more_fun>
   c:   e2800001    add r0, r0, #1
  10:   e8bd8010    pop {r4, pc}
  14:   e12fff10    bx  r0

I am hoping someone knows a generic or ARM-specific option that either does an armv4t-like return (pop {r4,lr}; bx lr in ARM mode, for example) without the other baggage, or puts a branch-to-self immediately after a pop pc (which seems to solve the problem; the pipe is not confused about b as an unconditional branch).
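For reference, the branch-to-self idea would look something like this hand-written epilogue (my sketch in GNU assembler syntax, not actual compiler output):

```asm
    pop     {r4, pc}        @ function return
    b       .               @ never architecturally executed; decode sees an
                            @ unconditional branch and stops falling through
    .word   0x12344700      @ literal pool, now shielded from the pipe
```

The b . costs two bytes per return in Thumb mode, but guarantees the prefetcher never decodes the literal pool as instructions.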

EDIT

ldr pc,[something]
bx rn

also causes a prefetch, and that is not going to fall under -march=armv4t. gcc intentionally generates ldrls pc,[]; b somewhere for switch statements, and that is fine. I didn't inspect the backend to see if there are other ldr pc,[] instructions generated.

EDIT

Looks like ARM did report this as an erratum (erratum 720247, "Speculative Instruction fetches can be made anywhere in the memory map"); wish I had known that before we spent a month on it...

Wildon answered 8/9, 2017 at 14:33 Comment(15)
"(avoid the pop {pc}" - the parenthesis should close here, I guess? I.e. padding with nops would be fine for you. It's not 100% clear with the missing ")", but it doesn't make much sense why you wouldn't like padding. Thinking about it, a super-intelligent compiler would pad only when there is an accidental branch instruction in the data; otherwise the data may follow without extra padding. (And sorry, I have no idea if gcc contains anything to help you.)Wurth
What I'm wondering is: Doesn't ARM usually have the notion of uncacheable memory? If the SoC tries to preload unconnected addresses, something must be wrong with the tables that tell it which regions can be cached.Apostrophe
@Wurth re-wrote the question (again). I have not yet determined if, for example, a register-based ldr(b/h/d) instruction starts a read that ultimately hangs; there may be other instructions. A branch-to-self (a branch to the same address as the branch) after the pop so far solves the problem, but I would rather not have to use a custom gnu toolchain. Likewise, doing the armv4t thing gcc already does on a return with the pc would work fine; it is not confused about a bx.Wildon
@Apostrophe caching and an instruction fetch are two different things; the instruction fetch can go to any address (in this case I think it does either a 4-word or an 8-word read, aligned around the address in question). The cache/mmu is not going to block a fetch. I don't think the mmu has an instruction/data control, and that wouldn't work anyway, as you do both fetching and data accesses (the literal pool if nothing else) from .text.Wildon
It is the chip designer that determines what the amba/axi bus(ses) are connected to and how they respond, and up to the designer how much of the address space is covered, etc. In our case the arm is a small part of a bigger design; the whole address space of the arm is programmable, very much like pcie, where we can change various-sized chunks of space to point at the rest of the chip. But, like AXI, the other parts of the chip use a bus that doesn't time out (by design) if the programmer hits a space that has no target to respond.Wildon
Since the literal pool and registers can contain pretty much anything, you can't control where and how this will go off the rails, and you can't protect the address space. In order for this system to work we have to aim at various things and address them correctly, so there are always some sensitive holes to avoid. We instead need the tools to help not generate this.Wildon
This is an arm11 mpcore, which is a newer/different core than the arm1176 and prior arm11s. I can't get a pi-zero to break yet (it doesn't have sensitive areas, but there is at least one read-on-clear status register I can aim at), and I have not tried the newer (armv7) mpcores to see what they do. We can fall back to forcing armv4t; I don't know what baggage that carries, but if it is just an extra few percent of instructions, that is fine I guess...Wildon
I can't figure out why the answer isn't: bug the processor manufacturer for hardware that doesn't hang up. The processor is obviously doing something very wrong when provably unreachable code hangs.Dreyer
@Dreyer the support contract ran out a long time ago. We could attempt to figure it out, but the silicon has been in production a long time now. Looking at 10s of millions of dollars minimum; a few lines of code in gcc costs a tad less.Wildon
It is actually a gcc bug, as they are producing bad code for the target, just like any other illegal sequence. Because the last errata went out years ago, and we are not a big enough player to be on their radar, there is no reason for them to support us. This is pretty typical with gcc bugs; I have seen many that are simply ignored and left in place indefinitely. Par for the course.Wildon
It is not necessarily a processor bug; this is just how this processor works, the nature of the beast with a pipelined processor. It is hard enough to explain to most folks how these things work at all, much less when there is a nuance. It just wasn't documented properly, nor was there an errata in time to show this behavior. Again, this is somewhat typical with processors, and that is why the compiler and processor core need to be matched up. That isn't happening in this case; we had to go with -march=armv4t to solve the problem.Wildon
Kind of funny with the news of the day, meltdown and spectre: this is technically a speculative execution issue, wee tiny compared to the real/serious speculative execution in the cores that followed this one. But that is what is going on: it is starting to pre-execute an instruction in the pipe, causing a fetch, and then the side effect of that.Wildon
Could you give a reference to the erratum?Sinkage
720247: Speculative Instruction fetches can be made anywhere in the memory mapWildon
@old_timer: Thanks, interesting! I added the link to the question.Sinkage
https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html has a -mpure-code option, which doesn't put constants in code sections. "This option is only available when generating non-pic code for M-profile targets with the MOVT instruction." so it probably loads constants with a pair of mov-immediate instructions instead of from a constant-pool.
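Presumably the generated code then builds each constant in two halves rather than loading it, something like this (my sketch, not verified gcc output):

```asm
    movw    r0, #0x4700     @ r0 = 0x00004700 (low half)
    movt    r0, #0x1234     @ r0 = 0x12344700 (high half); no literal pool
```

Since no data word ever sits in the instruction stream, there is nothing for the pipeline to mis-decode as a bx.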

This doesn't fully solve your problem though, since speculative execution of regular instructions (after a conditional branch inside a function) with bogus register contents could still trigger access to unpredictable addresses. Or just the first instruction of another function might be a load, so falling through into another function isn't always safe either.


I can try to shed some light on why this is obscure enough that compilers don't already avoid it.

Normally, speculative execution of instructions that fault is not a problem. The CPU doesn't actually take the fault until it becomes non-speculative. Incorrect (or non-existent) branch prediction can make the CPU do something slow before figuring out the right path, but there should never be a correctness problem.

Normally, speculative loads from memory are allowed in most CPU designs. But memory regions with MMIO registers obviously have to be protected from this. In x86 for example, memory regions can be WB (normal, write-back cacheable, speculative loads allowed) or UC (Uncacheable, no speculative loads). Not to mention write-combining, write-through...

You probably need something similar to solve your correctness problem, to stop speculative execution from doing something that will actually explode. This includes speculative instruction-fetch triggered by a speculative bx r0. (Sorry I don't know ARM, so I can't suggest how you'd do that. But this is why it's only a minor performance problem for most systems, even though they have MMIO registers that can't be speculatively read.)

I think it's very unusual to have a setup that lets the CPU do speculative loads from addresses that crash the system instead of just raising an exception when / if they become non-speculative.


we have the branch predictor off in this case

This may be why you're always seeing speculative execution beyond an unconditional branch (the pop), instead of just very rarely.

Nice detective work with using a bx to return, showing that your CPU detects that kind of unconditional branch at decode, but doesn't check the pc bit in a pop. :/

In general, branch prediction has to happen before decode, to avoid fetch bubbles. Given the address of a fetch block, predict the next block-fetch address. Predictions are also generated at the instruction level instead of fetch-block level, for use by later stages of the core (because there can be multiple branch instructions in a block, and you need to know which one is taken).

That's the generic theory. Branch prediction isn't 100%, so you can't count on it to solve your correctness problem.


x86 CPUs can have performance problems where the default prediction for an indirect jmp [mem] or jmp reg is the next instruction. If speculative execution starts something that's slow to cancel (like div on some CPUs) or triggers a slow speculative memory access or TLB miss, it can delay execution of the correct path once it's determined.

So it's recommended (by optimization manuals) to put ud2 (illegal instruction) or int3 (debug trap) or similar after a jmp reg. Or better, put one of the jump-table destinations there so "fall-through" is a correct prediction some of the time. (If the BTB doesn't have a prediction, next-instruction is about the only sane thing it can do.)
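In x86 assembly that recommendation looks something like this (AT&T syntax, my illustration):

```asm
    jmp     *%rax           # indirect jump through a register
    ud2                     # illegal instruction: nothing useful can be
                            # speculatively decoded past this point
```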

x86 doesn't normally mix code with data, though, so this is more likely to be a problem for architectures where literal pools are common. (But loads from bogus addresses can still happen speculatively after indirect branches, or after mispredicted normal branches.)

e.g. if(address_good) { call table[address](); } could easily mispredict and trigger speculative code-fetch from a bad address. But if the eventual physical address range is marked uncacheable, the load request would stop in the memory controller until it was known to be non-speculative.


A return instruction is a type of indirect branch, but it's less likely that a next-instruction prediction is useful. So maybe bx lr stalls because speculative fall-through is less likely to be useful?

pop {pc} (aka LDMIA from the stack pointer) is either not detected as a branch in the decode stage (if that stage doesn't specifically check the pc bit), or it's treated as a generic indirect branch. There are certainly other use-cases for a load into pc as a non-return branch, so detecting it as a probable return would require checking the source-register encoding as well as the pc bit.

Maybe there's a special (internal hidden) return-address predictor stack that helps get bx lr predicted correctly every time, when paired with bl? x86 does this, to predict call/ret instructions.


Have you tested if pop {r4, pc} is more efficient than pop {r4, lr} / bx lr? If bx lr is handled specially in more than just avoiding speculative execution of garbage, it might be better to get gcc to do that, instead of having it lead its literal pool with a b instruction or something.

Appear answered 9/9, 2017 at 4:51 Comment(1)
Comments are not for extended discussion; this conversation has been moved to chat.Earnestineearnings
