compensating latency on ARM interrupts?

Asked 11/9, 2012 at 13:40 Answered 30/4, 2019 at 20:32

Solved arm counter interrupt stm32 low-latency

I'm working on a project on a STM32F4 CPU, generating signals.

I have a generic timer on CPU clock (no prescaler) on a STM32 triggering interrupts on overflow, to generate a periodic signal with GPIO afterwards.

I need to trigger the GPIO at a very precise time (basically down to one CPU cycle of precision). I've managed to reduce the jitter to +/-5 cycles by setting priorities and so on, but the jitter is still there and depends on what the CPU was doing when the interrupt arrived.

I need to compensate for these few cycles of jitter. Adding a few more cycles of latency isn't a problem as long as I toggle the GPIOs at a precise time.

My idea was to read the current value of the counter in the interrupt handler and busy-wait for FIXED_NUMBER - CURRENT_VALUE cycles, so that I would always exit the loop at the same instant relative to the timer event.
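The arithmetic behind that idea can be sketched as follows. This is only an illustration. It assumes the timer restarts from zero at overflow, so that the counter value read in the handler equals the number of cycles elapsed since the event. `cycles_to_pad` and its parameter names are hypothetical.

```c
#include <stdint.h>

/* Number of cycles still to burn so that the total latency equals
 * fixed_target. counter_now is the timer count read inside the handler
 * (assumed to be the cycles elapsed since the overflow event).
 * Returns 0 if the handler is already late. */
uint32_t cycles_to_pad(uint32_t counter_now, uint32_t fixed_target)
{
    return (counter_now < fixed_target) ? (fixed_target - counter_now) : 0u;
}
```

Computing the pad is the easy part. The difficulty discussed in the rest of this thread is burning exactly that many cycles.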

However, a simple loop in C, whether a for loop or a while(counter->value < TARGET), doesn't work: it ADDS jitter instead of reducing it.

Am I doing something wrong or naive? Should I do it in assembly? How would that be different from C? (I checked the GCC disassembly to make sure the loop was not optimized away and that I was not hitting memory.)

(I made sure the loop body was empty, not optimized away, and not hitting memory.)

Edit: for an example on AVR (much more stable, I know), see http://lucidscience.com/pro-vga%20video%20generator-7.aspx (search for "jitter").

Edit 2: I tried a simple loop in assembly such as the following (r0 is my counter, the number of cycles to wait, held in a register):

loop : SUBS r0,#1 ; tried with 2 also
       BGE loop

and, again, jitter is better without it.

To sum up, I already know how much I should delay. I just need a way to have a piece of code reliably consume N cycles in one case and M in another. Unfortunately, branches alone don't seem to work because a pipeline refill doesn't seem to take a reliable number of cycles, and conditional instructions don't either because they always take the same number of cycles (sometimes doing nothing).

Would running from RAM instead of flash improve consistency? (NB: the STM32F4 has a flash prefetch.)

Tillfourd answered 11/9, 2012 at 13:40 Comment(3)

You understand you are not going to get accuracy to a single cycle, yes? If you want that accuracy, have the timer feed the GPIO directly or use the timer output directly. It doesn't matter what processor you are using; they all tend to complete the current instruction before starting to handle the interrupt. The number of clock cycles to complete an instruction varies from instruction to instruction; if there is even a single exception you cannot meet your cycle timing. – Familist 11/9, 2012 at 13:43

C or any compiled language is definitely your enemy if you are looking for extreme performance or accuracy. The processor you are using can run as fast as 168 MHz or somewhere in that range with data and instruction caching. If you are nowhere near that speed (note that flash does not get any faster; you are still bound by it if you run from cache), then increase your speed, changing your requirement to be +/- many cycles. – Familist 11/9, 2012 at 13:47

To dwelch: yes, multiple length instructions will be interrupted, thus giving jitter. However, reading the value of the counter from the interrupt handler can tell me after the fact how long I should wait. At that point I shouldn't be interrupted, so it seems theoretically feasible to achieve a fixed latency after this wait loop? See for example lucidscience.com/pro-vga%20video%20generator-7.aspx : "The code [...]reads the value of the timer that triggered the interrupt, and then either skips or jumps based on certain values [...] to completely remove the interrupt jitter." – Tillfourd 11/9, 2012 at 14:44

(It is ironic that a question about reducing response latency took three years to get an answer.)

+/- 5 cycles sounds awfully familiar. You are likely hitting wait states accessing the Flash controller during interrupt dispatch.

The CPU needs to do three things during interrupt dispatch:

1. Load the vector table entry.
2. Load the initial code of your interrupt routine.
3. Write some of the registers out to the stack.

If your vector table and/or interrupt routine code are in Flash, the fetches in items 1 and 2 go to Flash. When running the CPU at its highest rated speeds (up to 168MHz), accesses to Flash entail five wait states. This means that an access to Flash can take either 1 or 6 cycles, depending on whether the data being requested is in the Flash cache. If you're seeing exactly 0 or 5 cycles of latency, this is a likely culprit. This problem is most easily fixed by moving the ISR code and the vector table into RAM. You can also "fix" it by disabling the Flash cache, which will cause Flash accesses to be predictably slow.
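A minimal sketch of the vector table half of that fix, under stated assumptions: `relocate_vector_table` is a hypothetical helper, and the vector-table offset register is passed in as a pointer so the code can be exercised off-target. On a Cortex-M4 that pointer would be the address of the core's VTOR register (`SCB->VTOR` in CMSIS). The RAM table must also satisfy the alignment the core requires for the number of vectors in use, and placing the ISR itself in RAM depends on your linker script; neither is shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy `count` vector entries from the Flash table to a RAM table, then
 * point the vector-table offset register at the RAM copy.
 * Assumptions: ram_table is suitably aligned for the core, and interrupts
 * are disabled by the caller while this runs. */
void relocate_vector_table(const uint32_t *flash_table,
                           uint32_t *ram_table,
                           size_t count,
                           volatile uint32_t *vtor)
{
    for (size_t i = 0; i < count; ++i) {
        ram_table[i] = flash_table[i];
    }
    /* On target, addresses are 32 bits wide; the cast only matters when
     * this sketch is compiled on a 64-bit host. */
    *vtor = (uint32_t)(uintptr_t)ram_table;
}
```

After this runs, the vector fetch in item 1 reads from RAM and no longer depends on whether the Flash cache happens to hold the entry.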

There is a sneakier factor that may also be biting you: if the code being interrupted is also using Flash, the interrupt dispatch may have to wait for its Flash accesses to complete, assuming it misses cache. You can fix this by also moving the interrupted code into RAM, but at this point it's starting to sound like nothing lives in Flash. There's a way to keep the code in Flash that I mention below.

Finally, there's a yet sneakier thing: if you have other interrupts that may occur right before your latency-sensitive interrupt, it is possible for that interrupt to get -5 cycles of latency due to tail chaining.

My solution to the last two problems I listed is a little weird: make sure the processor is idle, i.e. not taking another interrupt or fetching from Flash, when your interrupt occurs. The way I did this is by configuring a lower-priority interrupt to arrive just before my latency-sensitive interrupt (using a timer); that ISR simply executes a wait-for-interrupt instruction, wfi.
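A minimal sketch of that arrangement, not the actual m4vgalib code. `GUARD_CYCLES`, `idle_irq_compare`, and `idle_before_critical_isr` are hypothetical names, and the guard interval is an assumed value that would need tuning on real hardware. The helper assumes both events come from the same periodic timer, that `critical_compare < period`, and that `period > GUARD_CYCLES`.

```c
#include <stdint.h>

/* Assumed guard interval: how many timer ticks before the critical
 * interrupt the idle interrupt should fire. */
#define GUARD_CYCLES 32u

/* Compare value for the lower-priority "go idle" interrupt, placed
 * GUARD_CYCLES ticks before the critical compare value, wrapping
 * around the timer period. */
uint32_t idle_irq_compare(uint32_t critical_compare, uint32_t period)
{
    return (critical_compare + period - GUARD_CYCLES) % period;
}

#if defined(__arm__) && defined(__GNUC__)
/* Lower-priority ISR: park the core until the critical interrupt
 * preempts it. Execution resumes here afterwards and the ISR returns. */
void idle_before_critical_isr(void)
{
    __asm__ volatile ("wfi");
}
#endif
```

The guard only needs to be long enough to cover the idle interrupt's own entry latency plus whatever instruction or Flash access was in flight when it fired.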

These are surmountable problems. I disagree with the commenters that you need to abandon C and write in assembly language; my m4vgalib system contains almost no assembly language and has very low jitter.

I discuss these very same problems and my solutions in more detail in one section of an article on my blog.

Demott answered 12/6, 2015 at 17:20 Comment(0)

Cliff is correct, there is no way to get single-CPU-cycle accuracy on a CPU core with interrupts, flash wait states, and pipelines. AFAIK, the somewhat odd Parallax "Propeller" is one of the few "high performance" MCU-ish cores that can guarantee cycle time consistency, as it does not support interrupts (it instead has 8 cores with a "rotating" access hub).

Shan answered 30/4, 2019 at 20:32 Comment(0)
