How to obtain reliable Cortex M4 short delays

Asked 12/5, 2014 at 14:52 Answered 1/6, 2017 at 8:5

I am porting some code from an M3 to an M4 which uses 3 NOPs to provide a very short delay between serial output clock changes. The M3 instruction set defines the time for a NOP as 1 cycle. I notice that NOPs in the M4 do not necessarily delay any time at all. I am aware that I will need to disable compiler optimisation but I'm looking for a low level command that will give me reliable, repeatable times. In practice in this particular case the serial is used very occasionally and could be very slow but I'd still like to know the best way to obtain cycle level delays.

Contextual answered 12/5, 2014 at 14:52 Comment(5)

Are you unable to use a UART or peripheral timer? – Achromat 12/5, 2014 at 16:42

No I have no timers available that could be setup in time or spare for free running. – Contextual 13/5, 2014 at 8:46

the uart has its own clock divisor. – Varlet 14/5, 2014 at 3:33

I am unable to use a UART or peripheral timer to generate a 24ns delay. – Contextual 14/5, 2014 at 10:22

According to the ARM Cortex-M3 Devices Generic User Guide, the NOP instruction will not necessarily consume any time on a Cortex-M3 either. – Underclassman 26/6, 2018 at 10:27

If you need such very short, but deterministic "at least" delays, maybe you could consider using other instructions than nop which have deterministic nonzero latency.

The Cortex-M4 NOP as described is not necessarily time consuming.

You could replace it with, say, and reg, reg, or something roughly equivalent to a nop in that context. Alternatively, when toggling GPIO, you could repeat the I/O instructions themselves to enforce the minimum length of a state (for example, if your GPIO write instruction takes at least 5ns, repeat it five times to get at least 25ns). This even works well in C if you were inserting nops in a C program: just repeat the writes to the port. If the port is declared volatile, as it should be, the compiler won't remove the repeated accesses.
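A minimal sketch of both ideas, under stated assumptions: fake_gpio_odr is a hypothetical stand-in for a real memory-mapped output register (so the fragment also builds on a PC), writes_needed, WRITE_X5 and pad_three_instructions are illustrative names, and the 5ns-per-write figure has to be measured on your own part. GCC does not remove an asm volatile statement even with optimisation enabled, so the padding does not depend on compiler settings.

```c
#include <stdint.h>

/* Hypothetical stand-in for a memory-mapped GPIO output register.
 * On real hardware this would be the vendor's register definition. */
static volatile uint32_t fake_gpio_odr;

/* Back-to-back writes needed so that a pin state lasts at least
 * min_hold_ns when a single write takes at least write_ns (rounds up). */
static uint32_t writes_needed(uint32_t min_hold_ns, uint32_t write_ns)
{
    return (min_hold_ns + write_ns - 1u) / write_ns;
}

/* Five straight-line writes of the same value. Each access to a volatile
 * object must be emitted, so the compiler cannot merge or drop them. */
#define WRITE_X5(reg, value) \
    do { (reg) = (value); (reg) = (value); (reg) = (value); \
         (reg) = (value); (reg) = (value); } while (0)

/* Alternative padding: three register moves in place of three NOPs.
 * Only emitted when building for 32-bit ARM; empty elsewhere. */
static inline void pad_three_instructions(void)
{
#if defined(__arm__)
    __asm__ volatile ("mov r0, #1\n\t"
                      "mov r0, #1\n\t"
                      "mov r0, #1" : : : "r0");
#endif
}
```

For the 25ns example, writes_needed(25, 5) evaluates to 5, which is where the five writes in WRITE_X5 come from. Both approaches only guarantee a minimum: flash wait states or bus contention can make the real delay longer.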

Of course, this only applies to very short delays. For delays that are merely short, as others have mentioned, busy loops that wait on a timing source work much better (they take at least the clocks required to sample the timing source, set up the target, and go through the wait loop once).

Rhabdomancy answered 1/6, 2017 at 8:5 Comment(1)

Many thanks, as I said below I'm using MOV R0,#1. It's been in use on many production units since shortly after I wrote the question in 2014 and so far it's worked perfectly. – Contextual 1/6, 2017 at 8:18

Use the cycle-counting register (DWT_CYCCNT) to get high-precision timing!

Note: I have also tested this using digital pins and an oscilloscope, and it is extremely accurate.

See stopwatch_delay(ticks) and supporting code below, which uses the DWT_CYCCNT register (part of the Cortex-M Data Watchpoint and Trace unit, shown here on an STM32). It is specifically designed to count actual clock ticks and is located at address 0xE0001004.

See main for an example which uses STOPWATCH_START/STOPWATCH_STOP to measure how long the stopwatch_delay(ticks) actually took, using CalcNanosecondsFromStopwatch(m_nStart, m_nStop).

Modify the ticks argument to adjust the delay.

#include <stdint.h>
#include <stdio.h>   // printf, used in main below

uint32_t m_nStart;               //DEBUG Stopwatch start cycle counter value
uint32_t m_nStop;                //DEBUG Stopwatch stop cycle counter value

#define DEMCR_TRCENA    0x01000000

/* Core Debug registers */
#define DEMCR           (*((volatile uint32_t *)0xE000EDFC))
#define DWT_CTRL        (*(volatile uint32_t *)0xe0001000)
#define CYCCNTENA       (1<<0)
#define DWT_CYCCNT      ((volatile uint32_t *)0xE0001004)
#define CPU_CYCLES      *DWT_CYCCNT
#define CLK_SPEED         168000000 // EXAMPLE for CortexM4, EDIT as needed

#define STOPWATCH_START { m_nStart = *((volatile unsigned int *)0xE0001004);}
#define STOPWATCH_STOP  { m_nStop = *((volatile unsigned int *)0xE0001004);}


static inline void stopwatch_reset(void)
{
    /* Enable DWT */
    DEMCR |= DEMCR_TRCENA; 
    *DWT_CYCCNT = 0;             
    /* Enable CPU cycle counter */
    DWT_CTRL |= CYCCNTENA;
}

static inline uint32_t stopwatch_getticks()
{
    return CPU_CYCLES;
}

static inline void stopwatch_delay(uint32_t ticks)
{
    uint32_t start_ticks = stopwatch_getticks();
    // Unsigned subtraction stays correct when DWT_CYCCNT wraps around
    while ((stopwatch_getticks() - start_ticks) < ticks)
    {
    }
}

// WARNING: ONLY VALID FOR <25ms measurements due to scaling by 1000!
uint32_t CalcNanosecondsFromStopwatch(uint32_t nStart, uint32_t nStop)
{
    uint32_t nDiffTicks;
    uint32_t nSystemCoreTicksPerMicrosec;
    
    // Convert (clk speed per sec) to (clk speed per microsec)
    nSystemCoreTicksPerMicrosec = CLK_SPEED / 1000000;
    
    // Elapsed ticks
    nDiffTicks = nStop - nStart;
    
    // Elapsed nanosec = 1000 * (ticks-elapsed / clock-ticks in a microsec)
    return 1000 * nDiffTicks / nSystemCoreTicksPerMicrosec;
} 

void main(void)
{
    int timeDiff = 0;
    stopwatch_reset();
    
    // =============================================
    // Example: use a delay, and measure how long it took
    STOPWATCH_START;
    stopwatch_delay(168000); // 168k ticks is 1ms for 168MHz core
    STOPWATCH_STOP;
    
    timeDiff = CalcNanosecondsFromStopwatch(m_nStart, m_nStop);
    printf("My delay measured to be %d nanoseconds\n", timeDiff);
    
    // =============================================
    // Example: measure function duration in nanosec
    STOPWATCH_START;
    // run_my_function() => do something here
    STOPWATCH_STOP;
    
    timeDiff = CalcNanosecondsFromStopwatch(m_nStart, m_nStop);
    printf("My function took %d nanoseconds\n", timeDiff);
}
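
The 25ms limit noted above CalcNanosecondsFromStopwatch comes from evaluating 1000 * nDiffTicks in 32 bits. A minimal sketch of one way around it, assuming 64-bit arithmetic is acceptable on the target; ticks_to_ns is a hypothetical helper name, not part of the code above:

```c
#include <stdint.h>

/* Convert elapsed cycle-counter ticks to nanoseconds.
 * The multiplication is done in 64 bits, so the full 32-bit tick range
 * is usable without overflow. Assumes clk_hz is nonzero. */
static uint64_t ticks_to_ns(uint32_t ticks, uint32_t clk_hz)
{
    return ((uint64_t)ticks * 1000000000ULL) / clk_hz;
}
```

With a 168MHz clock, ticks_to_ns(168000, 168000000) gives 1000000 (1ms). The 64-bit division is comparatively slow on a Cortex-M4, so it is better done after the measurement than inside a timed section.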

Update: adding the concise solution alluded to by @vgru in comments section

// general but accurate (5% err at 10us delay, but 22% err at 1us delay)
#pragma GCC push_options
#pragma GCC optimize ("O3")
void delayUS_DWT(uint32_t us) {
    volatile uint32_t cycles = (SystemCoreClock/1000000L)*us;
    volatile uint32_t start = DWT->CYCCNT;
    do  {
    } while(DWT->CYCCNT - start < cycles);
}
#pragma GCC pop_options

Also adding the most accurate but inflexible ASM solution in the same link from @vgru

// most accurate but the '16' needs to be adjusted if <84MHz
// R0 and the condition flags are modified, so both are listed as clobbers
#define delayUS_ASM(us) do {\
    asm volatile (  "MOV R0,%[loops]\n\t"\
            "1: \n\t"\
            "SUB R0, #1\n\t"\
            "CMP R0, #0\n\t"\
            "BNE 1b \n\t" : : [loops] "r" (16*(us)) : "r0", "cc", "memory"\
              );\
} while(0)

Foucquet answered 21/4, 2017 at 4:51 Comment(9)

You can also verify this behavior with an oscilloscope, and digital pins. – Foucquet 21/4, 2017 at 4:53

This gives me short delays but not very short delays. – Contextual 3/5, 2017 at 10:8

@Ant, you can set the delay in ticks as needed; How short were you hoping for? – Foucquet 29/10, 2019 at 3:54

The delay I wanted was 3 cycles. – Contextual 29/10, 2019 at 9:53

Same comments as in this answer. On a 168MHz processor, DWT_CYCCNT overflows after 25 seconds, but when you do 1000 * nDiffTicks, you will overflow it after 25ms, which is unnecessary. stopwatch_reset() is also usually not needed, although if you remove it then stopwatch_getticks() >= end_ticks won't work. I would suggest a simpler (and correct) implementation like the delayUS_DWT function posted near the end of this article. – Uremia 2/12, 2019 at 13:37

Surely the best way to measure cycles. Can you point to documentation that presents the DWT registers and explains how they work? The ARM and ST docs I could find only present the register fields and relate them directly to ETM, ITM and so on. – Strengthen 22/4, 2020 at 7:57

@Uremia thank you for sharing -- included the "concise solution" and "most accurate but inflexible" solution, respectively, which I am glad to also see scope measurements for – Foucquet 14/9, 2023 at 1:14

@bunkerdive: Thanks! I wonder why the local variables in the concise solution are volatile in the article though? It probably does not have enormous difference, but it's slightly unusual to see it. Seems the loop is just a bit tighter if we remove volatile (godbolt.org/z/v68rTEq3z). – Uremia 15/9, 2023 at 9:58

@Uremia this is a good question -- at face value I do not see why those non-module vars should not move into registers during the loop (nefarious stack/register manipulation aside). Even putting a const in front of them results in (nearly) identical instructions. Love the Compiler Explorer btw. – Foucquet 15/9, 2023 at 22:28

For any reliable timing, I always suggest using a general purpose timer. Your part may have a timer that can be clocked fast enough to give you the timing you need. For serial, is there a reason you can't use a corresponding serial peripheral? Most of the Cortex M3/M4s that I'm aware of offer USARTs, I2C, and SPI, with many also offering SDIO, which should cover most needs.

If that is not possible, this stackoverflow question/answer details using the cycle counter, if available, on a Cortex M3/M4. You could grab the cycle counter and add a few to it and poll it, but I don't think you would achieve anything reasonably below ~8 cycles for minimum delay with this method.
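
A minimal sketch of that polling approach, written so that the wraparound behaviour can be checked on a PC. The names delay_cycles, cycle_reader_fn and fake_cyccnt are hypothetical. On a real Cortex-M3/M4 the reader would return the DWT cycle counter instead, and a production version would read the register directly, since the function-pointer call adds overhead on top of the minimum delay.

```c
#include <stdint.h>

/* Returns the current cycle count. On a real Cortex-M3/M4 this would
 * read the DWT cycle counter; here it is injected so a stub can be used. */
typedef uint32_t (*cycle_reader_fn)(void);

/* Busy-wait until at least `cycles` counter ticks have elapsed.
 * The unsigned difference (now - start) stays correct across a wrap
 * of the 32-bit counter, unlike comparing against start + cycles. */
static void delay_cycles(cycle_reader_fn now, uint32_t cycles)
{
    uint32_t start = now();
    while ((uint32_t)(now() - start) < cycles) {
        /* spin */
    }
}

/* Host-side stand-in for the counter: advances one tick per read and
 * starts just below the 32-bit limit to exercise the wraparound case. */
static uint32_t fake_cyccnt_value = 0xFFFFFFF0u;

static uint32_t fake_cyccnt(void)
{
    return fake_cyccnt_value++;
}
```

Calling delay_cycles(fake_cyccnt, 32) returns even though the stub counter wraps through zero during the wait.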

Achromat answered 12/5, 2014 at 16:49 Comment(1)

This is not standard serial, for SPI and I2C I am happily using peripherals. This needs to be GPIO driven with a few cycles delay. I agree also that the cycle counter wouldn't work. – Contextual 13/5, 2014 at 8:49

Well, first you have to run from RAM, not flash, as the flash timing is going to be slow; one nop can take many cycles. The GPIO accesses should take at least a few clocks as well, so you probably won't need or want nops: just pound on the GPIO. The branch at the end of the loop will be noticeable as well. You should write a few instructions to RAM, branch to them, and see how fast you can wiggle the GPIO.

The bottom line, though, is that if you are on such a tight budget that your serial clock is that close to your processor clock in speed, it is very likely you are not going to get this to work with this processor. Upping the PLL in the processor won't change the flash speed, and it can make it worse (relative to the processor clock). The SRAM should scale, though, so if you have headroom left on your processor clock and the power budget to support it, repeat the experiment in SRAM with a faster processor clock speed.

Varlet answered 14/5, 2014 at 3:41 Comment(6)

In practice 3 NOPs gives me just the time I want but I don't think that is good enough as the documentation states that they may be removed by the pipeline. I could imagine shipping product with a next version processor that has better optimisation and suddenly nothing works as previously. I'm looking for a reliable method of inserting a few nanosecond delay. I'm currently using MOV R0,#1 after turning off compiler optimisation, as I have found no comment about these being removed. – Contextual 14/5, 2014 at 10:26

I would think about that statement: what would cause them to decide to remove them from the pipeline, what internal or outside forces? If your code is not changing and the system is tightly controlled, the core would not have any new inputs or fetch variations, etc. that would cause the pipe to not do the same thing it has always been doing. Now on the other hand, sure, from one rev of chip to another that might change, but you can look at the rev of the cores that are available and the rev that the chip vendor is using (I suspect they don't just pop out a cortex-m4 and replace it with another – Varlet 14/5, 2014 at 15:1

during a simple chip spin, but who knows. – Varlet 14/5, 2014 at 15:2

The bottom line is the same: if the best you can do is three nops to get your timing, and this is not a PIC, you are too tight. You need some other chip; your processor speed to signal speed does not have enough margin. – Varlet 14/5, 2014 at 15:3

What would cause them to decide to remove them from the pipeline? Because they are implementing what they documented - the documentation says they may be removed. Need some other chip - this product is in production it's not a bedroom hobby. – Contextual 15/5, 2014 at 12:1

Well, that is part of the game of documents. It may be that all implementations of that core or family of cores do it some of the time. It may be that some versions of one of the cores do it and the others don't. Once you get past those questions, then if there is a core that "sometimes" does it, the question is what determines when it does and doesn't, and I certainly don't know the answers to any of those questions. I still contend that with this processor, even if the nops execute EVERY time, you are too tight on your processor speed to signal ratio. – Varlet 15/5, 2014 at 18:28
