I am porting some code from a Cortex-M3 to a Cortex-M4. The code uses 3 NOPs to provide a very short delay between serial output clock changes. The M3 instruction set defines the time for a NOP as 1 cycle. I notice that NOPs on the M4 do not necessarily delay any time at all. I am aware that I will need to disable compiler optimisation, but I'm looking for a low-level command that will give me reliable, repeatable times. In practice, in this particular case the serial is used very occasionally and could be very slow, but I'd still like to know the best way to obtain cycle-level delays.
If you need such very short, but deterministic "at least" delays, maybe you could consider using instructions other than nop which have deterministic nonzero latency.
The Cortex-M4 NOP, as described, is not necessarily time consuming.
You could replace it with, say, and reg, reg, or something roughly equivalent to a nop in the context. Alternatively, when toggling GPIO, you could also repeat the I/O instructions themselves to enforce the minimal length of a state (for example, if your GPIO write instruction takes at least 5ns, repeat it five times to get at least 25ns). This could even work well within C if you were inserting nops in a C program (just repeat the writes to the port; if it's volatile, as it should be, the compiler won't remove the repeated accesses).
Of course this only applies to very short delays. For longer (but still short) delays, as mentioned by others, busy loops waiting for some timing source would work much better (they take at least the clocks required to sample the timing source, set up the target, and go through the wait loop once).
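As a minimal sketch of the "repeat the write" idea in C: the register below is a plain volatile variable standing in for a memory-mapped GPIO output register, and the names fake_gpio_out, CLK_PIN, clk_hold_high and clk_hold_low are hypothetical, chosen only for illustration. On real hardware the register address and pin mask would come from the device header, and how long each write takes depends on the bus, so the repeat count has to be checked with a scope.

```c
#include <stdint.h>

/* Hypothetical stand-in for a memory-mapped GPIO output register.
 * On a real part this would be a fixed address from the device header. */
static volatile uint32_t fake_gpio_out;

#define CLK_PIN (1u << 3) /* assumed pin position, illustration only */

/* Drive the clock pin high and hold it for at least three register writes.
 * Each statement accesses a volatile object, so the compiler has to emit
 * all three accesses even with optimisation enabled. */
static inline void clk_hold_high(void)
{
    fake_gpio_out |= CLK_PIN;
    fake_gpio_out |= CLK_PIN;
    fake_gpio_out |= CLK_PIN;
}

/* Same idea for the low phase of the clock. */
static inline void clk_hold_low(void)
{
    fake_gpio_out &= ~CLK_PIN;
    fake_gpio_out &= ~CLK_PIN;
    fake_gpio_out &= ~CLK_PIN;
}
```

Calling clk_hold_high() followed by clk_hold_low() produces one clock pulse whose high time is bounded below by three writes. The writes are unrolled rather than looped so that no branch timing is mixed into the delay.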
Use the cycle-counting register (DWT_CYCCNT) to get high-precision timing!
Note: I have also tested this using digital pins and an oscilloscope, and it is extremely accurate.
See stopwatch_delay(ticks) and supporting code below, which uses the STM32's DWT_CYCCNT register, specifically designed to count actual clock ticks, located at address 0xE0001004.
See main for an example which uses STOPWATCH_START/STOPWATCH_STOP to measure how long stopwatch_delay(ticks) actually took, using CalcNanosecondsFromStopwatch(m_nStart, m_nStop).
Modify the ticks input to make adjustments.
#include <stdint.h>
#include <stdio.h>

uint32_t m_nStart; // DEBUG Stopwatch start cycle counter value
uint32_t m_nStop;  // DEBUG Stopwatch stop cycle counter value

#define DEMCR_TRCENA 0x01000000

/* Core Debug registers */
#define DEMCR      (*((volatile uint32_t *)0xE000EDFC))
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000)
#define CYCCNTENA  (1<<0)
#define DWT_CYCCNT ((volatile uint32_t *)0xE0001004)
#define CPU_CYCLES (*DWT_CYCCNT)
#define CLK_SPEED  168000000 // EXAMPLE for CortexM4, EDIT as needed

#define STOPWATCH_START { m_nStart = *((volatile unsigned int *)0xE0001004);}
#define STOPWATCH_STOP  { m_nStop = *((volatile unsigned int *)0xE0001004);}

static inline void stopwatch_reset(void)
{
    /* Enable DWT */
    DEMCR |= DEMCR_TRCENA;
    *DWT_CYCCNT = 0;
    /* Enable CPU cycle counter */
    DWT_CTRL |= CYCCNTENA;
}

static inline uint32_t stopwatch_getticks(void)
{
    return CPU_CYCLES;
}

static inline void stopwatch_delay(uint32_t ticks)
{
    // NOTE: comparing against an absolute end value is only valid while the
    // counter does not wrap between stopwatch_reset() and the end of the delay.
    uint32_t end_ticks = ticks + stopwatch_getticks();
    while(1)
    {
        if (stopwatch_getticks() >= end_ticks)
            break;
    }
}

// WARNING: ONLY VALID FOR <25ms measurements due to scaling by 1000!
uint32_t CalcNanosecondsFromStopwatch(uint32_t nStart, uint32_t nStop)
{
    uint32_t nDiffTicks;
    uint32_t nSystemCoreTicksPerMicrosec;

    // Convert (clk speed per sec) to (clk speed per microsec)
    nSystemCoreTicksPerMicrosec = CLK_SPEED / 1000000;

    // Elapsed ticks
    nDiffTicks = nStop - nStart;

    // Elapsed nanosec = 1000 * (ticks-elapsed / clock-ticks in a microsec)
    return 1000 * nDiffTicks / nSystemCoreTicksPerMicrosec;
}

int main(void)
{
    uint32_t timeDiff = 0;

    stopwatch_reset();

    // =============================================
    // Example: use a delay, and measure how long it took
    STOPWATCH_START;
    stopwatch_delay(168000); // 168k ticks is 1ms for 168MHz core
    STOPWATCH_STOP;
    timeDiff = CalcNanosecondsFromStopwatch(m_nStart, m_nStop);
    printf("My delay measured to be %lu nanoseconds\n", (unsigned long)timeDiff);

    // =============================================
    // Example: measure function duration in nanosec
    STOPWATCH_START;
    // run_my_function() => do something here
    STOPWATCH_STOP;
    timeDiff = CalcNanosecondsFromStopwatch(m_nStart, m_nStop);
    printf("My function took %lu nanoseconds\n", (unsigned long)timeDiff);
}
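
The 25ms limit in CalcNanosecondsFromStopwatch comes from doing 1000 * nDiffTicks in 32 bits. A possible way around it, sketched here under the assumption that 64-bit arithmetic is acceptable in the measuring code, is to widen before multiplying. The name ticks_to_ns is hypothetical and not part of the code above.

```c
#include <stdint.h>

/* Convert elapsed cycle-counter ticks to nanoseconds.
 * The product ticks * 1e9 is at most about 4.3e18, which fits in 64 bits,
 * so any 32-bit tick difference can be converted without overflow.
 * clk_hz is the core clock in Hz (for example 168000000). */
static inline uint64_t ticks_to_ns(uint32_t ticks, uint32_t clk_hz)
{
    return ((uint64_t)ticks * 1000000000ull) / clk_hz;
}
```

For example, ticks_to_ns(m_nStop - m_nStart, CLK_SPEED) covers the full range of the counter. Note that a 64-bit division on a Cortex-M4 is done by a library routine rather than a single instruction, so this belongs after the measurement, not inside the timed region.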
Update: adding the concise solution alluded to by @vgru in comments section
// general but accurate (5% err at 10us delay, but 22% err at 1us delay)
// DWT and SystemCoreClock come from the CMSIS device header;
// the cycle counter must already be enabled.
#pragma GCC push_options
#pragma GCC optimize ("O3")
void delayUS_DWT(uint32_t us) {
    volatile uint32_t cycles = (SystemCoreClock/1000000L)*us;
    volatile uint32_t start = DWT->CYCCNT;
    do {
    } while(DWT->CYCCNT - start < cycles);
}
#pragma GCC pop_options
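
What makes the loop in delayUS_DWT robust is the subtraction: unsigned arithmetic is modulo 2^32, so now - start is the elapsed count even when the counter has wrapped in between. The sketch below isolates that comparison so it can be checked on a PC; the function names are hypothetical, and the second function is the absolute-deadline form shown only for contrast.

```c
#include <stdint.h>
#include <stdbool.h>

/* True while fewer than `cycles` ticks have elapsed since `start`.
 * (now - start) is computed modulo 2^32, so it is correct across a wrap. */
static inline bool still_waiting(uint32_t now, uint32_t start, uint32_t cycles)
{
    return (uint32_t)(now - start) < cycles;
}

/* Absolute-deadline form, for comparison only: if start + cycles wraps,
 * `end` is a small number and the wait appears to be over immediately. */
static inline bool still_waiting_naive(uint32_t now, uint32_t start, uint32_t cycles)
{
    uint32_t end = start + cycles;
    return now < end;
}
```

With start = 0xFFFFFFF0 and cycles = 0x20, the deadline wraps to 0x10. At now = 0xFFFFFFF8 only 8 ticks have passed: still_waiting reports true, while still_waiting_naive already reports false.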
Also adding the most accurate but inflexible ASM solution in the same link from @vgru
// most accurate but the '16' needs to be adjusted if <84MHz
// R0 and the condition flags are modified, so both are listed as clobbers
#define delayUS_ASM(us) do {\
    asm volatile ( "MOV R0,%[loops]\n\t"\
                   "1: \n\t"\
                   "SUB R0, #1\n\t"\
                   "CMP R0, #0\n\t"\
                   "BNE 1b \n\t" : : [loops] "r" (16*(us)) : "r0", "cc", "memory"\
    );\
} while(0)
DWT_CYCCNT overflows after 25 seconds, but when you do 1000 * nDiffTicks, you will overflow it after 25ms, which is unnecessary. stopwatch_reset() is also usually not needed, although if you remove it then stopwatch_getticks() >= end_ticks won't work. I would suggest a simpler (and correct) implementation like the delayUS_DWT function posted near the end of this article. – Uremia
const in front of them results in (nearly) identical instructions. Love the Compiler Explorer btw. –
Foucquet
For any reliable timing, I always suggest using a general-purpose timer. Your part may have a timer that is capable of clocking high enough to give you the timing you need. For serial, is there a reason you can't use a corresponding serial peripheral? Most of the Cortex-M3/M4s that I'm aware of offer USARTs, I2C, and SPI, with many also offering SDIO, which should cover most needs.
If that is not possible, this Stack Overflow question/answer details using the cycle counter, if available, on a Cortex-M3/M4. You could grab the cycle counter, add a few cycles to it and poll it, but I don't think you would achieve anything reasonably below ~8 cycles for minimum delay with this method.
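
To decide how many cycles to add to the counter for a given minimum time, the count has to be rounded up rather than truncated, otherwise the delay can come out shorter than requested. A small sketch, with the hypothetical name ns_to_cycles_ceil and the assumption that the core clock in Hz is known:

```c
#include <stdint.h>

/* Smallest number of core cycles that lasts at least `ns` nanoseconds
 * at a core clock of `clk_hz` Hz (ceiling division in 64 bits). */
static inline uint32_t ns_to_cycles_ceil(uint32_t ns, uint32_t clk_hz)
{
    uint64_t product = (uint64_t)ns * clk_hz;
    return (uint32_t)((product + 999999999ull) / 1000000000ull);
}
```

At 168MHz a 25ns minimum works out to 4.2 cycles, so the function returns 5. The polling overhead mentioned above still applies on top of this, so for requests of only a few cycles the achieved delay will be longer than the computed count.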
Well, first you have to run from RAM, not flash, as the flash timing is going to be slow; one nop can take many cycles. The GPIO accesses should take at least a few clocks as well, so you probably won't need or want nops; just pound on the GPIO. The branch at the end of the loop will be noticeable as well. You should write a few instructions to RAM, branch to them, and see how fast you can wiggle the GPIO.
The bottom line, though, is that if you are on such a tight budget that your serial clock is that close to your processor clock in speed, it is very likely you are not going to get this to work with this processor. Upping the PLL in the processor won't change the flash speed, and it can make it worse (relative to the processor clock). The SRAM should scale, though, so if you have headroom left on your processor clock and the power budget to support it, repeat the experiment in SRAM with a faster processor clock speed.
© 2022 - 2024 — McMap. All rights reserved.