Is there an equivalent instruction to rdtsc in ARM?
B

3

29

For my project I must use inline assembly instructions such as rdtsc to measure the execution time of some C/C++ statements.

The following code seems to work on Intel but not on ARM processors:

{unsigned a, d;asm volatile("rdtsc" : "=a" (a), "=d" (d)); t0 = ((unsigned long)a) | (((unsigned long)d) << 32);}
// The C++ statement whose execution time is being measured
{unsigned a, d;asm volatile("rdtsc" : "=a" (a), "=d" (d)); t1 = ((unsigned long)a) | (((unsigned long)d) << 32);}
time = t1-t0;

My question is:

How can I write inline assembly similar to the above (to measure the elapsed execution time of an instruction) that works on ARM processors?

Biography answered 6/11, 2016 at 20:26 Comment(5)
rdtsc on multi-core processors can have issues. see msdn.microsoft.com/en-us/library/ee417693(VS.85).aspxDemission
Single instructions will have variable timings based on cache etc. Better to loop thousands of times over it/them and use the perf_events() common functionality to make it work on all supported CPUs.Popovich
@RichardCritten rdtsc() is very reliable on all modern CPUs. Even on multisocket systems, that have a few years old CPUs, I get nearly identical values for rdtsc() over all the cores. Only very old systems, that don't have constant_tsc and nonstop_tsc() in their capabilities, have those issues mentioned in the microsoft document.Tricia
@KaiPetzke did you check the date of my comment ?Demission
@RichardCritten Yes, I did check the date. But in particular, because that linked Microsoft document is still available and was not updated, I found it necessary to comment, that this is now a problem of the past.Tricia
S
22

You should read the PMCCNTR register of coprocessor p15 (not an actual coprocessor, just an entry point for CPU functions) to obtain a cycle count. Note that it is available to an unprivileged app only if:

  1. Unprivileged PMCCNTR reads are allowed:

    Bit 0 of PMUSERENR register must be set to 1 (official docs)

  2. PMCCNTR is actually counting cycles:

    Bit 31 of PMCNTENSET register must be set to 1 (official docs)

This is a real-world example of how it's done.
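As a rough illustration of the read itself, here is a minimal sketch in C. It assumes an ARMv7-A (AArch32) target where the kernel has already enabled both conditions above. The helper names read_pmccntr, cycles_elapsed and cycles_per_iteration are hypothetical, and the non-ARM branch that returns 0 exists only so the sketch compiles elsewhere. PMCCNTR is 32 bits wide, so the difference is taken with unsigned arithmetic, which stays correct across a single wrap-around.

```c
#include <assert.h>
#include <stdint.h>

/* Read PMCCNTR, the ARMv7-A PMU cycle counter, through coprocessor p15.
 * On non-ARM targets this returns 0 so the sketch still compiles; that
 * fallback is an assumption of this example only. */
static inline uint32_t read_pmccntr(void)
{
#if defined(__arm__)
    uint32_t value;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(value));
    return value;
#else
    return 0;
#endif
}

/* Cycles between two samples. Unsigned subtraction is modulo 2^32,
 * so the result is still correct if the counter wrapped once. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Average cost when the measured statement was repeated in a loop. */
static inline double cycles_per_iteration(uint32_t total_cycles,
                                          uint32_t iterations)
{
    assert(iterations != 0);
    return (double)total_cycles / iterations;
}
```

Typical use would be to sample read_pmccntr() before and after a loop that repeats the statement many times, then pass cycles_elapsed(t0, t1) and the loop count to cycles_per_iteration().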

Strawflower answered 6/11, 2016 at 22:5 Comment(8)
@Biography Note that the answer above is valid for ARMv6 and above. Older architecture versions might have their own methods of getting this data (specific to a particular chip, so the info is to be found in the datasheet for the chip), while some ARM-based chips don't provide such data at all.Talkie
My ARM CPU is ARM7A, which I confirmed with the compiler macro __ARM_ARCH_7A__. However, when I try to use the instruction asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(pmccntr));, the compiler gives the error message: Error “no such instruction” asm volatile("mrc p15, 0, %eax, c9, c13, 0" : "=r"(pmccntr));Biography
My Build Environment= PLATFORM_VERSION_CODENAME=REL PLATFORM_VERSION=4.3 TARGET_PRODUCT=full_manta TARGET_BUILD_VARIANT=eng TARGET_BUILD_TYPE=release TARGET_BUILD_APPS= TARGET_ARCH=arm TARGET_ARCH_VARIANT=armv7-a-neon TARGET_CPU_VARIANT=cortex-a15 HOST_ARCH=x86 HOST_OS=linux HOST_OS_EXTRA=Linux-3.16.0-70-generic-x86_64-with-Ubuntu-14.04-trusty HOST_BUILD_TYPE=release BUILD_ID=JWR66V OUT_DIR=outBiography
@hidefromkgb: This is the code that I used but it gives the above error. {uint32_t pmccntr;asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(pmccntr));t0=static_cast<int64_t>(pmccntr) * 64;} //The C++ statement to measure its execution time {uint32_t pmccntr;asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(pmccntr));t1=static_cast<int64_t>(pmccntr) * 64;} time = t1-t0;Biography
@Biography You are most probably using the wrong binutils, since as is definitely trying to assemble X86 instead of ARM7A. And, BTW, * 64 is equivalent to << 6, and the result does not have to be either promoted to uint64_t or multiplied until (T1 – T0) is calculated. As the difference is typically way smaller than 2²⁶, multiplying it by 64 won't require promotion to a 64-bit type.Strawflower
@hidefromkgb: I followed the AOSP guidelines and I did not change anything in the binutils. How to force the as to assemble ARM7A instead of X86?Biography
@Biography AOSP is by itself useless in your case: it does not allow anything but Java as a language in which apps can be written. And that's for a good reason, as there are many different hardware architectures that support Android, so compiling machine code for all of them is a pain, and you'd still leave out those which aren't yet supported. What you really need is the Android NDK. The NDK is positioned as a last-resort kit intended for programmers who positively do know what they are doing.Strawflower
The URL provided is really helpful. Thanks.Glomma
C
12

For Arm64, the system register CNTVCT_EL0 can be used to retrieve the counter from user space.

// SPDX-License-Identifier: GPL-2.0
u64 rdtsc(void)
{
    u64 val;

    /*
     * According to ARM DDI 0487F.c, from Armv8.0 to Armv8.5 inclusive, the
     * system counter is at least 56 bits wide; from Armv8.6, the counter
     * must be 64 bits wide.  So the system counter could be less than 64
     * bits wide, in which case the flag 'cap_user_time_short' is set to
     * true.
     */
    asm volatile("mrs %0, cntvct_el0" : "=r" (val));

    return val;
}

Please refer to this patch https://lore.kernel.org/patchwork/patch/1305380/ for more details.

Crept answered 14/6, 2021 at 10:1 Comment(7)
Do you think it's appropriate to relicense that GPL 2.0 code from the Linux kernel as CC BY-SA by posting it on StackOverflow?Aquiculture
@JeffHammond Thank you for pointing it out. I added the GPL 2.0 license.Crept
@JeffHammond: can you put a license on a sequence of two assembler instructions?Otway
Does the GPL include a minimum number of things that are copied before it applies?Aquiculture
I tried this code, but it didn't work for me. It doesn't make sense that a sequence of code with 2304 multiplications executes in 30 cycles.Aldwin
It is not working for me either. I tried to approximate the CPU frequency in a 2 GHz aarch64 processor. s = rdtsc(); sleep(1); e = rdtsc(); freq = (double)(e - s) / 10e9. This code is reporting 20 MHz.Missing
CNTVCT_EL0, CNTPCT_EL0 and friends aren't CPU cycle counters like x86 rdtsc is. Instead, they tick at some implementation-defined constant frequency which can be read from CNTFRQ_EL0. The ARM docs say the frequency is typically between 1 and 50 MHz. The good thing is they can be used for timing even when the CPU clock frequency may vary due to power saving modes, etc.Modest
T
1

I found a non-GPL, BSD-licensed source for reading the system's real-time counter on ARM64 here: https://github.com/cloudius-systems/osv/blob/master/arch/aarch64/arm-clock.cc

Even better, the code there doesn't just deliver some sort of ticks, as rdtsc() does on Intel/AMD; it also reports the frequency of those ticks. And yes, it's in sync across all cores on multi-core systems. So it can be useful for many things, including benchmarking or keeping track of the threads in a thread pool. Of course, it won't have the long-term stability of a system clock that is synced to an external time source via NTP.

The class arm_clock defined in the cited code might be overkill for many purposes. For example, it also shows how to set hardware timers, which a normal user-mode process likely won't have permission to do. Here is an excerpt of the most important parts, enough to read the counter and its frequency. It compiles fine with recent GCC on Intel, AMD and ARM. Of course, the frequency reading is provided only on ARM:

#include <stdint.h>

#ifdef __ARM_ARCH_ISA_A64
// Adapted from: https://github.com/cloudius-systems/osv/blob/master/arch/aarch64/arm-clock.cc
uint64_t rdtsc() {
    // Note: this reads the CNTVCT CPU system register, which provides
    // the system-wide consistent value of the virtual system counter.
    uint64_t cntvct;
    asm volatile ("mrs %0, cntvct_el0; " : "=r"(cntvct) :: "memory");
    return cntvct;
}

uint64_t rdtsc_barrier() {
    uint64_t cntvct;
    asm volatile ("isb; mrs %0, cntvct_el0; isb; " : "=r"(cntvct) :: "memory");
    return cntvct;
}

uint32_t rdtsc_freq() {
    // MRS transfers a full 64-bit register; the frequency is in the low 32 bits.
    uint64_t freq_hz;
    asm volatile ("mrs %0, cntfrq_el0; isb; " : "=r"(freq_hz) :: "memory");
    return (uint32_t)freq_hz;
}
#else
#include <x86intrin.h>
uint64_t rdtsc(){ return __rdtsc(); }
#endif

In my tests, execution times were around 7 ns for rdtsc() and 30 ns for rdtsc_barrier(), which is comparable to Intel and AMD.
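Because the counter ticks at the frequency returned by rdtsc_freq() rather than at the CPU clock, a tick difference has to be scaled before it means anything as time. The following is a minimal sketch of that conversion. ticks_to_ns is a hypothetical helper name, not part of the cited code, and it assumes a frequency below roughly 18 GHz so that the intermediate multiplication fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a tick delta to nanoseconds, given the counter frequency in Hz.
 * The delta is split into whole seconds and a sub-second remainder so that
 * remainder * 1e9 cannot overflow 64 bits (assumes freq_hz < ~1.8e10). */
static inline uint64_t ticks_to_ns(uint64_t ticks, uint64_t freq_hz)
{
    assert(freq_hz != 0);
    uint64_t seconds = ticks / freq_hz;
    uint64_t remainder = ticks % freq_hz;
    return seconds * 1000000000ULL
         + (remainder * 1000000000ULL) / freq_hz;
}
```

With the functions above, a measurement might look like ticks_to_ns(t1 - t0, rdtsc_freq()), where t0 and t1 are two rdtsc_barrier() samples. At a counter frequency of 20 MHz, one tick corresponds to 50 ns, which bounds the resolution of any single measurement.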

Tricia answered 24/2 at 20:38 Comment(0)
