__rdtsc/__rdtscp for ARM Mac M1/M2?

I want to add some time measurement to my code. On x64 I use __rdtscp. Is there something similar for the Mac M1/M2? Specifically, something high-resolution that isn't a system call.

Triecious answered 10/12, 2022 at 22:53 Comment(2)
Can you explain the problem first? What requirements do you have? There might be better ways. – Proud
@Proud Profiling code that's already really fast. Something weird is happening, so I'm trying to guesstimate which functions are eating up what % of the time. I don't think my checks are in any loops, but I'd rather not ruin the L1 cache just to get a rough estimate of the time. – Triecious

Just use clock_gettime(CLOCK_MONOTONIC,...)

It is a VDSO function. That means that the kernel injects code into the userspace program that "does the right thing", so the userspace program can access the time stamp counter without doing a syscall.

On x86, it will [usually] invoke rdtsc [or read an HPET], and adjust the counter value to represent nanoseconds.

On ARM, the equivalent of the TSC is the generic timer's counter, a system register (CNTVCT_EL0 on AArch64). Userspace access to it is controlled by the kernel, and higher-end ARM architectures let the kernel enable read-only access from user mode. Once the kernel enables it, the VDSO snippet knows how to read the value directly.

Calls to clock_gettime are fast. So fast that it's not worth trying to access the counter register directly.
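
A minimal sketch of how this could be used and checked, assuming a POSIX system where clock_gettime and CLOCK_MONOTONIC are available. The helper names now_ns and avg_call_cost_ns are made up for illustration; the second one estimates the per-call cost from back-to-back calls, so you can verify the "fast" claim on your own machine rather than take it on faith.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: current CLOCK_MONOTONIC reading in nanoseconds. */
static inline uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical helper: rough average cost of one clock_gettime call,
   estimated by timing `iterations` back-to-back calls. */
static double avg_call_cost_ns(int iterations)
{
    struct timespec ts;
    uint64_t start = now_ns();
    for (int i = 0; i < iterations; i++)
        clock_gettime(CLOCK_MONOTONIC, &ts);
    uint64_t end = now_ns();
    return (double)(end - start) / (double)iterations;
}
```

To time a region, take now_ns() before and after it and subtract. If avg_call_cost_ns(1000000) reports something in the tens of nanoseconds, no syscall is likely involved; a result in the hundreds of nanoseconds or more suggests the call is going through the kernel.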

Also, it's not terribly meaningful to access the counter directly, because we still have to convert it to some standard unit (e.g. nanoseconds). The VDSO snippet will do this.
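
For comparison, here is a rough sketch of what reading the counter directly and converting it yourself might look like. It assumes AArch64 with user-mode reads of CNTVCT_EL0 and CNTFRQ_EL0 permitted, which a commenter below reports seems to be the case on Linux and macOS; I have not verified it on Apple silicon. The names raw_ticks, tick_hz and ticks_to_ns are hypothetical, and on other architectures the sketch simply falls back to clock_gettime so that it still builds.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: raw tick count. On AArch64 this reads the virtual
   counter CNTVCT_EL0 (assumes the kernel permits user-mode reads);
   elsewhere it falls back to clock_gettime in nanoseconds. */
static inline uint64_t raw_ticks(void)
{
#if defined(__aarch64__)
    uint64_t t;
    /* isb keeps the counter read from being executed early */
    __asm__ volatile("isb\n\tmrs %0, cntvct_el0" : "=r"(t) : : "memory");
    return t;
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Hypothetical helper: counter frequency in Hz. On AArch64 this reads
   CNTFRQ_EL0; the fallback counter above already ticks in nanoseconds. */
static inline uint64_t tick_hz(void)
{
#if defined(__aarch64__)
    uint64_t f;
    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(f));
    return f;
#else
    return 1000000000ull;
#endif
}

/* Convert ticks to nanoseconds. Whole seconds and the remainder are
   handled separately so the multiplication cannot overflow for
   frequencies up to a few GHz. hz must be non-zero. */
static inline uint64_t ticks_to_ns(uint64_t ticks, uint64_t hz)
{
    return (ticks / hz) * 1000000000ull
         + (ticks % hz) * 1000000000ull / hz;
}
```

The conversion step is exactly the arithmetic that clock_gettime already does for you, which is why the direct read only pays off if you are content to compare raw tick counts and convert once at the end.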


UPDATE:

Is it a VDSO call on macOS, too? –  fuz

My direct experience with ARM was on an NVIDIA Jetson [under Linux].

But, AFAIK, macOS provides [has to provide] clock_gettime.

On older kernels, it may have to issue a syscall equivalent.

But, since the architecture gives an OS/kernel the means to grant userspace direct access, there is every reason to believe a VDSO-style method is available under macOS as well. A vdso(7) man page is listed under OS X at https://www.unix.com/man-page/osx/7/vdso/ [although, as noted in the comments, that page comes from the Linux project].

The way to see the specific mechanism is to build a program that uses clock_gettime and [using gdb] single step it a bit. Then, it is possible to have gdb disassemble the clock_gettime code.

We have to use gdb [vs. objdump and/or readelf] for the disassembly because the snippet is loaded/injected by the kernel dynamically, so it's not easily accessible with static analysis.

Further, the injected code can be processor model specific. The kernel probes the CPU arch and its features during boot. It crafts the snippet based on the features it finds.

Using gdb is how I examined clock_gettime [about 3 years ago for a commercial product], to verify that it would access the H/W without a syscall and that it provided the correct nanosecond values. In that particular case, I also looked at the arch specific sections in the kernel source code.

Royall answered 10/12, 2022 at 23:50 Comment(4)
Is it a VDSO call on macOS, too? – Figured
The man page you linked is of the Linux kernel project. No such man page exists on my arm64 macOS system. However, I was able to confirm that seemingly no system call is performed for clock_gettime(CLOCK_MONOTONIC, ...). – Figured
I agree that clock_gettime is the right solution. But it's not true that the time stamp counter is privileged. The CNTVCT_EL0 system register is, as its name suggests, readable from EL0. (It can be set to trap for virtualization, but based on speed tests, it doesn't look like either Linux or macOS does so.) This is how Linux implements its userspace clock_gettime, and I would not be surprised if macOS does the same. So you could read the register directly from inline asm (or maybe there's an intrinsic?) if you wanted a raw tick count without the arithmetic overhead. – Arriaga
There is also an MMIO interface to the timer; maybe that's what you have in mind. That would indeed need help from the operating system to map it. – Arriaga
