faster equivalent of gettimeofday

We are trying to build a very latency-sensitive application that needs to send hundreds of messages a second, each message carrying a time field, and we want to consider optimizing gettimeofday. Our first thought was an rdtsc-based optimization. Any thoughts? Any other pointers? The required accuracy of the time value returned is in milliseconds, but it isn't a big deal if the value is occasionally out of sync with the receiver by 1-2 milliseconds. We are trying to do better than the 62 nanoseconds gettimeofday takes.

Shepard answered 27/6, 2011 at 21:8 Comment(7)
To what granularity does the time need to be accurate?Isle
Keep in mind the timestamp counter may not be synchronized across CPUs, depending on CPU model. Also, modern Linux will implement gettimeofday in userspace with rdtsc where possibleTrench
Are you sure gettimeofday() is a problem? Which OS are you using? On Linux, IIRC, it was moved to userspace (to the vsyscall page, or the vDSO, don't remember which) to allow it to scale to lots of CPUs (was done by SGI's Christoph Lameter, IIRC).Angulation
vsyscall had a gettimeofday, but vsyscall has been obsoleted, and its gettimeofday is now just a stub that calls into the kernel.Trench
@Trench is there any way to make sure of this?Shepard
@Humble, why not just measure the amount of time gettimeofday() takes to see if it's fast enough for you? And how fast does it need to be?Trench
I wanted to add a datapoint that YES, gettimeofday is demonstrably slow. Currently adding real-time ETA routines to a problem I have to brute-force with billion-total-iteration nested for loops... on my i3 machine, just doing some counter incrementation with occasional status calls I get 78M ops/sec; with one gettimeofday that becomes 28M, with two (for start= and end=) it becomes 16M.Jillane

Have you actually benchmarked, and found gettimeofday to be unacceptably slow?

At the rate of 100 messages a second, you have 10ms of CPU time per message. If you have multiple cores, assuming it can be fully parallelized, you can easily increase that by 4-6x - that's 40-60ms per message! The cost of gettimeofday is unlikely to be anywhere near 10ms - I'd suspect it to be more like 1-10 microseconds (on my system, microbenchmarking it gives about 1 microsecond per call - try it for yourself). Your optimization efforts would be better spent elsewhere.
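
For reference, a minimal microbenchmark along those lines might look like the sketch below (not the exact program referred to above; it averages over a large loop, so it measures throughput rather than a single call):

#include <cstdio>
#include <sys/time.h>

int main()
{
    const int iterations = 1000000;
    timeval start, end, tmp;

    gettimeofday(&start, NULL);
    for (int i = 0; i < iterations; i++)
        gettimeofday(&tmp, NULL);       // the call being measured
    gettimeofday(&end, NULL);

    double elapsed_us = (end.tv_sec - start.tv_sec) * 1e6
                      + (end.tv_usec - start.tv_usec);
    printf("~%.0f ns per gettimeofday() call\n",
           elapsed_us * 1000.0 / iterations);
    return 0;
}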

While using the TSC is a reasonable idea, modern Linux already has a userspace TSC-based gettimeofday - where possible, the vdso will pull in an implementation of gettimeofday that applies an offset (read from a shared kernel-user memory segment) to rdtsc's value, thus computing the time of day without entering the kernel. However, some CPU models don't have a TSC synchronized between different cores or different packages, and so this can end up being disabled. If you want high performance timing, you might first want to consider finding a CPU model that does have a synchronized TSC.

That said, if you're willing to sacrifice a significant amount of resolution (your timing will only be accurate to the last tick, meaning it could be off by tens of milliseconds), you could use CLOCK_MONOTONIC_COARSE or CLOCK_REALTIME_COARSE with clock_gettime. This is implemented via the vDSO as well, and is guaranteed not to call into the kernel (for recent kernels and glibc).
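
For illustration, here is a sketch of reading a coarse clock and checking its resolution with clock_getres() (CLOCK_REALTIME_COARSE is Linux-specific, and older glibc needs -lrt for clock_gettime):

#include <cstdio>
#include <ctime>

int main()
{
    timespec now, res;

    // Coarse clocks read a tick-updated value via the vDSO, so they are cheap,
    // but their resolution is the kernel tick (often 1-10 ms).
    clock_gettime(CLOCK_REALTIME_COARSE, &now);
    clock_getres(CLOCK_REALTIME_COARSE, &res);

    printf("time: %ld.%09ld, resolution: %ld ns\n",
           (long)now.tv_sec, (long)now.tv_nsec, (long)res.tv_nsec);
    return 0;
}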

Trench answered 27/6, 2011 at 21:18 Comment(11)
Each process is single threaded. The server will typically have 10-20 such processes running.Shepard
"CPU model that does have a synchronized TSC", have a Xeon 5680, will research on it's handling of thisShepard
@Humble, check for "Marking TSC unstable" in your dmesg. If it's there, you're not using TSC. But always, always benchmark before you try to optimize. Not only do you not know if it's fast enough to start, if you don't benchmark, you'll never know if you make an improvement...Trench
@Trench dmesg | grep TSC says Fast TSC calibration using PITShepard
Chances are that, if that's all you see, you're using the TSC. But still, benchmark gettimeofday(). Is it fast enough?Trench
Getting about 178 cycles for gettimeofday(), so about 0.06 microseconds per call.Shepard
@Humble, well, that's a lot faster than mine (my laptop, it turns out, has an unstable TSC and calls into the kernel to use the HPET) - Hopefully 178 cycles is fast enough for you? :)Trench
We had benchmarks showing gettimeofday() with 7us per call and clock_gettime() with 9us per call. However, as you say, gettimeofday() could give wrong timing up to 10ms, I think it's reasonable to accept gettimeofday() even it's 2us slowerDuff
@Duff you've got a typo I thinkJenkins
@Trench your "check it for yourself" program does not compile on clang (so I can't run it on OS X)Jenkins
@Trench can you expand on why clock_gettime() is significantly less accurate than gettimeofday()? What is meant by "accurate to the last tick"? The reason this is relevant is because on my system Qt uses clock_gettime() to implement timer events, presumably because it's monotonic and not wall time. Why can't clock_gettime() be as accurate as gettimeofday()?Blankbook

POSIX Clocks

I wrote a benchmark for POSIX clock sources:

  • time (s) => 3 cycles
  • ftime (ms) => 54 cycles
  • gettimeofday (us) => 42 cycles
  • clock_gettime (ns) => 9 cycles (CLOCK_MONOTONIC_COARSE)
  • clock_gettime (ns) => 9 cycles (CLOCK_REALTIME_COARSE)
  • clock_gettime (ns) => 42 cycles (CLOCK_MONOTONIC)
  • clock_gettime (ns) => 42 cycles (CLOCK_REALTIME)
  • clock_gettime (ns) => 173 cycles (CLOCK_MONOTONIC_RAW)
  • clock_gettime (ns) => 179 cycles (CLOCK_BOOTTIME)
  • clock_gettime (ns) => 349 cycles (CLOCK_THREAD_CPUTIME_ID)
  • clock_gettime (ns) => 370 cycles (CLOCK_PROCESS_CPUTIME_ID)
  • rdtsc (cycles) => 24 cycles

These numbers are from an Intel Core i7-4771 CPU @ 3.50GHz on Linux 4.0. These measurements were taken using the TSC register and running each clock method thousands of times and taking the minimum cost value.

You'll want to test on the machines you intend to run on, though, as the implementations vary with hardware and kernel version. The code can be found here. It relies on the TSC register for cycle counting, which is in the same repo (tsc.h).
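
The repo itself isn't reproduced here, but the measurement approach described above is roughly the following sketch, substituting the compiler's __rdtsc() intrinsic from <x86intrin.h> for the repo's tsc.h (x86-only; the reported minimum includes the overhead of rdtsc itself):

#include <cstdint>
#include <cstdio>
#include <ctime>
#include <x86intrin.h>   // __rdtsc()

// Minimum observed cost, in TSC cycles, of one clock_gettime() call.
static uint64_t min_cycles(clockid_t clk, int rounds)
{
    uint64_t best = UINT64_MAX;
    timespec ts;
    for (int i = 0; i < rounds; i++) {
        uint64_t start = __rdtsc();
        clock_gettime(clk, &ts);
        uint64_t cost = __rdtsc() - start;
        if (cost < best)
            best = cost;
    }
    return best;
}

int main()
{
    printf("CLOCK_REALTIME:        %llu cycles\n",
           (unsigned long long)min_cycles(CLOCK_REALTIME, 100000));
    printf("CLOCK_REALTIME_COARSE: %llu cycles\n",
           (unsigned long long)min_cycles(CLOCK_REALTIME_COARSE, 100000));
    return 0;
}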

TSC

Accessing the TSC (the processor's time-stamp counter) is the most accurate and cheapest way to time things. Generally, this is what the kernel itself uses. It's also quite straightforward on modern Intel chips, as the TSC is synchronized across cores and unaffected by frequency scaling, so it provides a simple, global time source. You can see an example of using it here, with a walkthrough of the assembly code here.

The main issue with this (other than portability) is that there doesn't seem to be a good way to convert from cycles to nanoseconds. The Intel docs, as far as I can find, state that the TSC runs at a fixed frequency, but that this frequency may differ from the processor's stated frequency. Intel doesn't appear to provide a reliable way to figure out the TSC frequency. The Linux kernel appears to solve this by measuring how many TSC cycles occur between two hardware timers (see here).
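
A rough user-space approximation of that calibration idea (not the kernel's actual method, just a sketch) is to count TSC ticks across a known CLOCK_MONOTONIC interval:

#include <cstdint>
#include <cstdio>
#include <ctime>
#include <x86intrin.h>   // __rdtsc()

int main()
{
    timespec t0, t1, delay = {0, 100 * 1000 * 1000};   // ~100 ms

    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t c0 = __rdtsc();

    nanosleep(&delay, NULL);   // longer intervals give a better estimate

    uint64_t c1 = __rdtsc();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("TSC frequency ~ %.3f GHz\n", (c1 - c0) / ns);   // cycles per ns == GHz
    return 0;
}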

Memcached

Memcached goes to the trouble of implementing the cache method. That may simply be to make performance more predictable across platforms, or to scale better with multiple cores. It may also not be a worthwhile optimization.

Londoner answered 27/10, 2012 at 3:14 Comment(12)
On your GitHub link you have the same results, but in nanoseconds, differing from what you write here by a factor of 1000.Emancipated
sorry, fixed time notation.Londoner
Your cached_clock method is just hiding the time consumed by the measuring in another thread. Plus it's far less accurate than the other methods. See, if your main loop goes well and current_time==2, you're adding 2 msec each time you loop.Argeliaargent
@Argeliaargent Sure, it is hiding the cost in another thread, saying 0ns is not accurate at all. The 'cost' of it though really depends on your application. Point of it being that you pay the cost once per cycle of the granularity you care about, not every time that you need the time. And yes, accuracy isn't great but once again depends a lot on use case. For something like memcached it's perfectly fine. For a paxos implementation it's probably terrible.Londoner
How can you even benchmark with nanosecond accuracy? Is there a way to ensure that your program is the only one executing and that no context switches are allowed?Cockspur
@Cockspur you run a LOT of rounds - enough where the context switches factor out.Joh
@Joh My comment was actually a rhetorical question, hinting that none of this makes any sense unless the program is running on a system with real-time performance.Cockspur
I think taking the minimum value is nonsense. The caches and branch predictors will be warmed up by the repeated calls. However, in real use these calls will likely be made rarely, and in those cases the caches will most likely be cold (with respect to reading the time).Coray
I'm not sure it's plausible for time to take only 3 clock cycles on your Haswell. Or maybe it's reading a shared counter from memory instead of using rdtsc? As you found rdtsc has a throughput of 1 per 24 cycles on Haswell. (agner.org/optimize). Did you check the compiler-generated asm to make sure it didn't optimize away some work?Sastruga
I just checked, and time(NULL) does in fact jump (via the PLT) to code in the VDSO page which just loads a value from memory. mov rax,QWORD PTR [rip+0xffffffffffffc1fd] # 0x7ffff7ff70a8. The code comes from __vdso_time() in arch/x86/entry/vdso/vclock_gettime.c. If that could inline to a movabs and avoid the PLT + call + branch overhead, we'd have 2-per-clock throughput for reading the time with 1-second resolution :PSastruga
@Peter - clock_gettime(CLOCK_MONOTONIC_COARSE) is also "faster than rdtsc" and also reads from a memory location in the VDSO. It does a bit more math though so it ends up quite a bit more expensive than time(), but is sometimes much more useful since it has a higher resolution. It is a shame it isn't even faster, although you can always "roll your own" with a periodic signal (or thread that sleeps) that updates a shared memory location - then you really can have your 1 uop reads of a high(ish) resolution clock.Habituate
On Linux 5.3 and above kernels calling clock_gettime(CLOCK_MONOTONIC_RAW) should take the same number of cycles clock_gettime(CLOCK_MONOTONIC) thanks to [PATCH v7 00/25] Unify vDSOs across more architectures.Pediculosis

Like bdonlan says, if you're only sending a few hundred messages per second, gettimeofday is going to be fast enough.

However, if you were sending millions of messages per second, it might be different (but you should still measure whether it actually is a bottleneck). In that case, you might want to consider something like this:

  • have a global variable, giving the current timestamp in your desired accuracy
  • have a dedicated background thread that does nothing except update the timestamp (if timestamp should be updated every T units of time, then have the thread sleep some fraction of T and then update the timestamp; use real-time features if you need to)
  • all other threads (or the main process, if you don't use threads otherwise) just read the global variable

The C language does not guarantee that you can atomically read the timestamp value if it is larger than sig_atomic_t. You could use locking to deal with that, but locking is heavy. Instead, you could use a volatile sig_atomic_t typed variable to index an array of timestamps: the background thread updates the next element in the array, and then updates the index. The other threads read the index, and then read the array: they might get a slightly out-of-date timestamp (but they get the right one next time), but they do not run into the problem where they read the timestamp at the same time it is being updated and get some bytes of the old value and some of the new value.
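
A rough sketch of that scheme follows (SLOTS, g_stamps, g_index and current_time() are names made up for illustration; compile with -pthread; a modern implementation would more likely use a single 64-bit std::atomic timestamp with release/acquire ordering):

#include <csignal>     // sig_atomic_t
#include <cstdio>
#include <ctime>
#include <thread>

// Small ring of timestamps; the writer fills a slot, then publishes it via the index.
static const int SLOTS = 4;
static volatile time_t g_stamps[SLOTS];
static volatile sig_atomic_t g_index = 0;

// Background updater: refresh the timestamp roughly every millisecond.
static void updater()
{
    for (;;) {
        int next = (g_index + 1) % SLOTS;
        g_stamps[next] = time(NULL);   // or a finer-grained clock, as needed
        g_index = next;                // publish only after the slot is written
        timespec delay = {0, 1000 * 1000};
        nanosleep(&delay, NULL);
    }
}

// Readers only ever see a fully written slot (possibly slightly stale).
static time_t current_time()
{
    return g_stamps[g_index];
}

int main()
{
    std::thread(updater).detach();
    timespec warmup = {0, 5 * 1000 * 1000};
    nanosleep(&warmup, NULL);          // give the updater a chance to run
    printf("timestamp: %ld\n", (long)current_time());
    return 0;
}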

But all this is much overkill for just hundreds of messages per second.

Beautify answered 27/6, 2011 at 21:43 Comment(4)
"have a dedicated background thread that does nothing except update the timestamp (if timestamp should be updated every T units of time" <-- this is exactly what CLOCK_*_COARSE does, except the dedicated thread is actually an interrupt handler and is system-wide, and the kernel folks have already dealt with the read tearing and other issues for you :)Trench
I'm not sure that would be faster than Linux's gettimeofday(): every write would potentially cause a cache miss on every reader on SMP.Angulation
Come to think of it, are vvars cpu-local on Linux? If so, that's be another major advantage of CLOCK_*_COARSE... Edit: Looks like not (lxr.linux.no/linux+v2.6.39/arch/x86/kernel/vsyscall_64.c#L76), but invalidating a cache line or two is better than interrupting all CPUs with a local timer interrupt or IPI I supposeTrench
Lars, it is not a question of how many times a second; the application wants to construct a message and send it out to the receiver as soon as possible, and is competing with other senders. This is a trading application, so in every message to the receiver, no matter how low or high the frequency, we would like to shave off microseconds.Shepard

Below is a benchmark; I see about 30 ns per call. printTime() is taken from rashad's answer to "How to get current time and date in C++?".

#include <string>
#include <iostream>
#include <ctime>        // time(), localtime(), strftime()
#include <sys/time.h>   // gettimeofday()
using namespace std;

// Print a time_t as "YYYY-MM-DD.HH:MM:SS"
void printTime(time_t now)
{
    struct tm  tstruct;
    char       buf[80];
    tstruct = *localtime(&now);
    strftime(buf, sizeof(buf), "%Y-%m-%d.%X", &tstruct);
    cout << buf << endl;
}

int main()
{
    timeval tv;
    time_t tm;

    // 100 million gettimeofday() calls
    gettimeofday(&tv, NULL);
    printTime((time_t)tv.tv_sec);
    for (int i = 0; i < 100000000; i++)
        gettimeofday(&tv, NULL);
    gettimeofday(&tv, NULL);
    printTime((time_t)tv.tv_sec);

    // 100 million time() calls
    printTime(time(NULL));
    for (int i = 0; i < 100000000; i++)
        tm = time(NULL);
    printTime(time(NULL));

    return 0;
}

3 seconds for 100,000,000 calls, or about 30 ns per call:

2014-03-20.09:23:35
2014-03-20.09:23:38
2014-03-20.09:23:38
2014-03-20.09:23:41
Helpless answered 20/3, 2014 at 13:30 Comment(0)

Do you need millisecond precision? If not, you could simply use time() and deal with the Unix timestamp.

Ensure answered 27/6, 2011 at 21:10 Comment(2)
Comparing time() and gettimeofday(): 60 nanoseconds versus 62 nanoseconds. Not much of a gain; we need to do much better.Shepard
Maybe have a thread with: global_unix_ts = time(); sleep 500ms;. The global var need not even be protected by a mutex. This should be lightning fast. bdonlan's answer seems very elegant and complete too.Ensure
