faster equivalent of gettimeofday

We are trying to build a very latency-sensitive application that needs to send hundreds of messages a second, each message carrying a time field, and we want to consider optimizing gettimeofday. Our first thought was an rdtsc-based optimization. Any thoughts? Any other pointers? The required accuracy of the time value returned is in milliseconds, but it isn't a big deal if the value is occasionally out of sync with the receiver by 1-2 milliseconds. We are trying to do better than the 62 nanoseconds gettimeofday takes.

Shepard answered 27/6, 2011 at 21:8 Comment(7)
To what granularity does the time need to be accurate?Isle
Keep in mind the timestamp counter may not be synchronized across CPUs, depending on CPU model. Also, modern Linux will implement gettimeofday in userspace with rdtsc where possibleTrench
Are you sure gettimeofday() is a problem? Which OS are you using? On Linux, IIRC, it was moved to userspace (to the vsyscall page, or the vDSO, don't remember which) to allow it to scale to lots of CPUs (was done by SGI's Christoph Lameter, IIRC).Angulation
vsyscall had a gettimeofday, but vsyscall has been obsoleted, and its gettimeofday is now just a stub that calls into the kernel.Trench
@Trench is there any way to make sure of this?Shepard
@Humble, why not just measure the amount of time gettimeofday() takes to see if it's fast enough for you? And how fast does it need to be?Trench
I wanted to add a datapoint that YES, gettimeofday is demonstrably slow. Currently adding real-time ETA routines to a problem I have to brute-force with billion-total-iteration nested for loops... on my i3 machine, just doing some counter incrementation with occasional status calls I get 78M ops/sec; with one gettimeofday that becomes 28M, with two (for start= and end=) it becomes 16M.Jillane

Have you actually benchmarked, and found gettimeofday to be unacceptably slow?

At the rate of 100 messages a second, you have 10ms of CPU time per message. If you have multiple cores, assuming it can be fully parallelized, you can easily increase that by 4-6x - that's 40-60ms per message! The cost of gettimeofday is unlikely to be anywhere near 10ms - I'd suspect it to be more like 1-10 microseconds (on my system, microbenchmarking it gives about 1 microsecond per call - try it for yourself). Your optimization efforts would be better spent elsewhere.
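
For reference, a minimal microbenchmark along those lines might look like the sketch below (not the exact program referred to above; it averages over a large loop, so it measures throughput rather than a single call):

#include <cstdio>
#include <sys/time.h>

int main()
{
    const int iterations = 1000000;
    timeval start, end, tmp;

    gettimeofday(&start, NULL);
    for (int i = 0; i < iterations; i++)
        gettimeofday(&tmp, NULL);       // the call being measured
    gettimeofday(&end, NULL);

    double elapsed_us = (end.tv_sec - start.tv_sec) * 1e6
                      + (end.tv_usec - start.tv_usec);
    printf("~%.0f ns per gettimeofday() call\n",
           elapsed_us * 1000.0 / iterations);
    return 0;
}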

While using the TSC is a reasonable idea, modern Linux already has a userspace TSC-based gettimeofday - where possible, the vdso will pull in an implementation of gettimeofday that applies an offset (read from a shared kernel-user memory segment) to rdtsc's value, thus computing the time of day without entering the kernel. However, some CPU models don't have a TSC synchronized between different cores or different packages, and so this can end up being disabled. If you want high performance timing, you might first want to consider finding a CPU model that does have a synchronized TSC.

That said, if you're willing to sacrifice a significant amount of resolution (your timing will only be accurate to the last tick, meaning it could be off by tens of milliseconds), you could use CLOCK_MONOTONIC_COARSE or CLOCK_REALTIME_COARSE with clock_gettime. This is implemented via the vDSO as well, and is guaranteed not to call into the kernel (for recent kernels and glibc).
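
For illustration, here is a sketch of reading a coarse clock and checking its resolution with clock_getres() (CLOCK_REALTIME_COARSE is Linux-specific, and older glibc needs -lrt for clock_gettime):

#include <cstdio>
#include <ctime>

int main()
{
    timespec now, res;

    // Coarse clocks read a tick-updated value via the vDSO, so they are cheap,
    // but their resolution is the kernel tick (often 1-10 ms).
    clock_gettime(CLOCK_REALTIME_COARSE, &now);
    clock_getres(CLOCK_REALTIME_COARSE, &res);

    printf("time: %ld.%09ld, resolution: %ld ns\n",
           (long)now.tv_sec, (long)now.tv_nsec, (long)res.tv_nsec);
    return 0;
}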

Trench answered 27/6, 2011 at 21:18 Comment(11)
Each process is single threaded. The server will typically have 10-20 such processes running.Shepard
"CPU model that does have a synchronized TSC", have a Xeon 5680, will research on it's handling of thisShepard
@Humble, check for "Marking TSC unstable" in your dmesg. If it's there, you're not using TSC. But always, always benchmark before you try to optimize. Not only do you not know if it's fast enough to start, if you don't benchmark, you'll never know if you make an improvement...Trench
@Trench dmesg | grep TSC says Fast TSC calibration using PITShepard
Chances are that, if that's all you see, you're using the TSC. But still, benchmark gettimeofday(). Is it fast enough?Trench
Getting about 178 cycles for gettimeofday(), so about 0.06 microseconds per call.Shepard
@Humble, well, that's a lot faster than mine (my laptop, it turns out, has an unstable TSC and calls into the kernel to use the HPET) - Hopefully 178 cycles is fast enough for you? :)Trench
We had benchmarks showing gettimeofday() with 7us per call and clock_gettime() with 9us per call. However, as you say, gettimeofday() could give wrong timing up to 10ms, I think it's reasonable to accept gettimeofday() even it's 2us slowerDuff
@Duff you've got a typo I thinkJenkins
@Trench your "check it for yourself" program does not compile on clang (so I can't run it on OS X)Jenkins
@Trench can you expand on why clock_gettime() is significantly less accurate than gettimeofday()? What is meant by "accurate to the last tick"? The reason this is relevant is because on my system Qt uses clock_gettime() to implement timer events, presumably because it's monotonic and not wall time. Why can't clock_gettime() be as accurate as gettimeofday()?Blankbook

POSIX Clocks

I wrote a benchmark for POSIX clock sources:

  • time (s) => 3 cycles
  • ftime (ms) => 54 cycles
  • gettimeofday (us) => 42 cycles
  • clock_gettime (ns) => 9 cycles (CLOCK_MONOTONIC_COARSE)
  • clock_gettime (ns) => 9 cycles (CLOCK_REALTIME_COARSE)
  • clock_gettime (ns) => 42 cycles (CLOCK_MONOTONIC)
  • clock_gettime (ns) => 42 cycles (CLOCK_REALTIME)
  • clock_gettime (ns) => 173 cycles (CLOCK_MONOTONIC_RAW)
  • clock_gettime (ns) => 179 cycles (CLOCK_BOOTTIME)
  • clock_gettime (ns) => 349 cycles (CLOCK_THREAD_CPUTIME_ID)
  • clock_gettime (ns) => 370 cycles (CLOCK_PROCESS_CPUTIME_ID)
  • rdtsc (cycles) => 24 cycles

These numbers are from an Intel Core i7-4771 CPU @ 3.50GHz on Linux 4.0. These measurements were taken using the TSC register and running each clock method thousands of times and taking the minimum cost value.

You'll want to test on the machines you intend to run on, though, as the implementations vary with hardware and kernel version. The code can be found here. It relies on the TSC register for cycle counting, which is in the same repo (tsc.h).
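
The repo itself isn't reproduced here, but the measurement approach described above is roughly the following sketch, substituting the compiler's __rdtsc() intrinsic from <x86intrin.h> for the repo's tsc.h (x86-only; the reported minimum includes the overhead of rdtsc itself):

#include <cstdint>
#include <cstdio>
#include <ctime>
#include <x86intrin.h>   // __rdtsc()

// Minimum observed cost, in TSC cycles, of one clock_gettime() call.
static uint64_t min_cycles(clockid_t clk, int rounds)
{
    uint64_t best = UINT64_MAX;
    timespec ts;
    for (int i = 0; i < rounds; i++) {
        uint64_t start = __rdtsc();
        clock_gettime(clk, &ts);
        uint64_t cost = __rdtsc() - start;
        if (cost < best)
            best = cost;
    }
    return best;
}

int main()
{
    printf("CLOCK_REALTIME:        %llu cycles\n",
           (unsigned long long)min_cycles(CLOCK_REALTIME, 100000));
    printf("CLOCK_REALTIME_COARSE: %llu cycles\n",
           (unsigned long long)min_cycles(CLOCK_REALTIME_COARSE, 100000));
    return 0;
}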

TSC

Accessing the TSC (the processor's time-stamp counter) is the most accurate and cheapest way to time things. Generally, this is what the kernel itself uses. It's also quite straightforward on modern Intel chips, as the TSC is synchronized across cores and unaffected by frequency scaling, so it provides a simple, global time source. You can see an example of using it here, with a walkthrough of the assembly code here.

The main issue with this (other than portability) is that there doesn't seem to be a good way to convert from cycles to nanoseconds. The Intel docs, as far as I can find, state that the TSC runs at a fixed frequency, but that this frequency may differ from the processor's stated frequency. Intel doesn't appear to provide a reliable way to figure out the TSC frequency. The Linux kernel appears to solve this by measuring how many TSC cycles occur between two hardware timers (see here).
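
A rough user-space approximation of that calibration idea (not the kernel's actual method, just a sketch) is to count TSC ticks across a known CLOCK_MONOTONIC interval:

#include <cstdint>
#include <cstdio>
#include <ctime>
#include <x86intrin.h>   // __rdtsc()

int main()
{
    timespec t0, t1, delay = {0, 100 * 1000 * 1000};   // ~100 ms

    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t c0 = __rdtsc();

    nanosleep(&delay, NULL);   // longer intervals give a better estimate

    uint64_t c1 = __rdtsc();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("TSC frequency ~ %.3f GHz\n", (c1 - c0) / ns);   // cycles per ns == GHz
    return 0;
}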

Memcached

Memcached goes to the trouble of implementing the cache method. That may simply be to make performance more predictable across platforms, or to scale better with multiple cores. It may also not be a worthwhile optimization.

Londoner answered 27/10, 2012 at 3:14 Comment(12)
On your GitHub link you have the same results, but in nanoseconds, differing from what you write here by a factor of 1000.Emancipated
sorry, fixed time notation.Londoner
Your cached_clock method is just hiding the time consumed by the measuring in another thread. Plus it's far less accurate than the other methods. See, if your main loop goes well and current_time==2, you're adding 2 msec each time you loop.Argeliaargent
@Argeliaargent Sure, it is hiding the cost in another thread, saying 0ns is not accurate at all. The 'cost' of it though really depends on your application. Point of it being that you pay the cost once per cycle of the granularity you care about, not every time that you need the time. And yes, accuracy isn't great but once again depends a lot on use case. For something like memcached it's perfectly fine. For a paxos implementation it's probably terrible.Londoner
How can you even benchmark with nanosecond accuracy? Is there a way to ensure that your program is the only one executing and that no context switches are allowed?Cockspur
@Cockspur you run a LOT of rounds - enough where the context switches factor out.Joh
@Joh My comment was actually a rhetorical question, hinting that none of this makes any sense unless the program is running on a system with real-time performance.Cockspur
I think taking the minimum value is nonsense. The caches and branch predictors will be warmed up by the repeated calls. However, in real use these calls will likely be made rarely, and in those cases the caches will most likely be cold (with respect to reading the time).Coray
I'm not sure it's plausible for time to take only 3 clock cycles on your Haswell. Or maybe it's reading a shared counter from memory instead of using rdtsc? As you found rdtsc has a throughput of 1 per 24 cycles on Haswell. (agner.org/optimize). Did you check the compiler-generated asm to make sure it didn't optimize away some work?Sastruga
I just checked, and time(NULL) does in fact jump (via the PLT) to code in the VDSO page which just loads a value from memory. mov rax,QWORD PTR [rip+0xffffffffffffc1fd] # 0x7ffff7ff70a8. The code comes from __vdso_time() in arch/x86/entry/vdso/vclock_gettime.c. If that could inline to a movabs and avoid the PLT + call + branch overhead, we'd have 2-per-clock throughput for reading the time with 1-second resolution :PSastruga
@Peter - clock_gettime(CLOCK_MONOTONIC_COARSE) is also "faster than rdtsc" and also reads from a memory location in the VDSO. It does a bit more math though so it ends up quite a bit more expensive than time(), but is sometimes much more useful since it has a higher resolution. It is a shame it isn't even faster, although you can always "roll your own" with a periodic signal (or thread that sleeps) that updates a shared memory location - then you really can have your 1 uop reads of a high(ish) resolution clock.Habituate
On Linux 5.3 and above kernels calling clock_gettime(CLOCK_MONOTONIC_RAW) should take the same number of cycles clock_gettime(CLOCK_MONOTONIC) thanks to [PATCH v7 00/25] Unify vDSOs across more architectures.Pediculosis

Like bdonlan says, if you're only sending a few hundred messages per second, gettimeofday is going to be fast enough.

However, if you were sending millions of messages per second, it might be different (but you should still measure whether it actually is a bottleneck). In that case, you might want to consider something like this:

  • have a global variable, giving the current timestamp in your desired accuracy
  • have a dedicated background thread that does nothing except update the timestamp (if timestamp should be updated every T units of time, then have the thread sleep some fraction of T and then update the timestamp; use real-time features if you need to)
  • all other threads (or the main process, if you don't use threads otherwise) just read the global variable

The C language does not guarantee that you can atomically read the timestamp value if it is larger than sig_atomic_t. You could use locking to deal with that, but locking is heavy. Instead, you could use a volatile sig_atomic_t typed variable to index an array of timestamps: the background thread updates the next element in the array, and then updates the index. The other threads read the index, and then read the array: they might get a slightly out-of-date timestamp (but they get the right one next time), but they do not run into the problem where they read the timestamp at the same time it is being updated and get some bytes of the old value and some of the new value.
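
A rough sketch of that scheme follows (SLOTS, g_stamps, g_index and current_time() are names made up for illustration; compile with -pthread; a modern implementation would more likely use a single 64-bit std::atomic timestamp with release/acquire ordering):

#include <csignal>     // sig_atomic_t
#include <cstdio>
#include <ctime>
#include <thread>

// Small ring of timestamps; the writer fills a slot, then publishes it via the index.
static const int SLOTS = 4;
static volatile time_t g_stamps[SLOTS];
static volatile sig_atomic_t g_index = 0;

// Background updater: refresh the timestamp roughly every millisecond.
static void updater()
{
    for (;;) {
        int next = (g_index + 1) % SLOTS;
        g_stamps[next] = time(NULL);   // or a finer-grained clock, as needed
        g_index = next;                // publish only after the slot is written
        timespec delay = {0, 1000 * 1000};
        nanosleep(&delay, NULL);
    }
}

// Readers only ever see a fully written slot (possibly slightly stale).
static time_t current_time()
{
    return g_stamps[g_index];
}

int main()
{
    std::thread(updater).detach();
    timespec warmup = {0, 5 * 1000 * 1000};
    nanosleep(&warmup, NULL);          // give the updater a chance to run
    printf("timestamp: %ld\n", (long)current_time());
    return 0;
}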

But all this is much overkill for just hundreds of messages per second.

Beautify answered 27/6, 2011 at 21:43 Comment(4)
"have a dedicated background thread that does nothing except update the timestamp (if timestamp should be updated every T units of time" <-- this is exactly what CLOCK_*_COARSE does, except the dedicated thread is actually an interrupt handler and is system-wide, and the kernel folks have already dealt with the read tearing and other issues for you :)Trench
I'm not sure that would be faster than Linux's gettimeofday(): every write would potentially cause a cache miss on every reader on SMP.Angulation
Come to think of it, are vvars cpu-local on Linux? If so, that's be another major advantage of CLOCK_*_COARSE... Edit: Looks like not (lxr.linux.no/linux+v2.6.39/arch/x86/kernel/vsyscall_64.c#L76), but invalidating a cache line or two is better than interrupting all CPUs with a local timer interrupt or IPI I supposeTrench
Lars, it is not a question of how many times a second; the application wants to construct a message and send it out to the receiver as soon as possible, and is competing with other senders. This is a trading application, so in every message to the receiver, no matter how low or high the frequency, we would like to shave off microseconds.Shepard

Below is a benchmark; I see about 30 ns per call. printTime() is taken from rashad's answer to "How to get current time and date in C++?".

#include <string>
#include <iostream>
#include <ctime>        // time(), localtime(), strftime()
#include <sys/time.h>   // gettimeofday()
using namespace std;

// Print a time_t as "YYYY-MM-DD.HH:MM:SS"
void printTime(time_t now)
{
    struct tm  tstruct;
    char       buf[80];
    tstruct = *localtime(&now);
    strftime(buf, sizeof(buf), "%Y-%m-%d.%X", &tstruct);
    cout << buf << endl;
}

int main()
{
    timeval tv;
    time_t tm;

    // 100 million gettimeofday() calls
    gettimeofday(&tv, NULL);
    printTime((time_t)tv.tv_sec);
    for (int i = 0; i < 100000000; i++)
        gettimeofday(&tv, NULL);
    gettimeofday(&tv, NULL);
    printTime((time_t)tv.tv_sec);

    // 100 million time() calls
    printTime(time(NULL));
    for (int i = 0; i < 100000000; i++)
        tm = time(NULL);
    printTime(time(NULL));

    return 0;
}

3 seconds for 100,000,000 calls, or about 30 ns per call:

2014-03-20.09:23:35
2014-03-20.09:23:38
2014-03-20.09:23:38
2014-03-20.09:23:41
Helpless answered 20/3, 2014 at 13:30 Comment(0)

Do you need millisecond precision? If not, you could simply use time() and deal with the Unix timestamp.

Ensure answered 27/6, 2011 at 21:10 Comment(2)
Comparing time() and gettimeofday(): 60 nanoseconds versus 62 nanoseconds. Not much of a gain; we need to do much better.Shepard
Maybe have a thread with: global_unix_ts = time(); sleep 500ms;. The global var need not even be protected by a mutex. This should be lightning fast. bdonlan's answer seems very elegant and complete too.Ensure
