We are trying to build a very latency-sensitive application that needs to send hundreds of messages a second, each message carrying a time field, so we want to look at optimizing gettimeofday.
Our first thought was an rdtsc-based optimization. Any thoughts? Any other pointers?
The required accuracy of the time value returned is milliseconds, and it isn't a big deal if the value is occasionally out of sync with the receiver by 1-2 milliseconds.
We're trying to do better than the 62 nanoseconds gettimeofday takes.
Have you actually benchmarked, and found gettimeofday to be unacceptably slow?
At the rate of 100 messages a second, you have 10ms of CPU time per message. If you have multiple cores, assuming it can be fully parallelized, you can easily increase that by 4-6x - that's 40-60ms per message! The cost of gettimeofday is unlikely to be anywhere near 10ms - I'd suspect it to be more like 1-10 microseconds (on my system, microbenchmarking it gives about 1 microsecond per call - try it for yourself). Your optimization efforts would be better spent elsewhere.
While using the TSC is a reasonable idea, modern Linux already has a userspace TSC-based gettimeofday - where possible, the vdso will pull in an implementation of gettimeofday that applies an offset (read from a shared kernel-user memory segment) to rdtsc's value, thus computing the time of day without entering the kernel. However, some CPU models don't have a TSC synchronized between different cores or different packages, so this can end up being disabled. If you want high-performance timing, you might first want to consider finding a CPU model that does have a synchronized TSC.
That said, if you're willing to sacrifice a significant amount of resolution (your timing will only be accurate to the last tick, meaning it could be off by tens of milliseconds), you could use CLOCK_MONOTONIC_COARSE or CLOCK_REALTIME_COARSE with clock_gettime. This, too, is implemented in the vdso and guaranteed not to call into the kernel (for recent kernels and glibc).
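For illustration, here is a minimal sketch (mine, not part of this answer) that reads both clocks side by side; it assumes Linux with a glibc recent enough to expose CLOCK_REALTIME_COARSE (older glibc may also need -lrt):

#include <cstdio>
#include <ctime>
#include <sys/time.h>

int main()
{
    // Microsecond-resolution wall clock; served from the vdso on most x86 Linux systems.
    timeval tv;
    gettimeofday(&tv, nullptr);

    // Tick-resolution wall clock; never enters the kernel, but only updates once per tick.
    timespec ts;
    clock_gettime(CLOCK_REALTIME_COARSE, &ts);

    printf("gettimeofday:          %ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);
    printf("CLOCK_REALTIME_COARSE: %ld.%09ld\n", (long)ts.tv_sec, (long)ts.tv_nsec);
    return 0;
}

The coarse value will typically trail the gettimeofday value by up to one timer tick (often 1-4 ms, depending on CONFIG_HZ), which is the resolution trade-off described above.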
dmesg | grep TSC says Fast TSC calibration using PIT – Shepard
gettimeofday(). Is it fast enough? – Trench
gettimeofday(), so about 0.06 microseconds per call. – Shepard
gettimeofday() with 7us per call and clock_gettime() with 9us per call. However, as you say, gettimeofday() could give wrong timing up to 10ms, so I think it's reasonable to accept gettimeofday() even if it's 2us slower. – Duff
clock_gettime() is significantly less accurate than gettimeofday()? What is meant by "accurate to the last tick"? The reason this is relevant is that on my system Qt uses clock_gettime() to implement timer events, presumably because it's monotonic and not wall time. Why can't clock_gettime() be as accurate as gettimeofday()? – Blankbook
POSIX Clocks
I wrote a benchmark for POSIX clock sources:
- time (s) => 3 cycles
- ftime (ms) => 54 cycles
- gettimeofday (us) => 42 cycles
- clock_gettime (ns) => 9 cycles (CLOCK_MONOTONIC_COARSE)
- clock_gettime (ns) => 9 cycles (CLOCK_REALTIME_COARSE)
- clock_gettime (ns) => 42 cycles (CLOCK_MONOTONIC)
- clock_gettime (ns) => 42 cycles (CLOCK_REALTIME)
- clock_gettime (ns) => 173 cycles (CLOCK_MONOTONIC_RAW)
- clock_gettime (ns) => 179 cycles (CLOCK_BOOTTIME)
- clock_gettime (ns) => 349 cycles (CLOCK_THREAD_CPUTIME_ID)
- clock_gettime (ns) => 370 cycles (CLOCK_PROCESS_CPUTIME_ID)
- rdtsc (cycles) => 24 cycles
These numbers are from an Intel Core i7-4771 CPU @ 3.50GHz on Linux 4.0. The measurements were taken using the TSC register, running each clock method thousands of times and taking the minimum cost value.
You'll want to test on the machines you intend to run on, though, as how these are implemented varies with hardware and kernel version. The code can be found here. It relies on the TSC register for cycle counting, which is in the same repo (tsc.h).
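The linked code isn't reproduced here, but the measurement approach described above can be sketched roughly as follows (my approximation, not the repo's tsc.h; it assumes an x86 compiler that provides __rdtsc via x86intrin.h, and the reported minimum still includes the overhead of the rdtsc reads themselves):

#include <cstdint>
#include <cstdio>
#include <ctime>
#include <x86intrin.h>   // __rdtsc() on GCC/Clang for x86

// Return the minimum observed cost, in TSC cycles, of one clock_gettime() call.
static uint64_t min_cycles(clockid_t id, int iters)
{
    uint64_t best = ~0ull;
    timespec ts;
    for (int i = 0; i < iters; i++) {
        uint64_t start = __rdtsc();
        clock_gettime(id, &ts);
        uint64_t stop = __rdtsc();
        if (stop - start < best)
            best = stop - start;
    }
    return best;
}

int main()
{
    printf("CLOCK_MONOTONIC_COARSE: %llu cycles\n",
           (unsigned long long)min_cycles(CLOCK_MONOTONIC_COARSE, 100000));
    printf("CLOCK_MONOTONIC:        %llu cycles\n",
           (unsigned long long)min_cycles(CLOCK_MONOTONIC, 100000));
    return 0;
}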
TSC
Accessing the TSC (processor time-stamp counter) is the most accurate and cheapest way to time things. Generally, this is what the kernel is using itself. It's also quite straightforward on modern Intel chips, as the TSC is synchronized across cores and unaffected by frequency scaling. So it provides a simple, global time source. You can see an example of using it here, with a walkthrough of the assembly code here.
The main issue with this (other than portability) is that there doesn't seem to be a good way to go from cycles to nanoseconds. The Intel docs, as far as I can find, state that the TSC runs at a fixed frequency, but that this frequency may differ from the processor's stated frequency. Intel doesn't appear to provide a reliable way to figure out the TSC frequency. The Linux kernel appears to solve this by testing how many TSC cycles occur between two hardware timers (see here).
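One user-space workaround (my assumption, not something taken from the Intel docs or the kernel) is to calibrate the TSC against another clock, for example by counting TSC ticks across a known CLOCK_MONOTONIC interval:

#include <cstdint>
#include <cstdio>
#include <ctime>
#include <x86intrin.h>   // __rdtsc() on GCC/Clang for x86

int main()
{
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t c0 = __rdtsc();

    // Wait roughly 100 ms; a longer interval gives a better estimate.
    timespec delay = {0, 100 * 1000 * 1000};
    nanosleep(&delay, nullptr);

    uint64_t c1 = __rdtsc();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("estimated TSC frequency: %.3f GHz\n", (double)(c1 - c0) / ns);   // cycles per ns = GHz
    return 0;
}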
Memcached
As for why memcached bothers with the cache method (keeping the current time in a variable rather than calling gettimeofday for each request), I'm not sure. It may simply be to make performance more predictable across platforms, or to scale better with multiple cores. It may also not be a worthwhile optimization.
time to take only 3 clock cycles on your Haswell. Or maybe it's reading a shared counter from memory instead of using rdtsc? As you found, rdtsc has a throughput of 1 per 24 cycles on Haswell (agner.org/optimize). Did you check the compiler-generated asm to make sure it didn't optimize away some work? – Sastruga
time(NULL) does in fact jump (via the PLT) to code in the VDSO page which just loads a value from memory: mov rax,QWORD PTR [rip+0xffffffffffffc1fd] # 0x7ffff7ff70a8. The code comes from __vdso_time() in arch/x86/entry/vdso/vclock_gettime.c. If that could inline to a movabs and avoid the PLT + call + branch overhead, we'd have 2-per-clock throughput for reading the time with 1-second resolution :P – Sastruga
clock_gettime(CLOCK_MONOTONIC_RAW) should take the same number of cycles as clock_gettime(CLOCK_MONOTONIC) thanks to [PATCH v7 00/25] Unify vDSOs across more architectures. – Pediculosis
Like bdonlan says, if you're only sending a few hundred messages per second, gettimeofday is going to be fast enough.
However, if you were sending millions of messages per second, it might be different (but you should still measure that it is a bottleneck). In that case, you might want to consider something like this:
- have a global variable, giving the current timestamp in your desired accuracy
- have a dedicated background thread that does nothing except update the timestamp (if timestamp should be updated every T units of time, then have the thread sleep some fraction of T and then update the timestamp; use real-time features if you need to)
- all other threads (or the main process, if you don't use threads otherwise) just read the global variable
The C language does not guarantee that you can atomically read the timestamp value if it is larger than sig_atomic_t. You could use locking to deal with that, but locking is heavy. Instead, you could use a volatile sig_atomic_t-typed variable to index an array of timestamps: the background thread updates the next element in the array and then updates the index. The other threads read the index and then read the array: they might get a slightly out-of-date timestamp (but they get the right one next time), but they do not run into the problem of reading the timestamp at the same time it is being updated and getting some bytes of the old value and some of the new value.
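Here is a minimal sketch of that scheme, using C++11 std::atomic and std::thread instead of volatile sig_atomic_t (the names and the 1 ms update period are my choices, not from this answer); a single lock-free 64-bit atomic also sidesteps the torn-read problem that the index-into-array trick addresses:

#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

// Cached wall-clock time in milliseconds, updated by a dedicated background thread.
static std::atomic<int64_t> g_now_ms{0};

static void clock_updater()
{
    using namespace std::chrono;
    for (;;) {
        int64_t ms = duration_cast<milliseconds>(
                         system_clock::now().time_since_epoch()).count();
        g_now_ms.store(ms, std::memory_order_relaxed);
        std::this_thread::sleep_for(milliseconds(1));
    }
}

int main()
{
    std::thread(clock_updater).detach();

    // Senders just read the cached value: no syscall, no lock.
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
    printf("cached timestamp: %lld ms\n",
           (long long)g_now_ms.load(std::memory_order_relaxed));
    return 0;
}

Compile with -pthread. On x86-64 the 64-bit atomic load is a plain mov, so a reader's worst case is the cache miss mentioned in the comment below.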
But all this is much overkill for just hundreds of messages per second.
gettimeofday(): every write would potentially cause a cache miss on every reader on SMP. – Angulation
Below is a benchmark; I see about 30 ns per call. printTime() is taken from rashad's answer to "How to get current time and date in C++?".
#include <string>
#include <iostream>
#include <sys/time.h>
using namespace std;

// Format and print a time_t as YYYY-MM-DD.HH:MM:SS
void printTime(time_t now)
{
    struct tm tstruct;
    char buf[80];
    tstruct = *localtime(&now);
    strftime(buf, sizeof(buf), "%Y-%m-%d.%X", &tstruct);
    cout << buf << endl;
}

int main()
{
    timeval tv;
    time_t tm;

    // Time 100,000,000 calls to gettimeofday()
    gettimeofday(&tv, NULL);
    printTime((time_t)tv.tv_sec);
    for (int i = 0; i < 100000000; i++)
        gettimeofday(&tv, NULL);
    gettimeofday(&tv, NULL);
    printTime((time_t)tv.tv_sec);

    // Time 100,000,000 calls to time()
    printTime(time(NULL));
    for (int i = 0; i < 100000000; i++)
        tm = time(NULL);
    printTime(time(NULL));

    return 0;
}
3 seconds for 100,000,000 calls, or about 30 ns per call:
2014-03-20.09:23:35
2014-03-20.09:23:38
2014-03-20.09:23:38
2014-03-20.09:23:41
Do you need millisecond precision? If not, you could simply use time() and deal with the Unix timestamp.
time() and gettimeofday(), 60 nanoseconds versus 62 nanoseconds. Not much; we need to do much better. – Shepard
global_unix_ts = time(); sleep 500ms;. The global var is not even protected by a mutex. This should be lightning fast. bdonlan's answer seems to be very elegant and complete too. – Ensure
gettimeofday in userspace with rdtsc where possible – Trench
gettimeofday() is a problem? Which OS are you using? On Linux, IIRC, it was moved to userspace (to the vsyscall page, or the vDSO, don't remember which) to allow it to scale to lots of CPUs (was done by SGI's Christoph Lameter, IIRC). – Angulation
gettimeofday() takes to see if it's fast enough for you? And how fast does it need to be? – Trench
gettimeofday is demonstrably slow. Currently adding real-time ETA routines to a problem I have to brute-force with billion-total-iteration nested for loops... on my i3 machine, just doing some counter incrementation with occasional status calls I get 78M ops/sec; with one gettimeofday that becomes 28M, with two (for start= and end=) it becomes 16M. – Jillane