Scheduling events at microsecond granularity in POSIX

I'm trying to determine the granularity at which I can accurately schedule tasks in C/C++. At the moment, I can reliably schedule tasks to occur every 5 microseconds, but I'm trying to see if I can lower this further.

Any advice on how to achieve this / if it is possible would be greatly appreciated.

Since I know timer granularity is often OS dependent: I am currently running on Linux, but would switch to Windows if its timing granularity were better (although I don't believe it is, based on what I've found regarding QueryPerformanceCounter).

I execute all measurements on bare metal (no VM). /proc/timer_list confirms nanosecond timer resolution for my CPU (but I know that doesn't translate to nanosecond alarm resolution).

Current

My current code can be found as a Gist here

At the moment, I'm able to execute a request every 5 microseconds (5000 nanoseconds) with less than 1% late arrivals. When late arrivals do occur, they are typically only one cycle (5000 nanoseconds) behind.

I'm doing three things at the moment:

Setting the process to real-time priority (as pointed out by @Spudd86 here)

#include <sched.h>   // sched_setscheduler, SCHED_FIFO, struct sched_param
#include <string.h>  // memset

struct sched_param schedparm;
memset(&schedparm, 0, sizeof(schedparm));
schedparm.sched_priority = 99; // highest real-time priority (1-99 for SCHED_FIFO on Linux)
sched_setscheduler(0, SCHED_FIFO, &schedparm);

Minimizing the timer slack

prctl(PR_SET_TIMERSLACK, 1); // requires <sys/prctl.h>; slack value is in nanoseconds

Using timerfds (available since Linux kernel 2.6.25)

#include <sys/timerfd.h>  // timerfd_create, timerfd_settime
#include <strings.h>      // bzero

int timerfd = timerfd_create(CLOCK_MONOTONIC, 0);
struct itimerspec timspec;
bzero(&timspec, sizeof(timspec));
timspec.it_interval.tv_sec = 0;
timspec.it_interval.tv_nsec = nanosecondInterval; // period between expirations
timspec.it_value.tv_sec = 0;
timspec.it_value.tv_nsec = 1; // arm the timer (almost) immediately

timerfd_settime(timerfd, 0, &timspec, NULL);
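
For reference, the wait itself is a blocking read() on the timerfd; the 8-byte value it returns is the number of expirations since the last read, so a value greater than 1 indicates a missed (late) cycle. A minimal sketch of that wait loop (send_request() is a placeholder, not from the gist):

uint64_t expirations; // needs <stdint.h> and <unistd.h>
while (read(timerfd, &expirations, sizeof(expirations)) == sizeof(expirations)) {
    // issue one request per expiration; a count > 1 means we arrived late
    while (expirations--)
        send_request(); // placeholder for the actual per-tick work
}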

Possible improvements

  1. Dedicate a processor to this process?
  2. Use a nonblocking timerfd so that I can create a tight loop, instead of blocking (a tight loop will waste more CPU, but may also be quicker to respond to an alarm; see the sketch after this list)
  3. Using an external embedded device for triggering (can't imagine why this would be better)
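
A rough sketch of improvement #2, assuming the timer is created with TFD_NONBLOCK (handle_request() is again a placeholder, not from the gist):

#include <stdint.h>
#include <unistd.h>
#include <sys/timerfd.h>

int timerfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK); // nonblocking this time
// ... arm it with timerfd_settime() as above ...

uint64_t expirations;
for (;;) {
    // With TFD_NONBLOCK, read() returns -1/EAGAIN until the timer fires,
    // so this loop burns a core but reacts as soon as an expiration lands.
    if (read(timerfd, &expirations, sizeof(expirations)) == sizeof(expirations)) {
        while (expirations--)
            handle_request();
    }
}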

Why

I'm currently working on creating a workload generator for a benchmarking engine. The workload generator simulates an arrival rate (X requests / second, etc.) using a Poisson process. From the Poisson process, I can determine the relative times at which requests must be made from the benchmarking engine.

So for instance, at 10 requests a second, we may have requests made at: t = 0.02, 0.04, 0.05, 0.056, 0.09 seconds

These requests need to be scheduled in advance and then executed. As the number of requests per second increases, the granularity required for scheduling these requests increases (thousands of requests per second requires sub-millisecond accuracy). As a result, I'm trying to figure out how to scale this system further.
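
For what it's worth, the inter-arrival gaps of a Poisson process are exponentially distributed, so the schedule can be generated with something like the sketch below (lambda, rand(), and the function name are illustrative, not taken from the actual generator):

#include <math.h>
#include <stdlib.h>

// Next inter-arrival gap (in seconds) for a Poisson process with
// arrival rate `lambda` requests per second: gap = -ln(U) / lambda.
double next_interarrival(double lambda)
{
    double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0); // uniform in (0, 1)
    return -log(u) / lambda;
}

Summing successive gaps gives the absolute request times, like the t values above.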

Mutualism answered 12/11, 2013 at 10:53 Comment(5)
Have you patched your kernel to be real-time with the CONFIG_PREEMPT_RT flag enabled? – Ammoniac
@Ammoniac No, I have not -- thanks for pointing me towards that. – Mutualism
I don't see what your question really has to do with C or C++. They are just interface languages for you, aren't they? So perhaps remove the tags, and also replace C++ with POSIX in your question title. – Potaufeu
@JensGustedt I figured C would give me a little more visibility with people in this area? – Mutualism
@BSchlinker, "visibility" shouldn't be your goal. Tags and titles are there to attract the experts with the right skills to your question, not to distract others who may not have much to say on the particular question. – Potaufeu

You're very close to the limits of what vanilla Linux will offer you, and it's way past what it can guarantee. Adding the real-time patches to your kernel and tuning for full pre-emption will help give you better guarantees under load. I would also remove any dynamic memory allocation from your time-critical code; malloc and friends can (and will) stall for a not-inconsequential (in a real-time sense) period of time if they have to reclaim memory from the I/O cache. I would also consider removing swap from that machine to help guarantee performance. Dedicating a processor to your task will help to avoid context switches but, again, it's no guarantee.
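
A hedged sketch of the memory-locking and CPU-pinning side of this advice (these are standard Linux/glibc calls, but the core number and placement are illustrative, not from the answer):

#define _GNU_SOURCE    // for CPU_ZERO / CPU_SET / sched_setaffinity
#include <sched.h>
#include <sys/mman.h>  // mlockall

// Lock current and future pages into RAM so page faults and swap
// cannot stall the time-critical path.
mlockall(MCL_CURRENT | MCL_FUTURE);

// Pin the process to a single CPU (core 3 here is just an example),
// ideally one isolated from the general scheduler with isolcpus=.
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(3, &set);
sched_setaffinity(0, sizeof(set), &set);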

I would also suggest being careful with that level of sched_priority; at 99 you're above various important bits of Linux, which can lead to very strange effects.

Apoenzyme answered 12/11, 2013 at 11:6 Comment(3)
"Adding the real-time patches to your kernel and tuning for full pre-emption will help give you better guarantees under load." Can you point me towards any resources which talk about these items? I have RT knowledge for embedded systems but have never applied it to Linux..Mutualism
Interestingly, using the full real-time patch actually degrades the application's performance and provides less granularity then I had achieved without it. I am guessing that this is because the call to notify me that the timer has gone "off" is now being preempted, and thus not propagating to my application fast enough. Does this sound reasonable? I have changed the RT priority to 1 from 99 to attempt to resolve this, but no impact.Mutualism
You may find that best case is worst, but worst case is better. Have you put your system under serious CPU/Memory/IO load whilst running your tests?Apoenzyme

What you gain from building a realtime kernel is a more reliable guarantee (i.e. a lower maximum latency) on the time between an I/O or timer event being handled by the kernel and control being passed to your app in response. This comes at the price of lower throughput, and you might notice an increase in your best-case latency times.

However, the only reason for using OS timers to schedule events with high precision is if you're afraid of burning CPU cycles in a loop while you wait for your next due event. OS timers (especially in MS Windows) are not reliable for high-granularity timing events, and are very dependent on the sort of timer/HPET hardware available in your system.

When I require highly accurate event scheduling, I use a hybrid method. First, I measure the worst case latency - that is, the biggest difference between the time I requested to sleep, and the actual clock time after sleeping. Let's call this difference "D". (You can actually do this on-the-fly during normal running, by tracking "D" every time you sleep, with something like "D = (D*7 + lastD) / 8" to produce a temporal average).

Then never request to sleep beyond "N - D*2", where "N" is the time of the next event. When within "D*2" time of the next event, enter a spin loop and wait for "N" to occur.

This eats a lot more CPU cycles, but depending on the accuracy you require, you might be able to get away with a "sched_yield()" in your spin loop, which is kinder to your system.
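
A rough sketch of this hybrid sleep-then-spin approach, under the assumption of a monotonic clock and nanosecond bookkeeping (the wait_until and now_ns names are illustrative, not from the answer):

#include <sched.h>  // sched_yield
#include <time.h>   // clock_gettime, clock_nanosleep, CLOCK_MONOTONIC, TIMER_ABSTIME

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

// Sleep coarsely until within 2*D of the deadline, then spin until it
// arrives; D is the running estimate of oversleep, kept as a moving average.
static void wait_until(long long deadline_ns, long long *D)
{
    long long wake = deadline_ns - 2 * (*D);
    if (wake > now_ns()) {
        struct timespec ts = { wake / 1000000000LL, wake % 1000000000LL };
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
        long long late = now_ns() - wake;   // how far past the requested wake-up we slept
        if (late > 0)
            *D = (*D * 7 + late) / 8;       // "D = (D*7 + lastD) / 8"
    }
    while (now_ns() < deadline_ns)
        sched_yield();                      // or a pure busy-spin for tighter accuracy
}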

Keyes answered 16/5, 2014 at 17:28 Comment(0)
