How to estimate the thread context switching overhead?

I am trying to improve the performance of a threaded application with real-time deadlines. It is running on Windows Mobile and written in C/C++. I have a suspicion that the high frequency of thread switching might be causing tangible overhead, but can neither prove nor disprove it. As everybody knows, lack of proof is not proof of the opposite :).

Thus my question is threefold:

  • If they exist at all, where can I find actual measurements of the cost of switching thread context?

  • Without spending time writing a test application, what are the ways to estimate the thread switching overhead in the existing application?

  • Does anyone know a way to find out the number of context switches (on / off) for a given thread?

Leanora answered 20/11, 2008 at 9:21 Comment(2)
I believe that thread switching is heavily dependent on the amount of 'memory' and state a single thread 'contains'. If all your threads do lots of work on huge bitmaps, a thread switch can be very expensive. A thread that simply increments a single counter has a very small thread switch overhead.Peper
The accepted answer is wrong. Context switching is expensive because of cache invalidation. Of course if you benchmark just the thread switch with a counter increment it seems fast, but that's an unrealistic, worthless benchmark. It's not even really a context switch when the context is just the counter register.Ollieollis

While you said you don't want to write a test application, I did this for a previous test on an ARM9 Linux platform to find out what the overhead is. It was just two threads that would boost::thread::yield() (or similar) and increment some variable, and after a minute or so (without other running processes, at least none that do anything), the app printed how many context switches it could do per second. Of course this is not really exact, but the point is that both threads yielded the CPU to each other, and it was so fast that it just didn't make sense any more to think about the overhead. So, simply go ahead and write a simple test instead of thinking too much about a problem that may be non-existent; a sketch of such a test follows at the end of this answer.

Other than that, you might try, as 1800 suggested, using performance counters.

Oh, and I remember an application running on Windows CE 4.X where we also had four threads with intensive switching at times, and we never ran into performance issues. We also tried to implement the core threading functionality without threads at all, and saw no performance improvement (the GUI just responded much more slowly, but everything else was the same). Maybe you can try the same, by either reducing the number of context switches or by removing threads completely (just for testing).
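
For reference, here is a minimal sketch of that kind of test, assuming a POSIX system; it substitutes sched_yield() for boost::thread::yield(), and the shared counter is deliberately non-atomic (losing a few increments is fine for a rough rate). On a multi-core machine, pin the process to one CPU (e.g. with taskset), or the two threads may run in parallel and never actually switch.

#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static volatile uint64_t yields;   /* approximate: increments may race */
static volatile int running = 1;

static void *worker(void *unused)
{
    (void)unused;
    while (running) {
        yields++;          /* count one pass through the loop... */
        sched_yield();     /* ...then offer the CPU to the other thread */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    sleep(1);              /* let the threads ping-pong for about a second */
    running = 0;
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("~%llu yields per second\n", (unsigned long long)yields);
    return 0;
}

If the printed rate is in the hundreds of thousands, raw switching overhead is unlikely to be your bottleneck.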

Moleskin answered 20/11, 2008 at 9:37 Comment(2)
Thanks, this affirmation that switching times are minimal is what I needed.Leanora
This answer is wrong unless your application is just incrementing counters. Context switching is very expensive, not because of the CPU operation itself but because all cache is invalidated when you switch to another thread. You must benchmark context switches with real work that fills the cache on each thread.Ollieollis

I doubt you can find this overhead somewhere on the web for any existing platform; there are just too many different platforms. The overhead depends on two factors:

  • The CPU, as the necessary operations may be easier or harder on different CPU types
  • The system kernel, as different kernels will have to perform different operations on each switch

Another factor is how the switch takes place. A switch can take place when:

  1. the thread has used all of its time quantum. When a thread is started, it may run for a given amount of time before it has to return control to the kernel, which will decide who's next.

  2. the thread was preempted. This happens when another thread needs CPU time and has a higher priority. E.g. the thread that handles mouse/keyboard input may be such a thread. No matter what thread owns the CPU right now, when the user types something or clicks something, they don't want to wait till the current thread's time quantum has been used up completely; they want to see the system react immediately. Thus some systems will make the current thread stop immediately and return control to some other thread with higher priority.

  3. the thread doesn't need CPU time anymore, because it's blocking on some operation or just called sleep() (or similar) to stop running.

These three scenarios might have different thread-switching times in theory. E.g. I'd expect the last one to be slowest, since a call to sleep() means the CPU is given back to the kernel, and the kernel needs to set up a wake-up call that will make sure the thread is woken up after about the amount of time it requested to sleep. It then must take the thread out of the scheduling process, and once the thread is woken up, it must add the thread back to the scheduling process. All these steps will take some amount of time, so the actual sleep call might take longer than the time it takes to switch to another thread.

I think if you want to know for sure, you must benchmark. The problem is that you will usually have to either put threads to sleep or synchronize them using mutexes, and sleeping or locking/unlocking mutexes has an overhead of its own. This means your benchmark will include these overheads as well. Without a powerful profiler, it's hard to say afterwards how much CPU time was used for the actual switch and how much for the sleep/mutex call.

On the other hand, in a real-life scenario, your threads will either sleep or synchronize via locks as well. A benchmark that purely measures the context switch time is a synthetic benchmark, as it does not model any real-life scenario. Benchmarks are much more "realistic" if they are based on real-life scenarios. Of what use is a GPU benchmark that tells me my GPU can in theory handle 2 billion polygons a second, if this result can never be achieved in a real-life 3D application? Wouldn't it be much more interesting to know how many polygons per second a real-life 3D application can have the GPU handle?

Unfortunately I know nothing of Windows programming. I could write an application for Windows in Java or maybe in C#, but C/C++ on Windows makes me cry. I can only offer you some source code for POSIX.

#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <pthread.h>
#include <sys/time.h>
#include <unistd.h>

uint32_t COUNTER;
pthread_mutex_t LOCK;
pthread_mutex_t START;
pthread_cond_t CONDITION;

void * threads (
    void * unused
) {
    // Wait till we may fire away
    pthread_mutex_lock(&START);
    pthread_mutex_unlock(&START);

    pthread_mutex_lock(&LOCK);
    // If I'm not the first thread, the other thread is already waiting on
    // the condition, thus I have to wake it up first, otherwise we'll deadlock
    if (COUNTER > 0) {
        pthread_cond_signal(&CONDITION);
    }
    for (;;) {
        COUNTER++;
        pthread_cond_wait(&CONDITION, &LOCK);
        // Always wake up the other thread before processing. The other
        // thread will not be able to do anything as long as I don't go
        // back to sleep first.
        pthread_cond_signal(&CONDITION);
    }
    pthread_mutex_unlock(&LOCK); // Never reached; kept for symmetry
    return NULL;
}

int64_t timeInMS ()
{
    struct timeval t;

    gettimeofday(&t, NULL);
    return (
        (int64_t)t.tv_sec * 1000 +
        (int64_t)t.tv_usec / 1000
    );
}


int main (
    int argc,
    char ** argv
) {
    int64_t start;
    pthread_t t1;
    pthread_t t2;
    int64_t myTime;

    pthread_mutex_init(&LOCK, NULL);
    pthread_mutex_init(&START, NULL);   
    pthread_cond_init(&CONDITION, NULL);

    pthread_mutex_lock(&START);
    COUNTER = 0;
    pthread_create(&t1, NULL, threads, NULL);
    pthread_create(&t2, NULL, threads, NULL);
    pthread_detach(t1);
    pthread_detach(t2);
    // Get start time and fire away
    myTime = timeInMS();
    pthread_mutex_unlock(&START);
    // Wait for about a second
    sleep(1);
    // Stop both threads
    pthread_mutex_lock(&LOCK);
    // Find out how much time has really passed. sleep won't guarantee me that
    // I sleep exactly one second, I might sleep longer since even after being
    // woken up, it can take some time before I gain back CPU time. Further
    // some more time might have passed before I obtained the lock!
    myTime = timeInMS() - myTime;
    // Correct the number of thread switches accordingly
    COUNTER = (uint32_t)(((uint64_t)COUNTER * 1000) / myTime);
    printf("Number of thread switches in about one second was %u\n", COUNTER);
    return 0;
}

Output

Number of thread switches in about one second was 108406

Over 100,000 switches per second is not too bad, even though we have locking and condition waits. I'd guess that without all this stuff, at least twice as many thread switches per second would be possible.

Agatha answered 20/11, 2008 at 10:46 Comment(2)
Which part of "Unfortunately I know nothing of Windows programming...I can only offer you some source code for POSIX." didn't you understand?Agatha
No, I understand fully, but your answer doesn't help the guy who asked the original question, and the whole point is to help those who ask questions.Impersonate

You can't estimate it. You need to measure it. And it's going to vary depending on the processor in the device.

There are two fairly simple ways to measure a context switch. One involves code, the other doesn't.

First, the code way (pseudocode):

LARGE_INTEGER tick;

DWORD WINAPI ThreadProc(LPVOID unused)
{
  LARGE_INTEGER now;
  QueryPerformanceCounter(&now);
  RETAILMSG(TRUE, (_T("ET: %i\r\n"), (int)(now.QuadPart - tick.QuadPart)));
  return 0;
}

int main()
{
  HANDLE hThread = CreateThread(NULL, 0, ThreadProc, NULL,
                                CREATE_SUSPENDED, NULL);
  CeSetThreadPriority(hThread, 10); // real high
  QueryPerformanceCounter(&tick);   // stamp right before releasing the thread
  ResumeThread(hThread);
  Sleep(10);                        // give the high-priority thread time to run
  return 0;
}

Obviously doing it in a loop and averaging will be better; a looped variant is sketched below. Keep in mind that this doesn't just measure the context switch: you're also measuring the call to ResumeThread, and there's no guarantee the scheduler is going to switch immediately to your other thread (though the priority of 10 should help increase the odds that it will).
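
For the loop-and-average variant, one option (an illustrative sketch, not from the original answer) is to ping-pong two auto-reset events between the main thread and a worker: each round trip then contains two context switches plus two event operations, so dividing the elapsed high-resolution ticks by 2 × ROUNDS gives an upper bound per switch with the one-shot noise averaged away.

#include <stdio.h>
#include <windows.h>

#define ROUNDS 10000

HANDLE hEvtA, hEvtB; // auto-reset events used as the ping-pong tokens

DWORD WINAPI PingPongProc(LPVOID unused)
{
  int i;
  for (i = 0; i < ROUNDS; i++) {
    WaitForSingleObject(hEvtA, INFINITE); // wait for main's "ping"
    SetEvent(hEvtB);                      // answer with "pong"
  }
  return 0;
}

int main()
{
  LARGE_INTEGER freq, start, stop;
  int i;

  hEvtA = CreateEvent(NULL, FALSE, FALSE, NULL);
  hEvtB = CreateEvent(NULL, FALSE, FALSE, NULL);
  CreateThread(NULL, 0, PingPongProc, NULL, 0, NULL);

  QueryPerformanceFrequency(&freq);
  QueryPerformanceCounter(&start);
  for (i = 0; i < ROUNDS; i++) {
    SetEvent(hEvtA);                      // wake the worker...
    WaitForSingleObject(hEvtB, INFINITE); // ...and block until it answers
  }
  QueryPerformanceCounter(&stop);

  // Each round trip = two switches + two SetEvent/Wait pairs, so this is
  // an upper bound on the switch itself. Swap printf for RETAILMSG on
  // devices without a console.
  printf("~%.2f us per switch (upper bound)\n",
         (double)(stop.QuadPart - start.QuadPart) * 1e6 /
         (double)freq.QuadPart / (2.0 * ROUNDS));
  return 0;
}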

You can get a more accurate measurement with CeLog by hooking into scheduler events, but it's far from simple to do and not very well documented. If you really want to go that route, Sue Loh has several blogs on it that a search engine can find.

The non-code route would be to use Remote Kernel Tracker. Install eVC 4.0 or the eval version of Platform Builder to get it. It will give a graphical display of everything the kernel is doing and you can directly measure a thread context switch with the provided cursor capabilities. Again, I'm certain Sue has a blog entry on using Kernel Tracker as well.

All that said, you're going to find that CE intra-process thread context switches are really, really fast. It's the process switches that are expensive, as those require swapping the active process in RAM and then doing the migration.

Impersonate answered 20/11, 2008 at 14:7 Comment(1)
There's no point in measuring context switching if your application is not doing real work that fills the cache. (Check my answer at the bottom.)Ollieollis

A context switch is very expensive, not because of the CPU operation itself but because of cache invalidation. If you have an intensive task running, it will fill the CPU caches, both for instructions and data; the memory prefetcher, the TLB, and the RAM will likewise have optimized their work toward some areas of RAM.

When you change context, all these cache mechanisms are reset and the new thread starts from a "blank" state.

The accepted answer is wrong unless your threads are just incrementing a counter; of course there is no cache flush involved in that case. There is no point in benchmarking context switching without filling the cache the way real applications do; a sketch of such a benchmark follows.
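
To make that concrete, here is a minimal sketch (POSIX; WORKSET and ROUNDS are made-up parameters, not from the answer) in which two threads are forced to alternate strictly and each walks its own buffer between switches, so every switch is paid for again in cache refills. Run it with WORKSET at a few KB and again at a few MB to watch the per-iteration cost grow far beyond what counter-only benchmarks report.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define WORKSET (1 << 20)   /* 1 MiB per thread: larger than most L1/L2 */
#define ROUNDS  10000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int turn;            /* whose turn it is: 0 or 1 */

static void *worker(void *arg)
{
    int me = (int)(intptr_t)arg;
    unsigned char *buf = malloc(WORKSET);
    volatile unsigned sink = 0;
    int r, i;

    memset(buf, me, WORKSET);
    for (r = 0; r < ROUNDS; r++) {
        pthread_mutex_lock(&lock);
        while (turn != me)                 /* sleep until it's my turn */
            pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);

        for (i = 0; i < WORKSET; i += 64)  /* touch one byte per cache line */
            sink += buf[i];

        pthread_mutex_lock(&lock);
        turn = 1 - me;                     /* hand the turn to the peer */
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    free(buf);
    return NULL;
}

int main(void)
{
    struct timespec t0, t1;
    pthread_t a, b;
    double ns;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&a, NULL, worker, (void *)(intptr_t)0);
    pthread_create(&b, NULL, worker, (void *)(intptr_t)1);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    // The figure includes the buffer walk itself; subtract a single-threaded
    // run of the same total work to isolate the switch + cache refill cost.
    printf("%.0f ns per iteration (work + switch + cache refill)\n",
           ns / (2.0 * ROUNDS));
    return 0;
}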

Ollieollis answered 14/3, 2019 at 11:42 Comment(0)

My 50 lines of C++ show, for Linux (quad-core Q6600), a context switch time of ~0.9 µs (0.75 µs for 2 threads, 0.95 µs for 50 threads). In this benchmark, threads call yield immediately when they get a quantum of time.

Esposito answered 15/5, 2010 at 14:50 Comment(7)
.9 NANOSECONDS? Are you sure? ... <rummages...> your code appears to be computing milliseconds/switch * 1000 -> microseconds.Rita
@IraBaxter that is not nano-sec, 1000us==1ms 1000ms==1sBrawny
over 1000 switches per milli-sec?? Are you sure?Brawny
It probably needs a retesting given it’s CFS now...Esposito
But yes, >1000 switches per millisecond, that's for sure; it will be even more under CFS.Esposito
@Scott: check the message edit history. It used to say "Nanoseconds".Rita
@Esposito The link is broken. do you still have the code somewhere that can be shared?Kalmia

Context switches are expensive; as a rule of thumb, one costs about 30 µs of CPU overhead: http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html

Cloris answered 19/8, 2011 at 8:21 Comment(0)

I've only ever tried to estimate this once, and that was on a 486! The upshot was that the processor context switch took about 70 instructions to complete (note this happened for many OS API calls as well as thread switching). We calculated that it was taking approx. 30 µs per thread switch (including OS overhead) on a DX3. The few thousand context switches we were doing per second were absorbing between 5 and 10% of the processor time.

How that would translate to a multi-core, multi-GHz modern processor I don't know, but I would guess that unless you are completely going over the top with thread switching, it's a negligible overhead.

Note that thread creation/deletion is a more expensive CPU/OS hog than activating/deactivating threads. A good policy for heavily threaded apps is to use thread pools and activate/deactivate threads as required; a sketch of the idea follows.
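
For illustration, here is a minimal sketch of that policy with POSIX threads (all names are made up for the example, not an existing API): a fixed set of workers sleeps on a condition variable and wakes only when a task is queued, so no thread is ever created or destroyed per task. A real pool would also need a full-queue check and a shutdown path, both omitted for brevity.

#include <pthread.h>

#define POOL_SIZE 4
#define QUEUE_LEN 64

typedef void (*task_fn)(void *);

static struct { task_fn fn; void *arg; } queue[QUEUE_LEN];
static int head, tail;                   /* ring buffer indices */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t qcond = PTHREAD_COND_INITIALIZER;

static void *worker(void *unused)
{
    (void)unused;
    for (;;) {                           /* runs forever; real pools need shutdown */
        pthread_mutex_lock(&qlock);
        while (head == tail)             /* no work: block, costing nothing */
            pthread_cond_wait(&qcond, &qlock);
        task_fn fn = queue[head].fn;
        void *arg = queue[head].arg;
        head = (head + 1) % QUEUE_LEN;
        pthread_mutex_unlock(&qlock);
        fn(arg);                         /* run the task outside the lock */
    }
    return NULL;
}

void pool_start(void)
{
    pthread_t t;
    int i;
    for (i = 0; i < POOL_SIZE; i++) {
        pthread_create(&t, NULL, worker, NULL);
        pthread_detach(t);
    }
}

void pool_submit(task_fn fn, void *arg)  /* no full-queue check (sketch) */
{
    pthread_mutex_lock(&qlock);
    queue[tail].fn = fn;
    queue[tail].arg = arg;
    tail = (tail + 1) % QUEUE_LEN;
    pthread_cond_signal(&qcond);         /* wake exactly one idle worker */
    pthread_mutex_unlock(&qlock);
}

Submitting work then costs one lock/signal pair and at worst one context switch, instead of a full thread create/delete cycle.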

Esau answered 20/11, 2008 at 12:16 Comment(0)

The problem with context switches is that they have a fixed time. GPUs implement a one-cycle context switch between threads. The following, for example, cannot be threaded on CPUs:

double * a;
...
for (i = 0; i < 1000; i++)
{
    a[i] = a[i] + a[i];
}

because its time of execution is much less than the context switch cost. On a Core i7 this code takes around 1 microsecond (depending on the compiler). So context switch time does matter, because it defines how small a job can be and still be worth threading. I guess this also provides a method for effectively measuring the context switch: check how long the array (in the example above) has to be before two threads from a thread pool start showing some real advantage over a single-threaded one. This may easily become 100,000 elements, and the effective context switch time would therefore be somewhere in the range of 20 µs within the same app.

All the encapsulation used by the thread pool has to be counted towards the thread switch time, because that is what it all comes down to in the end; a sketch of such a crossover measurement follows.
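
Here is a hedged sketch of that measurement for POSIX (it uses a raw pthread_create per trial rather than the thread pool the answer mentions, which somewhat overstates the dispatch overhead; average several runs in practice). The array length at which two threads start beating one bounds the effective per-dispatch cost:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct { double *a; size_t lo, hi; } span_t;

static void *double_span(void *p)
{
    span_t *s = p;
    size_t i;
    for (i = s->lo; i < s->hi; i++)
        s->a[i] = s->a[i] + s->a[i];
    return NULL;
}

static double elapsed_ns(struct timespec t0, struct timespec t1)
{
    return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int main(void)
{
    size_t n;
    for (n = 1000; n <= 10000000; n *= 10) {
        double *a = calloc(n, sizeof *a);
        span_t whole = { a, 0, n }, lo = { a, 0, n / 2 }, hi = { a, n / 2, n };
        struct timespec t0, t1;
        pthread_t t;
        double one, two;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        double_span(&whole);                 /* single-threaded pass */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        one = elapsed_ns(t0, t1);

        /* Two-thread pass; note the first pass also warmed the cache. */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&t, NULL, double_span, &hi);
        double_span(&lo);
        pthread_join(t, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        two = elapsed_ns(t0, t1);

        printf("n=%8zu  1 thread: %10.0f ns  2 threads: %10.0f ns\n",
               n, one, two);
        free(a);
    }
    return 0;
}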

Atmapuri

Scleritis answered 28/2, 2010 at 10:37 Comment(0)

I don't know, but do you have the usual performance counters in Windows Mobile? You could look at things like context switches/sec. I don't know whether there is one that specifically measures context switch time, though.

Decelerate answered 20/11, 2008 at 9:27 Comment(0)

Interesting to remember that Windows Mobile has itself been switched off for good in the meantime. Nonetheless, the conundrum of 'thread switching overhead', 'thread switching time', and 'reduction of available computational capacity through threading' (not the same thing!) remains largely a mystery to many.

@Mecki above provided code to measure POSIX thread-switching 'timings' by counting the number of switches made in one second. I have run that program (on two different hardware/Windows/Cygwin platforms of quite different performance) and am stymied by the result that an essentially unchanged ~100,000 thread switches per second is measured on both.

If I had to speculate what that means, I would say the hand brake is probably on. What the OS will never allow to happen is thread 'thrashing', where the OS is monopolized by a few threads doing nothing but context switches for amusement.

This gives us a good rule of thumb: on a modern platform, a thread should be provided with a minimum workload (and hopefully much more than that) of ten times the ~10 microseconds a switch costs. This is just for everybody's sanity.

Exceptions being the 'controlled' environments, e.g. GPUs, OS kernels, embedded devices etc, where there is no concern for user-written badly-behaved code.

The other flaw in thinking is: if it takes that much time to switch threads, why do it at all? Short answer: the switching time is largely incurred by the OS. Users will still see their threads making progress, just slightly impeded by the occasional switches. So for eight threads, each will incur on average at most 100,000/8 = 12,500 switches per second, with a delay of 12,500 × 10 microseconds = 125 milliseconds, which leaves 87.5% of the time open for net useful computation. If the number of thread switches possible per second were increased by a (multi-threaded) kernel, then in the end that would give users the wrong idea of how to split their workloads and thus further decrease the usefully spent percentage of overall computational capacity.

Of course I agree in principle with @bokan that context switching is generally more expensive than we might think, but this is not always the case if we use threads to work on relatively small chunks of data, with no huge amounts of data on the stack, and with code locality shared with other threads.

Candycecandystriped answered 8/6 at 14:57 Comment(0)
