How much overhead is there when creating a thread?

Asked 14/10, 2010 at 3:3 Answered 27/6, 2021 at 19:38

I just reviewed some really terrible code - code that sends messages on a serial port by creating a new thread to package and assemble every single message sent. Yes, for every message a pthread is created, bits are properly set up, then the thread terminates.

I haven't a clue why anyone would do such a thing, but it raises the question - how much overhead is there when actually creating a thread?

Nebuchadnezzar answered 14/10, 2010 at 3:3 Comment(8)

A single worker thread will help with some of the other kinds of resource contention that you might have with a multiple threads (interleaved writes etc). – Metamorphosis 14/10, 2010 at 4:55

I would like to comment that yes, this is arguably a horrible thing to do - which leads to my question: how much overhead is incurred when creating a thread (in general)? Frankly, I do not know how to determine or even measure this for an implementation of the pthread library. – Nebuchadnezzar 14/10, 2010 at 20:55

The overhead is 0.37%, given the right message size. – Life 10/9, 2015 at 10:2

@LutzPrechelt Can you explain how you came to that number? – Finance 13/10, 2023 at 22:15

@Dai: Yes. For an infinitely long message, the overhead will be 0%, for an empty message it will be 100%. So for some message size in between (the "right" message size) it will be approximately 0.37%. – Life 16/10, 2023 at 9:13

@LutzPrechelt I'm sorry, but I can't tell if you're being facetious or not. – Finance 16/10, 2023 at 9:58

@Dai: Yes, I am joking, sort of: The original question is lacking crucial information. Without information about the transfer rate of the serial port and the length of the messages, a proper answer cannot be given. – Life 17/10, 2023 at 11:27

I’m shocked that a question I asked 13 years ago is suddenly getting traction. Since this was something I worked on so long ago (4 or 5 jobs ago), my recollection is vague. But I believe the intent of my question was really to try to understand what overhead there might be in spinning up a new thread for every message to be sent along a serial port (which still seems odd to me all these years later), and that overhead would be independent of the baud rate of the actual serial send. – Nebuchadnezzar 18/10, 2023 at 14:26

...sends Messages on a serial port ... for every message a pthread is created, bits are properly set up, then the thread terminates. ...how much overhead is there when actually creating a thread?

This is highly system-specific. For example, the last time I used VMS, threading was nightmarishly slow. It has been years, but from memory one thread could create something like 10 more per second, and if you kept that up for a few seconds without threads exiting you'd core. On Linux, by contrast, you can probably create thousands per second. If you want to know exactly, benchmark it on your system. But knowing that number alone isn't much use without knowing more about the messages: whether they average 5 bytes or 100k, whether they're sent contiguously or the line idles in between, and what the latency requirements for the app are. These are all as relevant to the appropriateness of the code's thread use as any absolute measurement of thread creation overhead. And performance may not have needed to be the dominant design consideration.

Preparation answered 14/10, 2010 at 4:58 Comment(0)

To resurrect this old thread, I just did some simple test code:

#include <thread>

int main(int argc, char** argv)
{
  for (volatile int i = 0; i < 500000; i++)
    std::thread([](){}).detach();
  return 0;
}

I compiled it with g++ test.cpp -std=c++11 -lpthread -O3 -o test. I then ran it three times in a row on an old (kernel 2.6.18), heavily loaded (doing a database rebuild), slow laptop (Intel Core i5-2540M). Results from three consecutive runs: 5.647s, 5.515s, and 5.561s. So we're looking at a tad over 10 microseconds per thread on this machine (roughly 5.5 s / 500,000 ≈ 11 µs), probably much less on yours.
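For anyone who wants a per-thread figure directly rather than dividing the output of time by hand, here is a minimal sketch of a timing helper. It is not the code that produced the numbers above: the function name mean_thread_cost_us is made up for illustration, and it joins each thread instead of detaching it, so the result covers creation plus teardown and never has more than one extra thread alive at once.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper (name chosen for this sketch): returns the mean
// wall-clock cost, in microseconds, of creating and joining `count`
// threads that do nothing.
double mean_thread_cost_us(int count)
{
    assert(count > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < count; ++i) {
        std::thread worker([] {});  // empty thread body, as in the test above
        worker.join();              // wait for it, so teardown is included
    }
    const auto elapsed = std::chrono::steady_clock::now() - start;
    // Convert to floating-point microseconds and average over all threads.
    return std::chrono::duration<double, std::micro>(elapsed).count() / count;
}
```

Calling mean_thread_cost_us(500000) from main and printing the result should give a number comparable to the division above, although join and detach are not identical workloads, so expect some difference.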

That's not much overhead at all, given that serial ports max out at around 1 bit per 10 microseconds. Now, of course there are various additional thread costs one can incur involving passed/captured arguments (although function calls themselves can impose some), cache slowdowns between cores (if multiple threads on different cores are battling over the same memory at the same time), etc. But in general I highly doubt the use case you presented will adversely impact performance at all (and it could provide benefits, depending), despite you having already preemptively labeled the concept "really terrible code" without even knowing how much time it takes to launch a thread.

Whether it's a good idea or not depends a lot on the details of your situation. What else is the calling thread responsible for? What precisely is involved in preparing and writing out the packets? How frequently are they written out (with what sort of distribution? uniform, clustered, etc...?) and what's their structure like? How many cores does the system have? Etc. Depending on the details, the optimal solution could be anywhere from "no threads at all" to "shared thread pool" to "thread for each packet".

Note that thread pools aren't magic and can in some cases be a slowdown versus unique threads, since one of the biggest slowdowns with threads is synchronizing cached memory used by multiple threads at the same time, and thread pools by their very nature of having to look for and process updates from a different thread have to do this. So either your primary thread or child processing thread can get stuck having to wait if the processor isn't sure whether the other process has altered a section of memory. By contrast, in an ideal situation, a unique processing thread for a given task only has to share memory with its calling task once (when it's launched) and then they never interfere with each other again.

Omora answered 4/1, 2015 at 10:29 Comment(1)

Tangent: ran across this thread, and as a Windows user I was curious how my system fared. Compiling under MSVC with standard release optimisations and running on a 6700K, it took 31.442s to run fully. The only alterations I made were to add a std::chrono::high_resolution_clock + time_points before and after the loop and std::cout the result before exiting. Rather shocking results. I tried mingw-w64's 7.1.0 g++ with your exact command line arguments but it crashes after a few seconds, so no idea what's wrong there; same with a clang++ v8.0 I had lying around. – Nagey 25/1, 2019 at 17:29

I have always been told that thread creation is cheap, especially when compared to the alternative of creating a process. If the program you are talking about does not have a lot of operations that need to run concurrently then threading might not be necessary, and judging by what you wrote this might well be the case. Some literature to back me up:

http://www.personal.kent.edu/~rmuhamma/OpSystems/Myos/threads.htm

Threads are cheap in the sense that

They only need a stack and storage for registers therefore, threads are cheap to create.

Threads use very little resources of an operating system in which they are working. That is, threads do not need new address space, global data, program code or operating system resources.

Context switching are fast when working with threads. The reason is that we only have to save and/or restore PC, SP and registers.

More of the same here.

In Operating System Concepts 8th Edition (page 155) the authors write about the benefits of threading:

Allocating memory and resources for process creation is costly. Because threads share the resource of the process to which they belong, it is more economical to create and context-switch threads. Empirically gauging the difference in overhead can be difficult, but in general it is much more time consuming to create and manage processes than threads. In Solaris, for example, creating a process is about thirty times slower than is creating a thread, and context switching is about five times slower.

Pas answered 14/10, 2010 at 3:12 Comment(19)

But the alternative is probably a single reused thread, or a thread pool, not a process. – Subotica 14/10, 2010 at 3:27

@Matthew Flaschen I meant the alternative to creating any thread, not an alternative to creating a thread(s) in the way the question describes :) – Pas 14/10, 2010 at 3:30

Actually process creation is cheaper than thread creation. The fork part of processes creation has basically no cost as the memory pages are duplicated at the hardware level. See what the google chrome team found: hanselman.com/blog/… – Alexei 14/10, 2010 at 4:28

@Martin York according to my text books (and my professors) process creation is more costly than thread creation. I will add an excerpt from one of my text books so you can judge for yourself. – Pas 14/10, 2010 at 4:31

@ typoknig: It was true in the old days. I am not saying it is free but it is a myth that it is very expensive. See the article I linked in my last comment. – Alexei 14/10, 2010 at 4:32

@Martin York the article seems to say that process creation is faster than it once was because of "this thing called Moore's Law that keeps marching on", but that does not compare processes to threads on a level playing field. If process creation has gotten faster by an order of magnitude then thread creation would also have gotten faster. In my understanding, there is no way creating a process can ever be done faster than creating a thread because a process will always have more stuff in it than a thread. – Pas 14/10, 2010 at 4:45

Yes, your quotes are from text books that are OLD. In the old days a fork() required the OS to generate a copy of each memory page that the process was using. Modern versions of the kernel have moved the fork implementation into the protected realm and now use a copy-on-write technique (that basically means only one page (the page with the current stack frame) needs to be copied at the fork() point). See: cis.upenn.edu/~jms/cw-fork.pdf – Alexei 14/10, 2010 at 5:2

@typoknig: The problem with thread creation is that it involves a lot of work setting up a new stack. On the other hand, process creation nowadays is very cheap, as the pages used by the process are copied only when needed; thus your forked processes are in effect sharing the same memory pages. Please read the first article a little more closely. – Alexei 14/10, 2010 at 5:6

The advantage of threads: quicker to switch, data is easily shared. The advantage of fork: process protection (if the process dies it does not affect the other processes). – Alexei 14/10, 2010 at 5:12

@Martin York The link you provided is not dated, but its newest reference is 1990. My book is in its 8th edition and was copyrighted in 2009. The original version of my book was published in 1985 which is still newer than most of the references in the link you provided. The link you provided does not compare processes and threads, in fact it does not even mention threads. I would totally agree with your most recent comment, but I still think (based on what I have been taught and what my references say) process creation will always be more expensive than thread creation. – Pas 14/10, 2010 at 5:39

@typoknig: Try it and see. It is the only way to know for sure. – Alexei 14/10, 2010 at 5:41

@Martin York I actually have, and that is the only reason I would even try to stand my ground talking about this with you since you obviously know your stuff with such a big number next to your name :) I am in Advanced Operating Systems right now and most of the semester has been spent on POSIX threads. My professor had an example program which timed the creation of a thread vs. the creation of a process. Process creation always took considerably longer. The program was written in C and run in RedHat, I will see if I can get it from him. – Pas 14/10, 2010 at 5:46

I did not know RedHat was still around :-) . What version of the kernel are they using? – Alexei 14/10, 2010 at 14:21

Are you comparing fork/exec against thread creation? That is definitely more costly than thread creation (because of the exec), but it is not equivalent. You just want to compare the fork time against thread creation, as you can write the child code in the same application (just like a thread). Here are my two versions: fork vs pthread_create. padfly.com/threadVsfork Admittedly I have nothing for the child to do, but I create 1000 children in either and it takes 0 time (or it's so fast that time XXX records zero). – Alexei 14/10, 2010 at 15:0

[Alpha:~/X] myork% ./thread 1000 Start: 717836 Wait: 352700 All: 1070536 – Alexei 14/10, 2010 at 15:11

[Alpha:~/X] myork% ./fork 1000 Start: 22628 Wait: 1439 All: 24067 – Alexei 14/10, 2010 at 15:12

The above times were generated using the code posted here: padfly.com/threadVsfork Running on Darwin Kernel Version 10.4.0. Each version generates 1000 children (which do nothing and then exit). Though I would not rely on the absolute values of clock() as a timing mechanism, I think this shows that the cost of fork() is actually smaller. Now I will concede that different OSes will have different characteristics. Windows, for example, will not fare as well, but Unix-like OSes running a modern kernel will behave in this manner. Older kernels that do not use copy-on-write will also be slower. – Alexei 14/10, 2010 at 15:17
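Since the linked code may not be reachable, here is a minimal POSIX-only sketch of the same kind of comparison. It is not the program behind the numbers above: the function names are invented for this sketch, it uses a wall clock rather than clock(), and it waits for each child before starting the next instead of launching all 1000 first, so its numbers are not directly comparable.

```cpp
#include <cassert>
#include <chrono>
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>

// Thread body that returns immediately.
static void* thread_body(void*) { return nullptr; }

// Hypothetical helper: mean microseconds to fork a child that exits at
// once and to reap it. Returns -1.0 if fork() fails.
double mean_fork_us(int count)
{
    assert(count > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < count; ++i) {
        const pid_t pid = fork();
        if (pid == 0) _exit(0);        // child: do nothing and leave
        if (pid < 0) return -1.0;      // fork failed
        waitpid(pid, nullptr, 0);      // parent: reap the child
    }
    const auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration<double, std::micro>(elapsed).count() / count;
}

// Hypothetical helper: mean microseconds to create and join a pthread
// that returns at once. Returns -1.0 if pthread_create() fails.
double mean_pthread_us(int count)
{
    assert(count > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < count; ++i) {
        pthread_t tid;
        if (pthread_create(&tid, nullptr, thread_body, nullptr) != 0) return -1.0;
        pthread_join(tid, nullptr);
    }
    const auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration<double, std::micro>(elapsed).count() / count;
}
```

Printing mean_fork_us(1000) next to mean_pthread_us(1000) on the system in question is the quickest way to see which side of this argument applies there; the size of the parent's address space can affect the fork figure, so a tiny test program may flatter fork.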

@typoking: This is getting too heavy for a comment discussion. I have started a question: #3935492 – Alexei 14/10, 2010 at 15:47

@Pas: did you ever hear of fork bomb attacks using threads? – Chaffer 31/10, 2015 at 17:40

There is some overhead in thread creation, but compared with the usually slow baud rates of a serial port (19200 bits/sec being the most common), it just doesn't matter.
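To put rough numbers on that, here is a small sketch of the arithmetic. The helper names are made up for illustration, and it assumes 8N1 framing (1 start bit, 8 data bits, 1 stop bit, so 10 bits on the wire per byte), which the question does not specify.

```cpp
#include <cassert>

// Time in microseconds to clock `bytes` bytes out at `baud` bits per second.
// ASSUMPTION: 8N1 framing, i.e. 10 bits on the wire per byte.
double wire_time_us(int bytes, double baud)
{
    assert(bytes >= 0 && baud > 0.0);
    return bytes * 10.0 * 1e6 / baud;
}

// Fraction of (thread creation + transmission) time spent creating the
// thread, given a measured creation cost `create_us` in microseconds.
double creation_overhead_fraction(double create_us, int bytes, double baud)
{
    assert(create_us > 0.0);
    return create_us / (create_us + wire_time_us(bytes, baud));
}
```

At 19200 bits/sec a single byte takes about 521 µs under this assumption. Plugging in a creation cost of 11 µs (a figure measured elsewhere in this thread on one Linux laptop) and a 16-byte message gives creation_overhead_fraction(11.0, 16, 19200.0) of roughly 0.0013, i.e. about 0.13%.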

Phasia answered 14/10, 2010 at 3:11 Comment(1)

I agree. Why worry about microseconds of delay from creating a thread when networking is likely to cause delays in the dozens of milliseconds or even seconds? – Herbage 24/11, 2014 at 2:43

You definitely do not want to do this. Create a single thread or a pool of threads and just signal when messages are available. Upon receiving the signal, the thread can perform any necessary message processing.

In terms of overhead, thread creation/destruction, especially on Windows, is fairly expensive. Somewhere on the order of tens of microseconds, to be specific. It should, for the most part, only be done at the start/end of an app, with the possible exception of dynamically resized thread pools.
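A minimal sketch of the single-worker approach described above, using standard C++11 primitives. The class name SerialWorker and the send callback are hypothetical stand-ins: the callback is wherever the real code would package the message and write it to the port.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// One long-lived thread that sends queued messages in FIFO order.
class SerialWorker {
public:
    explicit SerialWorker(std::function<void(const std::string&)> send)
        : send_(std::move(send)), thread_([this] { run(); }) {}

    // Drains any queued messages, then stops and joins the worker.
    ~SerialWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_one();
        thread_.join();
    }

    // Called by producers: enqueue a message and signal the worker.
    void post(std::string message) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(message));
        }
        ready_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            // Sleep until there is work or we are asked to stop.
            ready_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
            if (queue_.empty()) return;  // stopping and nothing left to send
            std::string message = std::move(queue_.front());
            queue_.pop();
            lock.unlock();               // do not hold the lock while sending
            send_(message);
            lock.lock();
        }
    }

    std::function<void(const std::string&)> send_;
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::string> queue_;
    bool stopping_ = false;
    std::thread thread_;  // declared last so it starts after the other members exist
};
```

A producer would construct one SerialWorker at startup and call post() for each message. Because there is only one consumer, messages leave in the order they were posted, which also avoids interleaved writes on the port.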

Acrilan answered 14/10, 2010 at 3:13 Comment(2)

Yes, an "eternal" dedicated worker thread would also solve the possible MT problems. – Phasia 14/10, 2010 at 3:16

@MichaelGoldshteyn: do you have any idea how to do this in Python? – Chaffer 31/10, 2015 at 17:39

I used the above "terrible" design in a VOIP app I made. It worked very well ... absolutely no latency or missed/dropped packets for locally connected computers. Each time a data packet arrived in, a thread was created and handed that data to process it to the output devices. Of course the packets were large so it caused no bottleneck. Meanwhile the main thread could loop back to wait and receive another incoming packet.

I have tried other designs where the threads I need are created in advance, but this creates its own problems. First, you need to design your code properly for threads to retrieve the incoming packets and process them in a deterministic fashion. If you use multiple (pre-allocated) threads it's possible that the packets may be processed 'out of order'. If you use a single (pre-allocated) thread to loop and pick up the incoming packets, there is a chance that the thread might encounter a problem and terminate, leaving no threads to process any data.

Creating a thread to process each incoming data packet works very cleanly, especially on multi-core systems and where incoming packets are large. Also to answer your question more directly, the alternative to thread creation is to create a run-time process that manages the pre-allocated threads. Being able to synchronize data hand-off and processing as well as detecting errors may add just as much, if not more overhead as just simply creating a new thread. It all depends on your design and requirements.

Outage answered 7/3, 2017 at 20:59 Comment(1)

This makes no sense. If a thread crashes your whole app is going down, unless on your OS it's catchable, in which case you can take that opportunity to spawn a new thread. If it's not an outright crash, but an exception/error-code, the thread can catch it and take appropriate action. – Annihilator 14/12, 2022 at 23:2

Thread creation and computing in a thread is pretty expensive: all data structures need to be set up, the thread registered with the kernel and a thread switch must occur so that the new thread actually gets executed (in an unspecified and unpredictable time). Executing thread.start does not mean that the thread main function is called immediately.

As the article (mentioned by typoking) points out creation of a thread is cheap only compared to the creation of a process. Overall, it is pretty expensive.

I would never use a thread

for a short computation
a computation where I need the result in my flow of code (that is, where I start the thread and then wait for it to return the result of its computation)

In your example, it would make sense (as has already been pointed out) to create a thread that handles all of the serial communication and is eternal.

Guntar answered 14/10, 2010 at 4:47 Comment(2)

The downvotes are baffling. Understanding the cost of context switches is core to any discussion of threading costs. – Dynasty 21/4, 2015 at 19:21

Your second bullet seems to rule out async here - although I guess that's because async is C++11 and your answer was written in '10. – Jaleesa 4/10, 2018 at 12:57

For comparison, take a look at OS X: Link

Kernel data structures: approximately 1 KB
Stack space: 512 KB (secondary threads), 8 MB (OS X main thread), 1 MB (iOS main thread)
Creation time: approximately 90 microseconds

POSIX thread creation should also be around this figure (not far from it), I guess.

Handfast answered 28/1, 2014 at 4:43 Comment(0)

It is indeed very system-dependent. I tested @Nafnlaus's code:

#include <thread>

int main(int argc, char** argv)
{
  for (volatile int i = 0; i < 500000; i++)
    std::thread([](){}).detach();
  return 0;
}

On my Desktop Ryzen 5 2600:

Windows 10, compiled with MSVC 2019 in release mode, adding std::chrono calls around the loop to time it. Idle (only Firefox with 217 tabs):

It took around 20 seconds (20.274, 19.910, 20.608) (also ~20 seconds with Firefox closed)

Ubuntu 18.04 compiled with:

g++ main.cpp -std=c++11 -lpthread -O3 -o thread

timed with:

time ./thread

It took around 5 seconds (5.595, 5.230, 5.297)

The same code on my Raspberry Pi 3B compiled with:

g++ main.cpp -std=c++11 -lpthread -O3 -o thread

timed with:

time ./thread

took around 15 seconds (16.225, 14.689, 16.235)

Endocarditis answered 10/6, 2020 at 19:2 Comment(0)

On any sane implementation, the cost of thread creation should be proportional to the number of system calls it involves, and on the same order of magnitude as familiar system calls like open and read. Some casual measurements on my system showed pthread_create taking about twice as much time as open("/dev/null", O_RDWR), which is very expensive relative to pure computation but very cheap relative to any IO or other operations which would involve switching between user and kernel space.
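A rough way to repeat that kind of casual measurement is sketched below. This is not the code used for the figures above: the helper names are invented, it is POSIX-only, and it times create-plus-join against open-plus-close, so the ratio it yields will not match a measurement of pthread_create alone.

```cpp
#include <cassert>
#include <chrono>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

// Thread body that returns immediately.
static void* do_nothing(void*) { return nullptr; }

// Hypothetical helper: runs `op` `count` times and returns the mean
// wall-clock cost per call in microseconds.
template <typename Op>
double mean_cost_us(int count, Op op)
{
    assert(count > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < count; ++i) op();
    const auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration<double, std::micro>(elapsed).count() / count;
}

// Mean cost of pthread_create followed by pthread_join.
double mean_pthread_pair_us(int count)
{
    return mean_cost_us(count, [] {
        pthread_t tid;
        if (pthread_create(&tid, nullptr, do_nothing, nullptr) == 0)
            pthread_join(tid, nullptr);
    });
}

// Mean cost of open("/dev/null", O_RDWR) followed by close.
double mean_open_pair_us(int count)
{
    return mean_cost_us(count, [] {
        const int fd = open("/dev/null", O_RDWR);
        if (fd >= 0) close(fd);
    });
}
```

Dividing mean_pthread_pair_us(10000) by mean_open_pair_us(10000) gives a ratio for one machine; treat it as an order-of-magnitude check rather than a precise figure.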

Precept answered 24/3, 2011 at 16:47 Comment(1)

In my case, this would involve creating thousands of threads. Is there a way to avoid that in Python? – Chaffer 31/10, 2015 at 17:45

Interesting.

I tested with my FreeBSD PCs and got the following results:

FreeBSD 12-STABLE, Core i3-8100T, 8GB RAM: 9.523sec
FreeBSD 12.1-RELEASE, Core i5-6600K, 16GB RAM: 8.045sec

You need to do

sysctl kern.threads.max_threads_per_proc=500100

though.

The Core i3-8100T is pretty slow, but the results are not very different. Rather, the CPU clocks seem to be more relevant: i3-8100T at 3.1GHz vs i5-6600K at 3.5GHz.

Gambeson answered 24/7, 2020 at 20:41 Comment(0)

As others have mentioned, this seems to be very OS-dependent. On my Core i5-8350U running Win10, it took 118 seconds, which indicates an overhead of around 237 µs per thread (I suspect that the virus scanner and all the other rubbish IT installed is really slowing it down too). A dual core Xeon E5-2667 v4 running Windows Server 2016 took 41.4 seconds (82 µs per thread), but it's also running a lot of IT garbage in the background, including the virus scanner. I think a better approach is to implement a queue with a thread that continuously processes whatever is in the queue, to avoid the overhead of creating and destroying a thread every time.

Laporte answered 27/6, 2021 at 19:38 Comment(0)
