Reformulation of the Question
I have written a multi-threaded application which takes X seconds to execute on my K-core machine.
How do I estimate how long the program will take to run on a single-core computer?
Empirically
The obvious solution is to get a computer with one core, and run your application, and use Wall-Clock time and/or CPU time as you wish.
...Oh, wait, your computer already has one core (it also has some others, but we won't need to use them).
How to do this will depend on the Operating System, but one of the first results I found from Google explains a few approaches for Windows XP and Vista.
http://masolution.blogspot.com/2008/01/how-to-use-only-one-core-of-multi-core.html
Following that you could:
- Assign your Application's process to a single core's affinity. (you can also do this in your code).
- Start your operating system only knowing about one of your cores. (and then switch back afterwards)
Independent Parallelism
Estimating this analytically requires knowledge about your program, the method of parallelism, etc.
As an simple example, suppose I write a multi-threaded program that calculates the ten billionth decimal digit of pi and the ten billionth decimal digit of e.
My code looks like:
public static int main()
{
Task t1 = new Task( calculatePiDigit );
Task t2 = new Task( calculateEDigit );
t1.Start();
t2.Start();
Task.waitall( t1, t2 );
}
And the happens-before graph looks like:
Clearly these are independent.
In this case
- Time calculatePiDigit() by itself.
- Time calculateEDigit() by itself.
- Add the times together.
2-Stage Pipeline
When the tasks are not independent, you won't be able to just add the individual times together.
In this next example, I create a multi-threaded application to: take 10 images, convert them to grayscale, and then run a line detection algorithm. For some external reason, every images are not allowed to be processed out of order. Because of this, I create a pipeline pattern.
My code looks something like this:
ConcurrentQueue<Image> originalImages = new ConcurrentQueue<Image>();
ConcurrentQueue<Image> grayscaledImages = new ConcurrentQueue<Image>();
ConcurrentQueue<Image> completedImages = new ConcurrentQueue<Image>();
public static int main()
{
PipeLineStage p1 = new PipeLineStage(originalImages, grayScale, grayscaledImages);
PipeLineStage p2 = new PipeLineStage(grayscaledImages, lineDetect, completedImages);
p1.Start();
p2.Start();
originalImages.add( image1 );
originalImages.add( image2 );
//...
originalImages.add( image10 );
originalImages.add( CancellationToken );
Task.WaitAll( p1, p2 );
}
A data centric happens-before graph:
If this program had been designed as a sequential program to begin with, for cache reasons it would be more efficient to take each image one at a time and move them to completed, before moving to the next image.
Anyway, we know that GrayScale() will be called 10 times and LineDetection() will be called 10 times, so we can just time each independently and then multiply them by 10.
But what about the costs of pushing/popping/polling the ConcurrentQueues?
Assuming the images are large, that time will be negligible.
If there are millions of small images, with many consumers at each stage, then you will probably find that the overhead of waiting on locks, mutexes, etc, is very small when a program is run sequentially (assuming that the amount of work performed in the critical sections is small, such as inside the concurrent queue).
Costs of Context Switching?
Take a look at this question:
How to estimate the thread context switching overhead?
Basically, you will have context switches in multi-core environments and in single-core environments.
The overhead to perform a context switch is quite small, but they also occur very many times per second.
The danger is that the cache gets fully disrupted between context switches.
For example, ideally:
- image1 gets loaded into the cache as a result of doing GrayScale
- LineDetection will run much faster on image1, since it is in the cache
However, this could happen:
- image1 gets loaded into the cache as a result of doing GrayScale
- image2 gets loaded into the cache as a result of doing GrayScale
- now pipeline stage 2 runs LineDetection on image1, but image1 isn't in the cache anymore.
Conclusion
Nothing beats timing on the same environment it will be run in.
Next best is to simulate that environment as well as you can.
Regardless, understanding your program's design should give you an idea of what to expect in a new environment.
c++
andlanguage-agnostic
;-) – MarismarisaC++
solution or a general one. In my opinion, thelanguage-agnostic
would be the better fit. – Marismarisa