How to measure multithreaded process time in a multitasking environment?

I am running performance evaluation tests of my multithreaded program in a (preemptive) multitasking, multicore environment, so the process can get swapped out periodically. I want to compute the latency, i.e., only the duration during which the process was active. This would let me extrapolate how the program would perform in a non-multitasking environment, i.e., one where only one program is running (most of the time), or under different workloads.

Usually two kinds of time are measured:

Wall-clock time (i.e., the time elapsed since the process started), but this includes the time when the process was swapped out.
Processor time (i.e., the sum total of CPU time used by all threads), but this is not useful for computing the latency of the process.
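
To make the two quantities concrete, here is a minimal Python sketch that reads both around a workload. `measure` and `sample_workload` are names made up for this illustration. In C on Linux, the corresponding clocks would be `CLOCK_MONOTONIC` and `CLOCK_PROCESS_CPUTIME_ID` via `clock_gettime`.

```python
import time

def measure(workload):
    """Run workload() once and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # monotonic wall-clock timer
    cpu_start = time.process_time()    # user + system CPU time of the whole process
    workload()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu

def sample_workload():
    # Hypothetical workload: some computation, then a period spent off the CPU.
    sum(i * i for i in range(200_000))
    time.sleep(0.05)

wall, cpu = measure(sample_workload)
print(f"wall-clock: {wall:.3f} s, process CPU: {cpu:.3f} s")
```

The sleep counts toward wall-clock time but not toward CPU time, which is the same gap a swapped-out process would show.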

I believe what I need is the makespan of the individual threads' times, which can differ from the maximum CPU time used by any thread because of the task dependency structure among the threads. For example, in a process with 2 threads, thread 1 is heavily loaded during the first two-thirds of the process's runtime (for CPU time t), while thread 2 is loaded during the last two-thirds (again, for CPU time t). In this case:

wall-clock time would return 3t/2 + context switch time + time used by other processes in between,
max CPU time of all threads would return a value close to t, and
total CPU time is close to 2t.
What I hope to measure is the makespan, i.e., 3t/2.
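
As an illustration of this definition only, the sketch below computes the three numbers for the two-thread example. It assumes the busy intervals of each thread are already known (for instance from a trace), which is exactly the hard part in practice. `union_length` and the interval values are hypothetical.

```python
def union_length(intervals):
    """Total length covered by a collection of (start, end) intervals."""
    covered = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the merged run and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current merged run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered

t = 1.0
# Hypothetical busy intervals, with time spent swapped out already removed.
busy = {
    "thread 1": [(0.0, t)],            # first two-thirds of the run
    "thread 2": [(t / 2, 3 * t / 2)],  # last two-thirds of the run
}
per_thread = {name: sum(end - start for start, end in ivs) for name, ivs in busy.items()}
max_cpu = max(per_thread.values())
total_cpu = sum(per_thread.values())
makespan = union_length([iv for ivs in busy.values() for iv in ivs])
print(max_cpu, total_cpu, makespan)  # 1.0 2.0 1.5
```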

Furthermore, multithreading brings nondeterminism of its own. This can probably be handled by running the test multiple times and summarizing the results.

Moreover, the latency also depends on how the OS schedules the threads; things get more complicated if some threads of a process wait for the CPU while others run. But let's set this aside.

Is there an efficient way to compute or approximate this makespan? For code examples, please use any programming language, but preferably C or C++ on Linux.

PS: I understand this definition of makespan is different from what is used in scheduling problems. The definition used in scheduling problems is similar to wall-clock time.

Reformulation of the Question

I have written a multi-threaded application which takes X seconds to execute on my K-core machine.

How do I estimate how long the program will take to run on a single-core computer?

Empirically

The obvious solution is to get a computer with one core, run your application on it, and measure wall-clock time and/or CPU time as you wish.

...Oh, wait, your computer already has one core (it also has some others, but we won't need to use them).

How to do this will depend on the Operating System, but one of the first results I found from Google explains a few approaches for Windows XP and Vista.

http://masolution.blogspot.com/2008/01/how-to-use-only-one-core-of-multi-core.html

Following that you could:

Set your application's process affinity to a single core (you can also do this in your code).
Boot your operating system so that it only knows about one of your cores (and switch back afterwards).
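
Since the question is about Linux, here is a minimal sketch of the first option done from code, in Python. `pin_to_one_core` is a made-up helper name, and picking the lowest-numbered allowed CPU is an arbitrary choice. The same effect is available from the shell with `taskset -c 0 ./your_app` (hypothetical binary name), or from C with `sched_setaffinity`.

```python
import os

def pin_to_one_core():
    """Restrict the caller to a single CPU and return that CPU's index (Linux only)."""
    allowed = os.sched_getaffinity(0)   # 0 refers to the caller
    cpu = min(allowed)                  # arbitrary choice: lowest-numbered allowed CPU
    os.sched_setaffinity(0, {cpu})
    return cpu

cpu = pin_to_one_core()
print(f"pinned to CPU {cpu}; affinity is now {os.sched_getaffinity(0)}")
```

Call this before starting any worker threads: new threads inherit the affinity mask of the thread that creates them, while threads that already exist keep their own.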

Independent Parallelism

Estimating this analytically requires knowledge about your program, the method of parallelism, etc.

As a simple example, suppose I write a multi-threaded program that calculates the ten billionth decimal digit of pi and the ten billionth decimal digit of e.

My code looks like:

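using System.Threading.Tasks;

public static void Main()
{
    // Each task wraps one of the two independent computations.
    Task t1 = new Task( calculatePiDigit );
    Task t2 = new Task( calculateEDigit );
    t1.Start();
    t2.Start();
    Task.WaitAll( t1, t2 );   // block until both tasks have finished
}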

And the happens-before graph looks like:

[Image not available: happens-before graph of the two tasks]

Clearly these are independent.

In this case

Time calculatePiDigit() by itself.
Time calculateEDigit() by itself.
Add the times together.
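
A minimal Python sketch of those three steps, assuming the two computations can be called in isolation. `calculate_pi_digit` and `calculate_e_digit` below are cheap hypothetical stand-ins, and taking the median of a few runs is just one way to damp run-to-run noise.

```python
import statistics
import time

def median_seconds(fn, repeats=5):
    """Median wall-clock duration of fn() over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Hypothetical stand-ins for the two independent computations.
def calculate_pi_digit():
    sum(i * i for i in range(300_000))

def calculate_e_digit():
    sum(i * 3 for i in range(200_000))

t_pi = median_seconds(calculate_pi_digit)
t_e = median_seconds(calculate_e_digit)
estimate = t_pi + t_e   # independent tasks: single-core time is the sum
print(f"pi: {t_pi:.4f} s, e: {t_e:.4f} s, single-core estimate: {estimate:.4f} s")
```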

2-Stage Pipeline

When the tasks are not independent, you won't be able to just add the individual times together.

In this next example, I create a multi-threaded application that takes 10 images, converts them to grayscale, and then runs a line detection algorithm on them. For some external reason, the images must not be processed out of order. Because of this, I use a pipeline pattern.

My code looks something like this:

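using System.Collections.Concurrent;
using System.Threading.Tasks;

// One queue between each pair of adjacent pipeline stages.
static ConcurrentQueue<Image> originalImages = new ConcurrentQueue<Image>();
static ConcurrentQueue<Image> grayscaledImages = new ConcurrentQueue<Image>();
static ConcurrentQueue<Image> completedImages = new ConcurrentQueue<Image>();

public static void Main()
{
    // PipeLineStage is assumed to be a Task-like class: it reads from its input
    // queue, applies the given function, and writes to its output queue.
    PipeLineStage p1 = new PipeLineStage(originalImages, grayScale, grayscaledImages);
    PipeLineStage p2 = new PipeLineStage(grayscaledImages, lineDetect, completedImages);

    p1.Start();
    p2.Start();

    originalImages.Enqueue( image1 );
    originalImages.Enqueue( image2 );
    //... 
    originalImages.Enqueue( image10 );

    // Pseudocode: signal the stages that no more images will arrive.
    originalImages.Enqueue( CancellationToken );

    Task.WaitAll( p1, p2 );
}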

A data centric happens-before graph:

[Image not available: data-centric happens-before graph of the pipeline]

If this program had been designed as a sequential program to begin with, it would be more efficient, for cache reasons, to take one image at a time and carry it all the way through to completed before moving on to the next image.

Anyway, we know that GrayScale() will be called 10 times and LineDetection() will be called 10 times, so we can time each function independently, multiply each time by 10, and add the results.
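
As a worked example of that arithmetic, here is a small Python sketch. The per-image timings are invented numbers for illustration, not measurements, and `sequential_estimate` is a hypothetical helper.

```python
def sequential_estimate(per_item_stage_times, n_items):
    """Single-core estimate: every item passes through every stage, one at a time."""
    return n_items * sum(per_item_stage_times)

# Hypothetical per-image timings in seconds (not measured).
t_grayscale = 0.040
t_line_detect = 0.110

estimate = sequential_estimate([t_grayscale, t_line_detect], n_items=10)
print(f"estimated single-core time: {estimate:.2f} s")  # 1.50 s
```

This ignores queue overhead and cache effects, both of which are discussed next.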

But what about the costs of pushing/popping/polling the ConcurrentQueues?

Assuming the images are large, that time will be negligible.

Even if there are millions of small images, with many consumers at each stage, you will probably find that the overhead of waiting on locks, mutexes, etc. is very small when the program is run sequentially (assuming that little work is done inside the critical sections, such as those inside the concurrent queue).

Costs of Context Switching?

Take a look at this question:

How to estimate the thread context switching overhead?

Basically, you will have context switches in multi-core environments and in single-core environments.

The overhead of a single context switch is quite small, but context switches occur many times per second.

The danger is that the cache gets fully disrupted between context switches.

For example, ideally:

image1 gets loaded into the cache as a result of doing GrayScale
LineDetection will run much faster on image1, since it is in the cache

However, this could happen:

image1 gets loaded into the cache as a result of doing GrayScale
image2 gets loaded into the cache as a result of doing GrayScale
now pipeline stage 2 runs LineDetection on image1, but image1 isn't in the cache anymore.
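
To get a rough idea of how often your process is actually switched out, you can read the context-switch counters the kernel keeps per process. Below is a minimal Python sketch using `resource.getrusage`; the workload is a hypothetical stand-in. In C, `getrusage` exposes the same `ru_nvcsw` and `ru_nivcsw` fields.

```python
import resource
import time

def context_switches():
    """Return (voluntary, involuntary) context-switch counts for this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw

vol_before, invol_before = context_switches()

# Hypothetical workload.
time.sleep(0.01)                      # blocking call: gives up the CPU voluntarily
sum(i * i for i in range(500_000))    # CPU-bound: may be preempted by the scheduler

vol_after, invol_after = context_switches()
vol = vol_after - vol_before
invol = invol_after - invol_before
print(f"voluntary switches: {vol}, involuntary switches: {invol}")
```

The counters only say how often switches happened, not how much cache state each one cost, so treat them as a hint rather than a measurement of the overhead.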

Conclusion

Nothing beats timing in the same environment the program will run in.

Next best is to simulate that environment as well as you can.

Regardless, understanding your program's design should give you an idea of what to expect in a new environment.