measure time to execute single instruction

Asked 17/4, 2010 at 14:37 Answered 6/4, 2012 at 12:5

Is there a way using C or assembler or maybe even C# to get an accurate measure of how long it takes to execute a ADD instruction?

Clayborne answered 17/4, 2010 at 14:37 Comment(6)

I believe there is an article about this floating around in ACM land, I will check – Carbineer 17/4, 2010 at 14:42

The programming manual for your CPU will specify the number of cycles it takes to execute a specific instruction. – Athalie 17/4, 2010 at 15:5

That is a highly misleading metric, Neil. Modern CPUs are a whole lot more complicated than can be accounted for by mere cycle counting. Consider an ADD r32, r/m32 -- if you hit memory (the m32) you have cache issues to consider for example. L1 vs. L2 vs. hitting actual main memory gives you gigantic differences in execution speed (never mind virtual memory!). And this doesn't even introduce superscalar architectures or pipelining or a whole bunch of other issues deep down in the dank depths of the CPU. – Lifegiving 17/4, 2010 at 15:51

@Neil:not for most higher-performance CPUs it won't... – Mcmillon 17/4, 2010 at 15:53

Modern CPU performance manuals (like Intel's optimization manual) will list throughput and latency separately, so yes they do consider out-of-order exec, because yeah there's no "cycles" number you can just add up (What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?). uops.info even times reg,reg vs. memory-source vs. memory-destination forms for every instruction, with perf counters to measure uop counts and which execution ports they can run on. (Assuming L1d hit – Therapeutic 17/2, 2021 at 6:31

See also How many CPU cycles are needed for each assembly instruction? – Therapeutic 17/2, 2021 at 6:36

Yes, sort of, but it's non-trivial and produces results that are almost meaningless, at least on most reasonably modern processors.

On relatively slow processors (e.g., up through the original Pentium in the Intel line, still true on most small embedded processors) you can just look in the processor's data sheet and it'll (normally) tell you how many clock ticks to expect. Quick, simple, and easy.

On a modern desktop machine (e.g., Pentium Pro or newer), life isn't nearly that simple. These CPUs can execute a number of instructions at a time, and execute them out of order as long as there aren't any dependencies between them. This means the whole concept of the time taken by a single instruction becomes almost meaningless. The time taken to execute one instruction can and will depend on the instructions that surround it.

That said, yes, if you really want to, you can (usually -- depending on the processor) measure something, though it's open to considerable question exactly how much it'll really mean. Even getting a result like this that's only close to meaningless instead of completely meaningless isn't trivial though. For example, on an Intel or AMD chip, you can use RDTSC to do the timing measurement itself. That, unfortunately, can be executed out of order as described above. To get meaningful results, you need to surround it by an instruction that can't be executed out of order (a "serializing instruction"). The most common choice for that is CPUID, since it's one of the few serializing instructions that's available to "user mode" (i.e., ring 3) programs. That adds a bit of a twist itself though: as documented by Intel, the first few times the processor executes CPUID, it can take longer than subsequent times. As such, they recommend that you execute it three times before you use it to serialize your timing. Therefore, the general sequence runs something like this:

.align 16
CPUID
CPUID
CPUID
RDTSC
; sequence under test
Add eax, ebx
; end of sequence under test
CPUID
RDTSC

Then you compare that to a result from doing the same, but with the sequence under test removed. That's leaving out quite a fe details, of course -- at minimum you need to:

set the registers up correctly before each CPUID
save the value in EAX:EDX after the first RDTSC
subtract result from the second RDTSC from the first

Also note the "align" directive I've inserted -- instruction alignment can and will affect timing as well, especially if a loop is involved.

Mcmillon answered 17/4, 2010 at 15:11 Comment(2)

i tried to say that part, about the value of measurement, but you said it better. you get my +1! – Convalesce 17/4, 2010 at 15:19

The number you get from this means something, but it's not "the cost" of an add instruction. As you say, there's no such thing; it has a throughput separate from latency. See also How many CPU cycles are needed for each assembly instruction?. You should actually measure by creating a loop that bottlenecks on throughput or latency (whichever you want to measure) of that instruction, and time many iterations. That's what uops.info does (using perf counters for uops and core cycles, not RDTSC ref cycles). – Therapeutic 17/2, 2021 at 6:35

Construct a loop that executes 10 million times, with nothing in the loop body, and time that. Keep that time as the overhead required for looping.

Then execute the same loop again, this time with the code under test in the body. Time for this loop, minus the overhead (from the empty loop case) is the time due to the 10 million repetitions of your code under test. So, divide by the number of iterations.

Obviously this method needs tuning with regard to the number of iterations. If what you're measuring is small, like a single instruction, you might even want to run upwards of a billion iterations. If its a significant chunk of code, a few 10's of thousands might suffice.

In the case of a single assembly instruction, the assembler is probably the right tool for the job, or perhaps C, if you are conversant with inline assembly. Others have posted more elegant solutions for how to get a measurement w/o the repetition, but the repetition technique is always available, for example, an embedded processor that doesn't have the nice timing instructions mentioned by others.

Note however, that on modern pipeline processors, instruction level parallelism may confound your results. Because more than one instruction is running through the execution pipeline at a time, it is no longer true that N repetitions of an given instruction take N times as long as a single one.

Convalesce answered 17/4, 2010 at 15:3 Comment(0)

Okay, the problem that you are going to encounter if you are using an OS like Windows, Linux, Unix, MacOS, AmigaOS and all those others that there are lots of processes already running on your machine in the background which will impact performance. The only real way of calculating actual time of an instruction is to disassemble your motherboard and test each component using external hardware. It depends whether you absolutely want to do this yourself, or simply find out how fast a typical revision of your processor actually runs. Companies such as Intel and Motorola test their chips extensively before release, and these results are available to the public. All you need to do is ask them and they'll send you a free CD-ROM (it might be a DVD - nonsense pedantry) with the results contained. You can do it yourself, but be warned that especially Intel processors contain many redundant instructions that are no longer desirable, let alone necessary. This will take up a lot of your time, but I can absolutely see the fun in doing this. PS. If its purely to help push your own machine's hardware to its theoretical maximum in a personal project that you're doing the Just Jeff's answer above is excellent for generating tidy instruction-speed-averages under real-world conditions.

Cheeky answered 6/4, 2012 at 12:5 Comment(0)

-3

No, but you can calculate it based upon the number of clock cycles the add instruction requires multiplied by the clock rate of the CPU. Different types of arguments to an ADD may result in more or fewer cycles but, for a given argument list, the instruction always takes the same number of cycles to complete.

That said, why do you care?

Engross answered 17/4, 2010 at 14:46 Comment(4)

actually an ADD in computer land is always 1 register plus another. It would never be 7 registers at once, so thats a falsehood. – Carbineer 17/4, 2010 at 14:48

@WootMoo:That depends on the processor and instruction set involved. Just for example, the VAX has an "AddP6" instruction ("Add packed, 6 operand"). The x86 doesn't fit your description either. – Mcmillon 17/4, 2010 at 15:17

@Woot4Moo: You may want to consult the processor manuals you use before you talk about what an ADD "always is". The Intel 64-bit (non-Itanium) chips have 13 different ADD forms that can have memory as the destination of the operation and 5 forms that have memory as the source. Indeed the only combinations of operands not supported are immediate/immediate and memory/memory. – Lifegiving 17/4, 2010 at 15:56

I learn something new every day, for instance that a single thread can do multiple additions at once. – Carbineer 17/4, 2010 at 23:0

Recommended topics

Hot tags