How is it possible that BITWISE AND operation to take more CPU clocks than ARITHMETIC ADDITION operation in a C program?
Asked Answered
E

3

1

I wanted to test if bitwise operations really are faster to execute than arithmetic operation. I thought they were.

I wrote a small C program to test this hypothesis and to my surprise the addition takes less on average than bitwise AND operation. This is surprising to me and I cannot understand why this is happening.

From what I know for addition the carry from the less significant bits should be carried to the next bits because the result depends on the carry too. It does not make sense to me that a logic operator is slower than addition.

My cod is below:

#include<stdio.h>
#include<time.h>

int main() 
{
   int x=10;
   int y=25;
   int z=x+y;
   printf("Sum of x+y = %i", z);
   time_t start = clock();
   for(int i=0;i<100000;i++)z=x+y;
   time_t stop = clock();

   printf("\n\nArithmetic instructions take: %d",stop-start);
   start = clock();
   for(int i=0;i<100000;i++)z=x&y;
   stop = clock();

   printf("\n\nLogic instructions take: %d",stop-start);
}

Some of the results:

Arithmetic instructions take: 327
Logic instructions take: 360

Arithmetic instructions take: 271
Logic instructions take: 271

Arithmetic instructions take: 287
Logic instructions take: 294

Arithmetic instructions take: 279
Logic instructions take: 266

Arithmetic instructions take: 265
Logic instructions take: 296

These results are taken from consecutive runs of the program.

As you can see on average the logic operator takes longer on average than the arithmetic operator.

Enwrap answered 27/9, 2017 at 7:44 Comment(10)
Considering that with any reasonable optimizations the compiler can throw away both loops, I would suspect the measurement to be wrong, not the actual operations. If you look at instruction timings for at least most x86 CPUs, AND and ADD instructions take exactly the same amount of time.Narcosis
Measuring performance correctly is difficult. I would expect the two operations to take the same amount of time. You are measuring over a very short time. Try measuring over at least a second.Koto
The method of measurement is somewhat flawed for these examples. For one, the number of iterations is way too short. But - if you are compiling without optimizations, it is not very interesting to compare performance here, as the compiler produces quite dumb code that is in no way usable to compare performance of 2 loops. On the other hand, if you enable optimization, the compiler might just eliminate the loops, and you have no basis for comparison. Make sure you read the generated assembly code, and understand what you are measuring.Telugu
@Telugu the loops are not eliminated. If I increase the number of iterations the timing increases as well.Enwrap
@JenniferAnderson either you didn't read the comments completely, or you did not understand them. Turn ON optimization, and they will. iterate over a billion times. you will see they take about the same amount of timeEmplace
@Emplace I know that if I turned ON the optimization the loops will not run. However user Art said that the compiler "optimizes the loop away" which is NOT the case if the optimization flags are not used.Enwrap
@JenniferAnderson I said "with any reasonable optimizations the compiler can throw away both loops". All those words are relevant. You can't make up a quote and use that to argue the exact opposite of what the person you're allegedly quoting said.Narcosis
You didn't specify what CPU are you measuring. But with any modern CPU (either x86 or ARM) the and and add are of the same speed. If you measure anything different, your measuring+measured code is flawed, which wouldn't be of any surprise, as measuring performance on x86 with artificial code is very difficult and there's ton of tiny details which may skew the results either way. Reading Agner Fog articles on performance of x86 instructions would give you lot more accurate data, although it's sadly missing that feel of exploration with your own code.Calvados
there are a number of factors that could come into play here. but basically the test is flawed. you didnt show the disassembly, and even that may not tell the whole story, could have the same machine code in each loop with one being and and the other being add and they could still differ in speed for system and other reasons, you are just barely scratching the surface. Bitwise is clearly faster yes, but designs do not take advantage of that the one clock for that alu operation is long enough to cover addition too (not necessarily multiply and divide though, depends on the implementation).Mapel
It is certainly possible for some code to produce faster performance with add rather than the same code with and since on x86 add can use also use lea, which allows 3 inputs (two reg and one constant) and a distinct destination. On the other hand, add and and are both 2-input and the destination is shared with one input. For some code using lea can be faster.Ostrander
E
10

ok, let's take this "measuring" and blow it up, 100k is a bit few

#include<stdio.h>
#include<time.h>
#define limit 10000000000

int main() 
{
   int x=10, y=25, z;

   time_t start = clock();
   for(long long i=0;i<limit;i++)z=x+y;
   time_t stop = clock();
   printf("Arithmetic instructions take: %ld\n",stop-start);

   start = clock();
   for(long long i=0;i<limit;i++)z=x&y;
   stop = clock();
   printf("Logic instructions take: %ld\n",stop-start);
}

this will run a bit longer. First let's try with no optimization:

thomas@TS-VB:~/src$ g++ -o trash trash.c 
thomas@TS-VB:~/src$ ./trash 
Arithmetic instructions take: 21910636
Logic instructions take: 21890332

you see, both loops take about the same time.

compiling with -S reveals why (only relevant part of .s file shown here) :

// this is the assembly for the first loop
.L3:
    movl    32(%esp), %eax
    movl    28(%esp), %edx
    addl    %edx, %eax             // <<-- ADD
    movl    %eax, 40(%esp)
    addl    $1, 48(%esp)
    adcl    $0, 52(%esp)
.L2:
    cmpl    $2, 52(%esp)
    jl  .L3
    cmpl    $2, 52(%esp)
    jg  .L9
    cmpl    $1410065407, 48(%esp)
    jbe .L3

// this is the one for the second
.L9:
    movl    32(%esp), %eax
    movl    28(%esp), %edx
    andl    %edx, %eax             // <<--- AND
    movl    %eax, 40(%esp)
    addl    $1, 56(%esp)
    adcl    $0, 60(%esp)
.L5:
    cmpl    $2, 60(%esp)
    jl  .L6
    cmpl    $2, 60(%esp)
    jg  .L10
    cmpl    $1410065407, 56(%esp)
    jbe .L6
.L10:

looking into the CPUs instruction set tells us, both ADD and AND will take the same amount of cycles --> the 2 loops will run the same amount of time

Now with optimization:

thomas@TS-VB:~/src$ g++ -O3 -o trash trash.c 
thomas@TS-VB:~/src$ ./trash 
Arithmetic instructions take: 112
Logic instructions take: 74

The loop has been optimized away. The value calculated is never needed, so the compiler decided to not run it at all

conclusion: If you shoot 3 times into the forest, and hit 2 boars and 1 rabbit, it doesn't mean there a twice as much boars than rabbits inside

Emplace answered 27/9, 2017 at 8:36 Comment(3)
That limit literal should have a ll suffix.Embed
@unwind: it would not help, as it would make it look deceptively similar to 1000000000011. The LL suffix is not needed, 10000000000 is a long long literal unless type int or type long are wide enough to store this value. Very old compilers may parse it differently, but chances are they do not support long long at all.Partan
@Partan D'oh, you're right of course. Nevermind. Thanks.Embed
N
5

Let's start by looking at your code. The loops don't actually do anything. Any reasonable compiler will see that you don't use the variable z after the first call to printf, so it is perfectly safe to throw away. Of course, the compiler doesn't have to do it, but any reasonable compiler with any reasonable levels of optimisations enabled will do so.

Let's look at what a compiler did to your code (standard clang with -O2 optimisation level):

    leaq    L_.str(%rip), %rdi
    movl    $35, %esi
    xorl    %eax, %eax
    callq   _printf

This is the first printf ("Sum of ..."), notice how the generated code didn't actually add anything, the compiler knows the values of x and y and just calculated its sum and calls printf with 35.

    callq   _clock
    movq    %rax, %rbx
    callq   _clock

Call clock, save its result in a temporary register, call clock again,

    movq    %rax, %rcx
    subq    %rbx, %rcx
    leaq    L_.str.1(%rip), %rdi
    xorl    %eax, %eax
    movq    %rcx, %rsi

Subtract start from end, set up arguments for printf,

    callq   _printf

Call printf.

The second loop is removed in an equivalent way. There are no loops because the compiler does the reasonable thing - it notices that z is not used after you modify it in the loop, so the compiler throws away any stores to it. And then since nothing is stored into it, it can also throw away x+y. And now, since the body of the loop is not doing anything, the loop can be thrown away. So your code essentially becomes:

printf("\n\nArithmetic instructions take: %d", clock() - clock());

Now, why is this relevant. It is relevant to understand some important concepts. The compiler doesn't slavishly translate one statement at a time into code. The compiler reads all (or as much as possible) of your code, tries to figure out what you actually mean and then generates code that behaves "as if" it executed all those statements. The language and compiler only care about preserving what we could call observable side effects. If computing a value is not observable, it doesn't need to be computed. The time to execute some code is not a side effect we care about, so the compiler doesn't care about preserving it, after all we want our code to be as fast as possible, so we'd like the time to execute something to be not observable at all.

The second part why this is relevant. It is pretty useless to measure how long something takes if you compiled it without optimisations. This is the loop in your code compiled without optimisations:

LBB0_1:
        cmpl    $100000, -28(%rbp)
        jge     LBB0_4
        movl    -8(%rbp), %eax
        addl    -12(%rbp), %eax
        movl    %eax, -16(%rbp)
        movl    -28(%rbp), %eax
        addl    $1, %eax
        movl    %eax, -28(%rbp)
        jmp     LBB0_1
LBB0_4:

You thought you were measuring the addl instruction here. But the whole loop contains so much more. In fact, most of the time in the loop is spent maintaining the loop, not actually executing your instruction. Most of the time is spent reading and writing values on the stack and calculating the loop variable. Any timing you happen to measure will be totally dominated by the infrastructure of the loop rather than operation you want to measure.

You loop very few times. I'm pretty certain that most of the time you actually measure is spent in clock() and not the code you're actually trying to measure. clock needs to do quite some work, reading time is often quite expensive.

Then we get to the question of the actual instructions you care about. They take the exact same amount of time. Here's the canonical source of everything related to instruction timing on x86.

But. It's very hard and almost pretty useless to reason about individual instructions. Pretty much every CPU in the past few decades is superscalar. This means that it will execute many instructions at the same time. What matters for how much time things take is more dependencies between instructions (can't start executing an instruction before its inputs are ready if those inputs are computed by previous instructions) rather than the actual instruction. While you can do dozens of calculations in registers per nanosecond, it can take hundreds of nanoseconds to fetch data from main memory. So even if we know that an instruction takes one cycle and your CPU does two cycles per nanosecond (it's usually around that), it can mean that the number of instructions we can finish in 100ns can be anywhere between 1 (if you need to wait for main memory) and 12800 (I don't know the real exact numbers, but I remember that Skylake can retire 64 floating point operations per cycle).

This is why microbenchmarks aren't seriously done anymore. If minor variations in how things are done can affect the outcome by a factor of twelve thousand, you can quickly see why measuring individual instructions is useless. Most measurements today are done on large parts of programs or whole programs. I do this a lot at work and I've had several situations where improving an algorithm changed memory access patterns that even though the algorithm could be mathematically proven faster, the behavior of the whole program suffered because of changed memory access patterns or such.

Sorry for such a rambling answer, but I'm trying to get to why even though there is a simple answer to your question: "your measurement method is bad" and also a real answer: "they are the same", there are actually interesting reasons why the question itself is unanswerable.

Narcosis answered 27/9, 2017 at 12:30 Comment(8)
To understand and make use of Agner Fog's instruction tables, you have to read his micorarch pdf as well. Different CPUs have different front-end bottlenecks. agner.org/optimize.Byington
12800 is nonsense. A 256b FMA instruction on 8 packed floats could count as 16 FLOPs, but it still retires as one uop. SKL's FMA throughput is 2 per clock. And you'd have to be using 256b loads to feed it, so that would be 16 (in your counting) operations for one memory latency wait time. Anyway, unless your loads are dependent, there will be multiple in flight, so it's never that bad.Byington
@PeterCordes I know, I wasn't aiming for the exact truth there. Using floating point instructions here was a complete lie because the number 64 was the only number I remembered and it is how many can be retired at the same time, not how long they actually take to decode and execute. The main point there is: there are many orders of magnitude between the best case and worst case of number of instructions per second and measuring individual instructions out of context doesn't really mean anything. Which instructions should I have used instead?Narcosis
But it's not 64 unless you count each SIMD element as a separate instruction, and that's nonsense. Integer ADD would have been a good choice, since that's what the OP is asking for. It has 4 per clock throughput on Intel Haswell and later, and AMD Ryzen. And 1c latency, so if you bottleneck on latency it has 1/4 the throughput. It makes little sense to talk about the total time to run one instruction, because out-of-order execution doesn't work like that. A cache miss can add latency to a dependency chain sure, but it's just not the right way to think about it.Byington
A single instruction or short sequence can be mostly characterized by 3 parameters: front-end uop count, latency, and which execution ports it needs. (Not throughput, unless you're actually repeating that block with no other code mixed in.) Microbenchmarks are definitely still a thing, though, if you write them in asm and construct them carefully to measure what you want to measure. e.g. #45660639. It helps a lot to use perf counters to count uops as well as cycles.Byington
"But it's not 64 unless you count each SIMD element as a separate instruction". Isn't that what Intel does in their marketing materials? I'm pretty sure that's where my brain remembered this number from.Narcosis
Stop reading marketing crap and start reading software.intel.com/en-us/articles/intel-sdm#optimization. It's interesting to talk about sustained FMA throughput in floats per clock (bottlenecked by execution unit throughput), but it makes no sense to talk about burst retirement of FMA instructions in elements per clock, only in uops per clock just like any other uops.Byington
SKX (Skylake-AVX512) and KNL can each sustain two 16-float FMAs per clock. You could count that as 64 FLOP/cycle. (But KNL has no room in the pipeline for anything else, so it usually bottlenecks on the front-end.)Byington
M
4

This is just a few minutes worth of work, would rather also demonstrate bare metal and other such things but not worth the time right now.

Through testing of some functions to see what the calling convention is and also noting that for add it is generating

  400600:   8d 04 37                lea    (%rdi,%rsi,1),%eax
  400603:   c3                      retq  

for and

  400610:   89 f8                   mov    %edi,%eax
  400612:   21 f0                   and    %esi,%eax
  400614:   c3                      retq 

three instructions instead of two, five bytes instead of four, these bits if information both do and dont matter depending. But to make it more fair will do the same thing for each operation.

Also want the do it a zillion times loop closely coupled and not compiled as that may end up creating some variation. And lastly the alignment try to make that fair.

.balign 32
nop
.balign 256
.globl and_test
and_test:
    mov %edi,%eax
    and %esi,%eax
    sub $1,%edx
    jne and_test
    retq

.balign 32
nop
.balign 256
.globl add_test
add_test:
    mov %edi,%eax
    add %esi,%eax
    sub $1,%edx
    jne add_test
    retq

.balign 256
    nop

derived from yours

#include<stdio.h>
#include<time.h>
unsigned int add_test ( unsigned int a, unsigned int b, unsigned int x );
unsigned int and_test ( unsigned int a, unsigned int b, unsigned int x );
int main() 
{
   int x=10;
   int y=25;
   time_t start,stop;
   for(int j=0;j<10;j++)
   {
       start = clock();
       add_test(10,25,2000000000);
       stop = clock();
       printf("%u %u\n",j,(int)(stop-start));

   }
   for(int j=0;j<10;j++)
   {
       start = clock();
       and_test(10,25,2000000000);
       stop = clock();
       printf("%u %u\n",j,(int)(stop-start));

   }
   return(0);
}

first run as expected the first loop took longer as it wasnt in the cache? should not have taken that much longer so that doesnt make sense, perhaps other reasons...

0 605678
1 520204
2 521311
3 520050
4 521455
5 520213
6 520315
7 520197
8 520253
9 519743
0 520475
1 520221
2 520109
3 520319
4 521128
5 520974
6 520584
7 520875
8 519944
9 521062

but we stay fairly consistent. second run, the times stay somewhat consistent.

0 599558
1 515120
2 516035
3 515863
4 515809
5 516069
6 516578
7 516359
8 516170
9 515986
0 516403
1 516666
2 516842
3 516710
4 516932
5 516380
6 517392
7 515999
8 516861
9 517047

note that this is 2 billion loops. four instructions per. my clocks per second is 1000000 at 3.4ghz 0.8772 clocks per loop or 0.2193 clocks per instruction, how is that possible? superscaler processor.

A LOT more work could be done, here, this was just a few minutes worth and hopefully it is just enough to demonstrate (as others have already as well) that you cant really see the difference with a test like this.

I could do a demo with something more linear like an arm and something we could read the clock/timer register as part of the code under test, as the calling of the clock code are all part of the code under test and can vary here. Hopefully that is not necessary, the results are much more consistent though using sram, controlling all the instructions under test, etc. and with that you can see alignment differences you can see the cost of the cache read on the first loop but not the remaining, etc...(a few clocks total although 10ms as we see here, hmm, might be on par for an x86 system dont know benchmarking x86 is a near complete waste of time, no fun in it and the results dont translate to other x86 computers that well)

As pointed out in your other question that was closed as a duplicate, and I hate using links here, should learn how to cut and paste pictures (TODO).

https://en.wikipedia.org/wiki/AND_gate
https://en.wikipedia.org/wiki/Adder_(electronics)

Assuming feeding the math/logic operation for add and and is the same and we are only trying to measure the difference between them you are correct the AND is faster not getting into further detail you can see an and only has one stage/gate. Where a full adder takes three levels, back of the envelope math, three times as much time to settle the signals once the inputs change than the AND....BUT....Although there are some exceptions, chips are not designed to take advantage of this (well multiply and divide vs add/and/xor/etc, yes they are or are more likely to be). One would design these simple operations to be one clock cycle, on a clock the inputs to the combinational logic (the actual AND or ADD) are latched, on the next clock cycle the result is latched from the other end and begins its journey to the register file or out the core to memory, etc...At some point in the design you do synthesis into the gates available for the foundry/process you are using then do a timing analysis/closure on that and look for long poles in the tent. Extremely unlikely (impossible) that the add is a long pole, both add and and are very very short poles, but you determine at that point what your max clock rate is, if you wanted a 4ghz procerssor but the result is 2.7 well you need to take those long poles and turn them into two or more clock operations. the time it takes to do an add vs and which should vary add should be longer, is so fast and in the noise that it is all within a clock cycle so even if you did a functional simulation of the logic design you would not see the difference you need to implement an and and a full adder in say pspice using transistors and other components, then feed step changes into the inputs and watch how long it takes to settle, that or build them from discrete components from radio shack and try, although the results might be too fast for your scope, so use pspice or other.

think of writing equations to solve something you can write maybe a long equation or you can break that into multiple smaller ones with intermediate variables

this
a = b+c+d+e+f+g;
vs
x=b+c;
y=d+e;
z=f+g;
a=x+y;
a=a+z;

one clock vs 5 clocks, but each of the 5 clocks can be faster if this was the longest pole in the tent. all the other logic is that much faster too feeding this. (actually x,y,z can be one clock, then either a=x+y+z in the next or make it two more)

multiply and divide are different simply because the logic explodes exponentially, there is no magic to multiply or divide they have to work the same way we do things on pencil and paper. you can take shortcuts with binary if you think about it. as you can only multiply by 0 or 1 before shifing and adding to the accumulator. the logic equations for one clock still explode exponentially and then you can do parallel things. it burns a ton of chip realestate, so you may choose to make multiply as well as divide take more than one clock and hide those in the pipeline. or you can choose to burn a significant amount of chip realestate...Look at the documentation for some of the arm cores you can at compile time (when you compile/synthesize the core) choose a one clock or multi-clock multiply to balance chip size vs performance. x86 we dont buy the IP and make the chips ourselves so it is up to intel how they do it, and very likely microcoded so by just microcoding you can tweak how things happen or do it in an alu type operation.

So you might detect multiply or divide performance vs add/and with a test like this but either they did it in one clock and you cant ever see it, or they may have buried it in two or more steps in the pipe so that it averages out right and to see it you would need access to a chip sim. using timers and running something a billion times is fun, but to actually see instruction performance you need a chip sim and need to tweak the code to keep the code not under test from affecting the results.

Mapel answered 27/9, 2017 at 15:12 Comment(3)
On Intel x86 chips, integer div is microcoded (which can lead to weird performance effects), but integer mul is implemented directly in hardware on all Intel/AMD CPUs. Typically 3c latency, one per clock throughput, even for 64-bit operand-size. (Yes, this takes a lot of transistors).Byington
Interestingly, FP div is not microcoded, at least not in the usual sense (it's iterative inside a dedicated execution unit). It's not fully pipelined, but it is partially (e.g. Skylake can do a float division every 3 clocks, even without using SIMD, with latency=11c). If you can hide the latency of an FP div, an occasional FP div is no more expensive than an FP add or mul (as far as throughput), if you don't bottleneck on divider throughput. (https://mcmap.net/q/14541/-floating-point-division-vs-floating-point-multiplication).Byington
Yep, was trying to be general and just say that in addition to other factors microcoding adds even more to performance variations. x86 implementations keep changing so even if you could performance tune something it is really only tuned to your machine, put it on the same skylake on some other motherboard, or a sandy bridge or other and no reason to expect good or even similar performance. what they microcode or not is just part of the fun.Mapel

© 2022 - 2024 — McMap. All rights reserved.