This is just a few minutes' worth of work; I would rather also demonstrate bare metal and other such things, but it is not worth the time right now. Through testing of some functions to see what the calling convention is, I also noted that for add the compiler is generating
400600: 8d 04 37 lea (%rdi,%rsi,1),%eax
400603: c3 retq
for and
400610: 89 f8 mov %edi,%eax
400612: 21 f0 and %esi,%eax
400614: c3 retq
three instructions instead of two, five bytes instead of four. These bits of information both do and don't matter, depending. But to make it more fair I will do the same thing for each operation.
I also want the do-it-a-zillion-times loop tightly coupled and hand-written rather than compiler-generated, as the compiler's choices may end up creating some variation. And lastly, try to make the alignment fair.
.balign 32
nop
.balign 256
.globl and_test
and_test:
    mov %edi,%eax
    and %esi,%eax
    sub $1,%edx
    jne and_test
    retq

.balign 32
nop
.balign 256
.globl add_test
add_test:
    mov %edi,%eax
    add %esi,%eax
    sub $1,%edx
    jne add_test
    retq

.balign 256
nop
Derived from yours:
#include <stdio.h>
#include <time.h>

unsigned int add_test ( unsigned int a, unsigned int b, unsigned int x );
unsigned int and_test ( unsigned int a, unsigned int b, unsigned int x );

int main ( void )
{
    clock_t start, stop;
    for(int j = 0; j < 10; j++)
    {
        start = clock();
        add_test(10, 25, 2000000000);
        stop = clock();
        printf("%d %d\n", j, (int)(stop - start));
    }
    for(int j = 0; j < 10; j++)
    {
        start = clock();
        and_test(10, 25, 2000000000);
        stop = clock();
        printf("%d %d\n", j, (int)(stop - start));
    }
    return 0;
}
First run: as expected the first loop took longer, as it wasn't in the cache? It should not have taken that much longer, though, so that doesn't quite make sense; perhaps other reasons...
0 605678
1 520204
2 521311
3 520050
4 521455
5 520213
6 520315
7 520197
8 520253
9 519743
0 520475
1 520221
2 520109
3 520319
4 521128
5 520974
6 520584
7 520875
8 519944
9 521062
but we stay fairly consistent after that. Second run: the times again stay fairly consistent.
0 599558
1 515120
2 516035
3 515863
4 515809
5 516069
6 516578
7 516359
8 516170
9 515986
0 516403
1 516666
2 516842
3 516710
4 516932
5 516380
6 517392
7 515999
8 516861
9 517047
Note that this is 2 billion loops, four instructions per loop. My CLOCKS_PER_SEC is 1000000; at 3.4GHz that works out to 0.8772 clocks per loop, or 0.2193 clocks per instruction. How is that possible? Superscalar processor.
A LOT more work could be done here; this was just a few minutes' worth, and hopefully it is just enough to demonstrate (as others already have as well) that you can't really see the difference with a test like this.
I could do a demo on something more linear like an ARM, where we could read the clock/timer register as part of the code under test; here, the calls into the clock code are themselves part of the code under test and can vary. Hopefully that is not necessary. The results are much more consistent when using SRAM, controlling all the instructions under test, etc., and with that you can see alignment differences, you can see the cost of the cache read on the first loop but not the remaining ones, and so on. (A few clocks total there, versus the 10ms granularity we see here; hmm, might be on par for an x86 system, I don't know. Benchmarking x86 is a near complete waste of time: no fun in it, and the results don't translate that well to other x86 computers.)
As pointed out in your other question that was closed as a duplicate (and I hate using links here; I should learn how to cut and paste pictures (TODO)):
https://en.wikipedia.org/wiki/AND_gate
https://en.wikipedia.org/wiki/Adder_(electronics)
Assuming the logic feeding the math/logic operation is the same for add and for and, and we are only trying to measure the difference between the two, you are correct: the AND is faster. Without getting into further detail, you can see that an AND has only one stage/gate, where a full adder takes three levels. Back-of-the-envelope math: roughly three times as much time to settle the signals once the inputs change, compared to the AND... BUT... Although there are some exceptions, chips are not designed to take advantage of this (multiply and divide vs add/and/xor/etc., yes, they are or are more likely to be).

One would design these simple operations to take one clock cycle: on one clock the inputs to the combinational logic (the actual AND or ADD) are latched; on the next clock the result is latched from the other end and begins its journey to the register file, or out of the core to memory, etc. At some point in the design you do synthesis into the gates available for the foundry/process you are using, then do timing analysis/closure on that and look for long poles in the tent. It is extremely unlikely (impossible) that the add is a long pole; both add and and are very, very short poles. But at that point you determine what your max clock rate is; if you wanted a 4GHz processor and the result comes back 2.7, well, you need to take those long poles and turn them into two or more clock operations.

The time it takes to do an add vs an and, which should differ (the add should be longer), is so small and so deep in the noise that it all fits within a clock cycle. So even a functional simulation of the logic design would not show the difference. You would need to implement an AND and a full adder in, say, PSpice using transistors and other components, then feed step changes into the inputs and watch how long the outputs take to settle. That, or build them from discrete components from Radio Shack and try it, although the results might be too fast for your scope, so use PSpice or similar.
Think of writing equations to solve something: you can write one long equation, or you can break it into multiple smaller ones with intermediate variables.
this
a = b+c+d+e+f+g;
vs
x=b+c;
y=d+e;
z=f+g;
a=x+y;
a=a+z;
One clock vs 5 clocks, but each of the 5 clocks can be faster, if this was the longest pole in the tent; all the other logic feeding this gets that much faster too. (Actually x, y, z can happen in one clock, then either a=x+y+z in the next, or make it two more.)
Multiply and divide are different simply because the logic explodes exponentially. There is no magic to multiply or divide; they have to work the same way we do things on pencil and paper. You can take shortcuts with binary if you think about it, as each digit can only multiply by 0 or 1 before shifting and adding to the accumulator. The logic equations for a one-clock version still explode exponentially, and then you can do things in parallel. It burns a ton of chip real estate, so you may choose to make multiply, as well as divide, take more than one clock and hide those in the pipeline; or you can choose to burn a significant amount of chip real estate... Look at the documentation for some of the ARM cores: at compile time (when you compile/synthesize the core) you can choose a one-clock or multi-clock multiply to balance chip size vs performance. With x86 we don't buy the IP and make the chips ourselves, so it is up to Intel how they do it; very likely it is microcoded, so by just microcoding they can tweak how things happen, or do it as an ALU-type operation.
So you might detect multiply or divide performance vs add/and with a test like this, but either they did it in one clock and you can't ever see it, or they buried it in two or more steps in the pipe so that it averages out; to really see it you would need access to a chip sim. Using timers and running something a billion times is fun, but to actually see instruction performance you need a chip sim, and you need to tweak the code to keep the code not under test from affecting the results.
`and` and `add` are of the same speed. If you measure anything different, your measuring+measured code is flawed, which wouldn't be of any surprise, as measuring performance on x86 with artificial code is very difficult and there's a ton of tiny details which may skew the results either way. Reading Agner Fog's articles on performance of x86 instructions would give you a lot more accurate data, although it's sadly missing that feel of exploration with your own code. – Calvados

`add` rather than the same code with `and`, since on x86 add can also use `lea`, which allows 3 inputs (two regs and one constant) and a distinct destination. On the other hand, `add` and `and` are both 2-input and the destination is shared with one input. For some code using `lea` can be faster. – Ostrander