I wanted to run some code through the IACA analyzer to see how many uops it uses, so I started with a simple function to check that the tool was working.
Unfortunately, when I insert the macros IACA says to use, the resulting assembly is very different, so any analysis of it is not helpful.
Here is the assembly produced without IACA:
00007FF9CD590580 vaddps ymm1,ymm5,ymmword ptr [rax]
00007FF9CD590584 vaddps ymm2,ymm6,ymmword ptr [rax+20h]
00007FF9CD590589 vaddps ymm3,ymm7,ymmword ptr [rax+40h]
00007FF9CD59058E vmulps ymm4,ymm1,ymm1
00007FF9CD590592 vfmadd231ps ymm4,ymm2,ymm2
00007FF9CD590597 vfmadd231ps ymm4,ymm3,ymm3
00007FF9CD59059C vcmpgt_oqps ymm1,ymm4,ymm9
00007FF9CD5905A2 vrsqrtps ymm0,ymm4
00007FF9CD5905A6 vandps ymm2,ymm1,ymm0
00007FF9CD5905AA vmovups ymm3,ymm8
00007FF9CD5905AF vfmsub231ps ymm3,ymm2,ymm4
00007FF9CD5905B4 vmovups ymmword ptr [r9+rax],ymm3
00007FF9CD5905BA add rax,rcx
00007FF9CD5905BD sub r8d,1
00007FF9CD5905C1 jne fm::EvlOp::applyLoop<`RegisterShapeOps<fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> > >'::`2'::doDISTANCE_SPHERE_11,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::DataWrapper,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::RegisterBlock,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::instruction_input>+0B0h (07FF9CD590580h)
And here is what it produces once I add the IACA macros (I'm testing an MSVC-produced binary, so I'm using IACA_VC64_START and IACA_VC64_END, as the manual says to do):
00007FF9CD59058B vmovups ymm2,ymmword ptr [rax+40h]
00007FF9CD590590 vmovups ymm0,ymmword ptr [rax]
00007FF9CD590594 vmovups ymm1,ymmword ptr [rax+20h]
00007FF9CD590599 vaddps ymm3,ymm2,ymm8
00007FF9CD59059E vmovups ymmword ptr [rbp+20h],ymm0
00007FF9CD5905A3 vaddps ymm0,ymm0,ymm6
00007FF9CD5905A7 vmovups ymmword ptr [rbp+40h],ymm1
00007FF9CD5905AC vmulps ymm4,ymm0,ymm0
00007FF9CD5905B0 vaddps ymm1,ymm1,ymm7
00007FF9CD5905B4 vfmadd231ps ymm4,ymm1,ymm1
00007FF9CD5905B9 vfmadd231ps ymm4,ymm3,ymm3
00007FF9CD5905BE vcmpgt_oqps ymm1,ymm4,ymm5
00007FF9CD5905C3 vrsqrtps ymm0,ymm4
00007FF9CD5905C7 vmovups ymmword ptr [rbp+60h],ymm2
00007FF9CD5905CC vandps ymm2,ymm1,ymm0
00007FF9CD5905D0 vmovups ymm3,ymm9
00007FF9CD5905D5 vfmsub231ps ymm3,ymm2,ymm4
00007FF9CD5905DA vmovups ymmword ptr [rcx+rax],ymm3
00007FF9CD5905DF add rax,rdx
00007FF9CD5905E2 mov qword ptr [rbp+18h],rax
00007FF9CD5905E6 vmovups ymmword ptr [rbp+80h],ymm3
00007FF9CD5905EE sub r8d,1
00007FF9CD5905F2 jne fm::EvlOp::applyLoop<`RegisterShapeOps<fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> > >'::`2'::doDISTANCE_SPHERE_11,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::DataWrapper,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::RegisterBlock,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::instruction_input>+0B2h (07FF9CD590582h)
So it has inserted lots of moves, and now my (hopefully) fused add is no longer fused.
I was hoping it would be able to tell me whether
00007FF9CD590584 vaddps ymm2,ymm6,ymmword ptr [rax+20h]
stayed fused, but that instruction was removed altogether.
Is this a known issue, or is it perhaps because I'm using MSVC, which may not be a very common setup for IACA?
Is there a way to fix this, or a better tool that is compatible with MSVC?
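For reference, the marker placement I mean is the one from the IACA manual as I understand it: the start marker at the top of the loop body and the end marker right after the loop. Below is a minimal sketch of that pattern. The loop is a hypothetical stand-in rather than my actual kernel, and USE_IACA is a made-up build flag so the snippet still compiles when iacaMarks.h is not available.

```cpp
#include <cstddef>

// Assumption: USE_IACA is a hypothetical build flag. When it is set, Intel's
// iacaMarks.h supplies IACA_VC64_START / IACA_VC64_END for 64-bit MSVC.
// Otherwise the markers expand to nothing so the code builds anywhere.
#if defined(USE_IACA)
#include "iacaMarks.h"
#else
#define IACA_VC64_START
#define IACA_VC64_END
#endif

// Hypothetical stand-in kernel: sum of squares over n floats.
float sum_squares(const float* x, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        IACA_VC64_START          // start marker at the top of the loop body
        acc += x[i] * x[i];
    }
    IACA_VC64_END                // end marker right after the loop
    return acc;
}
```

Comparing the disassembly of this function built with and without USE_IACA should show whether the markers alone are enough to change register allocation and load folding, independent of my real code.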
__writegsbyte(111, 111)
for IACA_VC64_START, which should actually disturb the code less than trashing EBX would. – Edsel