Intel IACA analyzer alters assembly?
I wanted to run some code through the IACA analyzer to see how many uops it uses. I started with a simple function to check that the tool was working.

Unfortunately, when I insert the macros IACA says to use, the resulting assembly is very different, so any analysis of it is not helpful.

Here is the assembly produced without the IACA macros:

00007FF9CD590580  vaddps      ymm1,ymm5,ymmword ptr [rax]  
00007FF9CD590584  vaddps      ymm2,ymm6,ymmword ptr [rax+20h]  
00007FF9CD590589  vaddps      ymm3,ymm7,ymmword ptr [rax+40h]  
00007FF9CD59058E  vmulps      ymm4,ymm1,ymm1  
00007FF9CD590592  vfmadd231ps ymm4,ymm2,ymm2  
00007FF9CD590597  vfmadd231ps ymm4,ymm3,ymm3  
00007FF9CD59059C  vcmpgt_oqps ymm1,ymm4,ymm9  
00007FF9CD5905A2  vrsqrtps    ymm0,ymm4  
00007FF9CD5905A6  vandps      ymm2,ymm1,ymm0  
00007FF9CD5905AA  vmovups     ymm3,ymm8  
00007FF9CD5905AF  vfmsub231ps ymm3,ymm2,ymm4  
00007FF9CD5905B4  vmovups     ymmword ptr [r9+rax],ymm3  
00007FF9CD5905BA  add         rax,rcx  
00007FF9CD5905BD  sub         r8d,1  
00007FF9CD5905C1  jne         fm::EvlOp::applyLoop<`RegisterShapeOps<fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> > >'::`2'::doDISTANCE_SPHERE_11,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::DataWrapper,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::RegisterBlock,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::instruction_input>+0B0h (07FF9CD590580h)  

And here is what it produces once I add the IACA macros (I'm testing an MSVC-produced binary, so I'm using IACA_VC64_START and IACA_VC64_END, as the manual says to do):

00007FF9CD59058B  vmovups     ymm2,ymmword ptr [rax+40h]  
00007FF9CD590590  vmovups     ymm0,ymmword ptr [rax]  
00007FF9CD590594  vmovups     ymm1,ymmword ptr [rax+20h]  
00007FF9CD590599  vaddps      ymm3,ymm2,ymm8  
00007FF9CD59059E  vmovups     ymmword ptr [rbp+20h],ymm0  
00007FF9CD5905A3  vaddps      ymm0,ymm0,ymm6  
00007FF9CD5905A7  vmovups     ymmword ptr [rbp+40h],ymm1  
00007FF9CD5905AC  vmulps      ymm4,ymm0,ymm0  
00007FF9CD5905B0  vaddps      ymm1,ymm1,ymm7  
00007FF9CD5905B4  vfmadd231ps ymm4,ymm1,ymm1  
00007FF9CD5905B9  vfmadd231ps ymm4,ymm3,ymm3  
00007FF9CD5905BE  vcmpgt_oqps ymm1,ymm4,ymm5  
00007FF9CD5905C3  vrsqrtps    ymm0,ymm4  
00007FF9CD5905C7  vmovups     ymmword ptr [rbp+60h],ymm2  
00007FF9CD5905CC  vandps      ymm2,ymm1,ymm0  
00007FF9CD5905D0  vmovups     ymm3,ymm9  
00007FF9CD5905D5  vfmsub231ps ymm3,ymm2,ymm4  
00007FF9CD5905DA  vmovups     ymmword ptr [rcx+rax],ymm3  
00007FF9CD5905DF  add         rax,rdx  
00007FF9CD5905E2  mov         qword ptr [rbp+18h],rax  
00007FF9CD5905E6  vmovups     ymmword ptr [rbp+80h],ymm3  
00007FF9CD5905EE  sub         r8d,1  
00007FF9CD5905F2  jne         fm::EvlOp::applyLoop<`RegisterShapeOps<fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> > >'::`2'::doDISTANCE_SPHERE_11,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::DataWrapper,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::RegisterBlock,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::instruction_input>+0B2h (07FF9CD590582h)  

So the compiler has inserted lots of moves, and my (hopefully) fused load+add is no longer fused.

I was hoping IACA would be able to tell me whether

00007FF9CD590584  vaddps      ymm2,ymm6,ymmword ptr [rax+20h]

stayed fused, but that instruction was removed altogether: the load is now a separate vmovups.

Is this a known issue, or is it perhaps because I'm using MSVC, which may not be a very common setup?

Is there perhaps a way to fix this, or a better tool that is compatible with MSVC?

Hoatzin answered 16/5, 2019 at 5:16 Comment(0)

IACA's marker macros are just inline asm (or, for 64-bit MSVC, an intrinsic: start is __writegsbyte(111, 111); and stop uses 222). They can disturb the optimizer, or end up in the wrong place (e.g. the start mark is not the last instruction before falling into the loop, so the analyzed block includes some loop setup).

If that happens, like in your case, your best bet is to ask the compiler to produce asm (not machine code) output, and manually insert the markers into the asm you want to analyze.


In NASM syntax, I use this %ifdef / %else block so I can build with nasm -DIACA_MARKS or without it. I know this isn't the right syntax for MASM, but the IACA start/end markers are pretty simple: a mov of 111 (start) or 222 (end) into EBX, followed by fs addr32 nop (the bytes 0x64 0x67 0x90).

%ifdef IACA_MARKS

%macro  IACA_start 0             ; NASM macro with 0 args, defines IACA_start
     mov ebx, 111
     db 0x64, 0x67, 0x90
%endmacro
%macro  IACA_end 0
     mov ebx, 222
     db 0x64, 0x67, 0x90
%endmacro

%else
%define IACA_start
%define IACA_end
%endif
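
To make the marker layout concrete, here is a minimal C++17 sketch that builds the two byte sequences the macros above emit (mov ebx, imm32 encodes as 0xBB plus a little-endian immediate) and searches a code buffer for them. The names marker and find_marked_block are hypothetical helpers for illustration only. This is not IACA's own code, and it covers only the mov ebx form of the markers, not the __writegsbyte form used for 64-bit MSVC.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Marker bytes: mov ebx, imm32 (0xBB, then the immediate in little-endian
// order), followed by fs addr32 nop (0x64 0x67 0x90).
// imm is 111 for the start mark and 222 for the end mark.
static std::array<std::uint8_t, 8> marker(std::uint8_t imm) {
    return {0xBB, imm, 0x00, 0x00, 0x00, 0x64, 0x67, 0x90};
}

// Hypothetical helper: returns the [first, last) byte offsets of the code
// between the start mark and the end mark, or std::nullopt if either mark
// is missing from the buffer.
std::optional<std::pair<std::size_t, std::size_t>>
find_marked_block(const Bytes& code) {
    const auto start = marker(111);
    const auto stop = marker(222);

    // Locate the start mark anywhere in the buffer.
    auto s = std::search(code.begin(), code.end(), start.begin(), start.end());
    if (s == code.end()) return std::nullopt;

    // The marked block begins right after the start mark.
    auto body = s + start.size();

    // Locate the end mark somewhere after the start mark.
    auto e = std::search(body, code.end(), stop.begin(), stop.end());
    if (e == code.end()) return std::nullopt;

    return std::make_pair(static_cast<std::size_t>(body - code.begin()),
                          static_cast<std::size_t>(e - code.begin()));
}
```

Feeding it a buffer assembled with IACA_start, a loop body, and IACA_end should give back the offsets of just the loop body. If the reported range also contains setup instructions, the start mark landed too early, which is the misplacement problem described above.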
Antonyantonym answered 16/5, 2019 at 11:30 Comment(5)
Apparently IACA uses __writegsbyte(111, 111) for IACA_VC64_START, which should actually disturb the code less than trashing EBX would. - Edsel
@RossRidge: For 32-bit MSVC, the IACA header uses __emit so the compiler isn't "aware" of trashing EBX, and shouldn't affect the optimizer. Although actually it uses __asm mov ebx, x, only emit for the other bytes. But anyway, for GNU C inline asm it uses only a "memory" clobber, intentionally not telling the compiler about registers. I think it intentionally wants to crash so you don't accidentally use IACA-marked executables in production or run benchmarks on them, when they have extra bloat around the hottest loops in your program. - Antonyantonym
Hmm, ok IACA-2.1's iacaMarks.h used to also include ud2 before the begin mark, and after the end mark. But they seem to have removed that in later versions. - Antonyantonym
When I Google searched for it, what came up was __asm mov ebx, x for 32-bit Visual C++, but that was from some project's header file that copied the macros, so I don't know what version of IACA it was from or if it's accurate. - Edsel
@RossRidge: Yes, that is still what IACA has for 32-bit MSVC, only using __asm __emit for the nop. (I should have rewritten the first part of my earlier comment instead of just adding a correction after checking the header >.<) But for compilers that support GNU C inline asm, it trashes EBX without telling the compiler; that's what I was thinking of. - Antonyantonym
