What is IACA and how do I use it?

I've found this interesting and powerful tool called IACA (the Intel Architecture Code Analyzer), but I have trouble understanding it. What can I do with it, what are its limitations and how can I:

  • Use it to analyze code in C or C++?
  • Use it to analyze code in x86 assembler?
Nimitz answered 24/9, 2014 at 15:53 Comment(0)

2019-04: Reached EOL. Suggested alternative: LLVM-MCA

2017-11: Version 3.0 released (latest as of 2019-05-18)

2017-03: Version 2.3 released

What it is:

IACA (the Intel Architecture Code Analyzer) is a freeware, closed-source static analysis tool from Intel (end-of-life as of 2019) that analyzes how instructions are scheduled when executed by modern Intel processors. This allows it to compute, for a given snippet,

  • In Throughput mode, the maximum throughput (the snippet is assumed to be the body of an innermost loop)
  • In Latency mode, the minimum latency from the first instruction to the last.
  • In Trace mode, the progress of instructions through their pipeline stages.

when assuming optimal execution conditions (all memory accesses hit the L1 cache and there are no page faults).

IACA supports computing schedulings for Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell and Skylake processors as of version 2.3 and Haswell, Broadwell and Skylake as of version 3.0.

IACA is a command-line tool that produces ASCII text reports and Graphviz diagrams. Versions 2.1 and below supported 32- and 64-bit Linux, Mac OS X and Windows and analysis of 32-bit and 64-bit code; Version 2.2 and up only support 64-bit OSes and analysis of 64-bit code.

How to use it:

IACA's input is a compiled binary of your code, into which two markers have been injected: a start marker and an end marker. The markers make the code unrunnable, but allow the tool to quickly find the relevant pieces of code and analyze them.

You do not need to be able to run the binary on your system; IACA only needs to be able to read the binary to be analyzed. Thus it is possible, using IACA, to analyze a Haswell binary employing FMA instructions on a Pentium III machine.

C/C++

In C and C++, one gains access to marker-injecting macros with #include "iacaMarks.h", where iacaMarks.h is a header that ships with the tool in the include/ subdirectory.

One then inserts the markers around the innermost loop of interest, or the straight-line chunk of interest, as follows:

/* C or C++ usage of IACA */

while(cond){
    IACA_START
    /* Loop body */
    /* ... */
}
IACA_END

The application is then rebuilt as it normally would be, with optimizations enabled (in Release mode for users of IDEs such as Visual Studio). The output is a binary that is identical in all respects to the Release build except for the presence of the marks, which make the application non-runnable.

IACA relies on the compiler not reordering the marks excessively. For such analysis builds, certain powerful optimizations may therefore need to be disabled if they move the marks so that they include extraneous code from outside the innermost loop, or exclude code within it.
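
As a minimal sketch of the pattern above, the following shows a marked loop that can also be built without the marks. The build flag USE_IACA and the function name scale_add are made up for this illustration and are not part of IACA; only iacaMarks.h, IACA_START and IACA_END come from the tool.

```c
#include <stddef.h>

/* USE_IACA is a hypothetical build flag, not something IACA defines.
 * With it, the real markers from Intel's header are used and the
 * resulting binary is meant for analysis only. Without it, the
 * markers expand to nothing and the code runs normally. */
#ifdef USE_IACA
#include "iacaMarks.h"
#else
#define IACA_START
#define IACA_END
#endif

/* y[i] += a * x[i]; the loop body is the region to be analyzed. */
void scale_add(float *y, const float *x, float a, size_t n)
{
    size_t i = 0;
    while (i < n) {
        IACA_START          /* first thing inside the loop body */
        y[i] += a * x[i];
        i++;
    }
    IACA_END                /* immediately after the loop */
}
```

An analysis build would define USE_IACA and add the tool's include/ directory to the include path; a normal build leaves the flag undefined. It is worth disassembling the analysis build to check that the marks still bracket the loop after optimization.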

Assembly (x86)

IACA's markers are magic byte patterns injected at the correct location within the code. When using iacaMarks.h in C or C++, the compiler handles inserting the magic bytes specified by the header at the correct location. In assembly, however, you must manually insert these marks. Thus, one must do the following:

    ; NASM usage of IACA
    
    mov ebx, 111          ; Start marker bytes
    db 0x64, 0x67, 0x90   ; Start marker bytes
    
.innermostlooplabel:
    ; Loop body
    ; ...
    jne .innermostlooplabel ; Conditional branch backwards to top of loop

    mov ebx, 222          ; End marker bytes
    db 0x64, 0x67, 0x90   ; End marker bytes

It is critical for C/C++ programmers that the compiler achieve this same pattern.
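
To make the "magic byte pattern" idea concrete, here is a rough sketch in C that looks for the two marker sequences in a buffer of machine code. This is only an illustration of the concept, not IACA's actual implementation (which is not public). The function names are made up, and the byte arrays assume the usual 5-byte encoding of mov ebx, imm32 (opcode 0xBB followed by a little-endian immediate, with 111 = 0x6F and 222 = 0xDE).

```c
#include <stddef.h>
#include <string.h>

/* Assumed encodings of the marker sequences shown above:
 * "mov ebx, imm32" followed by the bytes 0x64, 0x67, 0x90. */
static const unsigned char START_MARK[] =
    {0xBB, 0x6F, 0x00, 0x00, 0x00, 0x64, 0x67, 0x90};
static const unsigned char END_MARK[] =
    {0xBB, 0xDE, 0x00, 0x00, 0x00, 0x64, 0x67, 0x90};

/* Offset of the first occurrence of pat in buf at or after `from`,
 * or -1 if it does not occur. */
long find_bytes(const unsigned char *buf, size_t len, size_t from,
                const unsigned char *pat, size_t patlen)
{
    for (size_t i = from; i + patlen <= len; i++) {
        if (memcmp(buf + i, pat, patlen) == 0)
            return (long)i;
    }
    return -1;
}

/* Locate the code between the markers. On success, *begin is the
 * offset of the first byte after the start marker, *end is the offset
 * of the end marker, and 1 is returned; otherwise 0 is returned. */
int find_marked_region(const unsigned char *buf, size_t len,
                       size_t *begin, size_t *end)
{
    long s = find_bytes(buf, len, 0, START_MARK, sizeof START_MARK);
    if (s < 0)
        return 0;
    size_t body = (size_t)s + sizeof START_MARK;
    long e = find_bytes(buf, len, body, END_MARK, sizeof END_MARK);
    if (e < 0)
        return 0;
    *begin = body;
    *end = (size_t)e;
    return 1;
}
```

A scan like this can serve as a quick sanity check that a compiler or assembler really emitted both marks, and in the expected order, before handing the binary to the tool.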

What it outputs:

As an example, let us analyze the following assembler example on the Haswell architecture:

.L2:
    vmovaps         ymm1, [rdi+rax] ;L2
    vfmadd231ps     ymm1, ymm2, [rsi+rax] ;L2
    vmovaps         [rdx+rax], ymm1 ; S1
    add             rax, 32         ; ADD
    jne             .L2             ; JMP

We add the start marker immediately before the .L2 label and the end marker immediately after jne. We then rebuild the software and invoke IACA as follows (on Linux, assuming the bin/ directory is in the path and foo is an ELF64 object containing the IACA marks):

iaca.sh -64 -arch HSW -graph insndeps.dot foo

This produces an analysis report of the 64-bit binary foo as if run on a Haswell processor, and a graph of the instruction dependencies viewable with Graphviz.

The report is printed to standard output (though it may be directed to a file with a -o switch). The report given for the above snippet is:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ../../../tests_fma
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.55 Cycles       Throughput Bottleneck: FrontEnd, PORT2_AGU, PORT3_AGU

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.5    0.0  | 0.5  | 1.5    1.0  | 1.5    1.0  | 1.0  | 0.0  | 1.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [rdi+rax*1]
|   2    | 0.5       | 0.5 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [rsi+rax*1]
|   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     | CP | vmovaps ymmword ptr [rdx+rax*1], ymm1
|   1    |           |     |           |           |     |     | 1.0 |     |    | add rax, 0x20
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xffffffffffffffec
Total Num Of Uops: 6

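The tool helpfully points out that the current bottlenecks are the Haswell front end and the AGUs of ports 2 and 3. From the table we can diagnose the problem: the store's address generation is not being handled by port 7 (its column is empty), so the store competes with the two loads for ports 2 and 3. Knowing this, we can take remedial action.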
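
One way to read the port table is that the busiest port sets a lower bound on cycles per iteration. The sketch below, with made-up names, takes the per-port totals from the report above and finds that maximum. The second array is a hypothetical what-if, not IACA output: it assumes the store's address generation (0.5 cycles each on ports 2 and 3) moved entirely to port 7.

```c
#include <stddef.h>

/* Per-port cycles from the "Cycles" row of the report (ports 0..7). */
static const double hsw_reported[8] =
    {0.5, 0.5, 1.5, 1.5, 1.0, 0.0, 1.0, 0.0};

/* Hypothetical: the store's address generation runs on port 7. */
static const double hsw_store_on_p7[8] =
    {0.5, 0.5, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0};

/* Largest per-port cycle count: a lower bound on cycles per iteration
 * imposed by the execution ports alone (the front end is ignored). */
double max_port_pressure(const double *cycles, size_t nports)
{
    double worst = 0.0;
    for (size_t i = 0; i < nports; i++) {
        if (cycles[i] > worst)
            worst = cycles[i];
    }
    return worst;
}
```

For the reported numbers the bound is 1.5 cycles, on ports 2 and 3. The block throughput IACA reports (1.55) sits slightly above that, which fits with the front end also being listed as a bottleneck. In the what-if case the port bound drops to 1.0, though the front end could still limit the loop.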

Limitations:

IACA does not support a few instructions, which are ignored in the analysis. It does not support processors older than Nehalem, and it does not support non-innermost loops in throughput mode (it has no way to guess which branch is taken how often and in what pattern).

Nimitz answered 24/9, 2014 at 15:53 Comment(50)
Does IACA require you have the hardware? I mean can you compile for e.g. fma3 and test it with IACA on a core2 system with only SSE2? And the opposite. If I want to test SSE2 only code can I do this with a Haswell system? If IACA reads counters I don't think this would be possible. But since IACA does not require root/admin I assume this means that it does not require the hardware.Fiora
@Zboson it does not require the hardware; It is a static analysis tool and as such never actually runs the code. The only real requirement is a binary to analyze; You needn't even be able to run said binary to analyze it. In fact, the binary to be analyzed can't be run anyways because of the injected markers.Nimitz
Your markers for assembly are different than in the header. In the header it's only two instructions for each where you use three. The header has ;START_MARKER mov ebx, 111 db 0x64, 0x67, 0x90 ;END_MARKER mov ebx, 222 db 0x64, 0x67, 0x90Fiora
@Zboson I produced those from the iacaMarks.h header, and when I look at their definition there I see that for IACA_START the macro IACA_UD_BYTES is expanded before IACA_SSC_MARK, while for IACA_END it's the contrary. Correct me if I'm wrong?Nimitz
Look at the end of the header where it just says asm (not inline). That's what I use and it works for me. I wonder what the difference is.Fiora
It's the db 0x0F, 0x0B line. The end of the header does not use that. I don't know what it's for.Fiora
@Zboson but if you look at iacaMarks.h: #define IACA_END {IACA_SSC_MARK(222) \ IACA_UD_BYTES}. And that means that immediately after the mov 222 and three magic bytes there's the two bytes generated by IACA_UD_BYTES. 0F 0B is the opcode for UD2, the intentionally undefined instruction in x86.Nimitz
@Zboson I feel dumb. You're right, I'll edit my post. But then why does the C version feel obliged to also include ud2?Nimitz
I don't know. I just saw a difference and wondered why. If I had to guess it has something to do with inline assembly with C.Fiora
@Zboson So you didn't use ud2 and it still worked for you? Because in all my experiments I used ud2 and it worked OK.Nimitz
Yes, it works fine for me. You can try it. But you don't necessarily need to edit your question. People can read the comments here.Fiora
Fun tool :-} I have an inner assembler block with some internal branching that has two exits. I place the start mark at the top, and end marks on both exits. When I run it (it works! nice!) it chooses one of the two exits and shows me a result for the chosen path. a) it appears to pick up code inside the block that is conditionally, but rarely executed; how do I get it to ignore that, and b) How do I get to analyze both paths? (I'm about to try deleting the mark on one branch, but worry the tool is going to follow that branch into the infinite supply of code it leads to...Decimate
@IraBaxter I'm not too sure, having not tried this tool on anything but branchless inner loops, but my understanding is that it doesn't follow branches; it only hunts for the marks and analyzes the stretch of code in-between, however insane. Assuming your branches are predictable (IACA assumes no misprediction penalties), I'd disassemble the particular code path you're interested in, concatenate the basic blocks actually executed together with their terminating (conditional) branches, wrap this concatenation of assembly instructions with the marks, re-assemble them and analyze that.Nimitz
@IraBaxter Or in other words, if you have BB0; je L0; BB1; jmp L1; L0: BB2; jmp L2; L1: BB3; exit; L2: BB4; exit and you're interested in the else-branch BB0 -> BB2 -> BB4, I'd stick them together contiguously as follows: START_MARK-BB0-je-BB2-jmp-BB4-END_MARK in an assembler file, assemble that and analyze.Nimitz
@IwillnotexistIdonotexist: (It's pretty cool to see the actual code in there; amazingly my tight code is largely manufactured by carefully designed macros). Ah, that explains why it showed me the rarely executed code (the assembly block conditionally jumps around it, so it is inline) and why it picked just the one path. I can apply it to the (partial) other path by judicious insertion of marks. (Too bad it doesn't follow branch hints).Decimate
@IwillnotexistIdonotexist, we could post a comment on the IACA webpage and ask them what the purpose of db 0x0F, 0x0B.Fiora
@IwillnotexistIdonotexist: +1: awesome - thanks for the pointer to this useful tool - the Graphviz plots are a nice touch too.Polysynthetic
does the tool require that you use the markers provided in the header, can you add it yourself?Tannic
@Tannic No, you can add them yourself. That's why I showed how to do them in both C and NASM; Because fundamentally IACA only looks for magic sequences of instructions to delimit the areas to analyze.Nimitz
thanks. I really wish I could just feed it some assembly (unencoded if it pleased IACA) so I don't have to generate an ELF binary.Tannic
@Tannic In that regard, IACA is somewhat inflexible; it's not open-source, and it internally uses the Pin framework for binary disassembly, so one is forced to use the inputs Pin can handle. The best I can suggest is that you make yourself a script that accepts assembly input and makes a minimal ELF file out of it, with the .text section containing the compiled assembler and sandwiched with the marks.Nimitz
Right on, that's not a bad idea. The upside is I'll get to know how to make an ELF file! I'm going to do some googling but I'll appreciate any hints you may have! Thanks a lot +1.Tannic
Sorry an aside question, do you know what ports mean here? I'm a bit new to this, so I'm guessing it's some thing to do with the CPU, any docs or tutorials for that?Tannic
Ok, so I see what ports are now, but I thought pipelining was implemented in a different way ... it seems like at least some pipelining is done via ports.Tannic
@Tannic Modern Intel CPUs are not just pipelined (the concept of having multiple instructions in different stages of completion executing simultaneously) but also superscalar (the concept of executing multiple instructions at the same stage of completion). The (multiple) instructions that an Intel processor fetches are then decoded into 0+ micro-operations, and those are dispatched to a port(s) capable of handling them. Well-tuned code makes sure that the instructions used saturate the ports evenly, so all are productive.Nimitz
nicely explained sir!Tannic
Unfortunately, the tool is not being maintained since the last release in 2012 and it pukes on code produced by modern compilers.Gulf
What does a port pressure of 0.5 cycles mean? Does it mean that an instruction can be issued to two different ports? For instance, vfmadd231ps requires 0.5 cycles of both fma ports 0 and 1.Burgoyne
@Burgoyne Precisely. Reads on Intel Haswell can go either to port 2 or 3; Most basic integer ops can go to 0,1,5 or 6; Floating-point ops can go to 0 or 1. The OoO engine load-balances between the ports, leading to fractional cycles on average.Nimitz
The licence appears to be distinctly non-free, and downloads appear to be binary-only, so you need to correct the first sentence of this answer.Xanthic
@TobySpeight I wrote "free" in the sense of free beer, not free speech (to abuse the FSF's slogan). But sure, I can call that out.Nimitz
I don't know exactly when it was released but 2.2 is currently available with support for broadwell.Erving
@Christoph Thanks for the heads up, glad to see Intel giving a fresh look at IACA. Will edit.Nimitz
@IwillnotexistIdonotexist I tried to download IACA the last two days on different platforms and the download seems to be broken. Could you kindly confirm this?Erving
@Christoph Really? I just tried this link, which brought me to a list of downloads for v2.1 and v2.2. I clicked on the Linux v2.2 version, which gave me a license agreement. After I accepted it I was given the true list of downloads, which worked.Nimitz
@IwillnotexistIdonotexist I've tried the same at least 20 times now. Everytime I click the link in the true list of downloads I get a message "page redirected to often" and no download starts :/Erving
@IwillnotexistIdonotexist GOD d*mn intel! If I switch my browser language to english the download works just fine.Erving
@IwillnotexistIdonotexist encoding should be the same; I simply switched from de-de to en-us. The final download link seems to be language independent but it seems it doesn't work when sending Accept-Language: de-de, at least for me. Thanks for your help :)Erving
In version 2.1, defining IACA_MARKS_OFF disables IACA analysis, but this feature has disappeared in v2.2. Is there an "official" way of disabling IACA marks in version 2.2?Burgoyne
@Burgoyne Interesting... it would appear not. The best I could suggest is #undef IACA_SSC_MARK and #define IACA_SSC_MARK(x) {} or suchlike. Not very handy for build systems indeed. Also difficult to explain, coming from a company as well-organized as Intel.Nimitz
FYI, Intel has updated this again. v2.3 (released ~March 2017) adds support for Skylake and AVX-512. They have also added support for "Tracing in-depth information about different operation stages inside the processor." They explain exactly what that means in the new docs.Modulation
@DavidWohlferd Thanks for the heads-up!Nimitz
Yeah, IACA added support for skylake including skylake serverDear
@IwillnotexistIdonotexist There is a new release: v3.0. It supports only 4th to 6th generation processors (Haswell, Broadwell, Skylake). Tracing capabilities are more powerful.Burgoyne
@IwillnotexistIdonotexist, according to the IACA website, IACA is discontinued: April 2019: Intel® Architecture Code Analyzer has reached its End Of Life.Burgoyne
@Burgoyne :-( This sucks.Nimitz
Another alternative: OSACABaseline
Web interface to several static analysis tools: https://uica.uops.info/.Burgoyne
@chus, uica.uops.info is awesome!Fiora
@z-boson, yes, it is a great tool!Burgoyne
