What methods can be used to efficiently extend instruction length on modern x86?

Imagine you want to align a series of x86 assembly instructions to certain boundaries. For example, you may want to align loops to a 16 or 32-byte boundary, or pack instructions so they are efficiently placed in the uop cache or whatever.

The simplest way to achieve this is single-byte NOP instructions, followed closely by multi-byte NOPs. Although the latter is generally more efficient, neither method is free: NOPs use front-end execution resources, and also count against your 4-wide¹ rename limit on modern x86.

Another option is to somehow lengthen some instructions to get the alignment you want. If this is done without introducing new stalls, it seems better than the NOP approach. How can instructions be efficiently made longer on recent x86 CPUs?

In an ideal world, lengthening techniques would simultaneously be:

  • Applicable to most instructions
  • Capable of lengthening the instruction by a variable amount
  • Free of decoder stalls or other decode slowdowns
  • Efficiently represented in the uop cache

It isn't likely that there is a single method that satisfies all of the above points simultaneously, so good answers will probably address various tradeoffs.


¹ The limit is 5 or 6 on AMD Ryzen.

Maggee answered 1/1, 2018 at 2:21 Comment(16)
Ideally you can use an imm32 or disp32 form of an instruction that only needed imm8. Or use a REX prefix when you don't need one. Repeating the same prefix is sometimes possible. It's not in general safe to add prefixes like rep that current CPUs ignore, because they might mean something else in future ISA extensions. And yes, lengthening instructions is generally better than a NOP, but it's probably easy to introduce decode / pre-decode bottlenecks (e.g. fewer instructions fit in a group of 16 or 32 bytes).Bursiform
I wouldn't say "followed closely". Single-byte nop can defeat the uop cache, and make things dramatically worse than a couple long NOPs if you need 15 bytes of padding, or especially 31 bytes.Bursiform
I wouldn't consider "fewer instructions fit in a group of 16 or 32 bytes" either a positive or a negative when contrasting with nops. Sure, nops get you "more instructions" per chunk - but they are useless: they are only for padding! So under either approach, a given chunk will have the same number of "useful" instructions, and how the total number of instructions helps or hurts depends on the exact numbers (e.g., 4 per 16 bytes is much better than 5 on Haswell, but not on Skylake). Also, if you're doing this, it is assumed you are aware of ideal instruction layouts...Maggee
I doubt executing long strings of single-byte nops is common: you'd usually just jmp over them. So the pain of single byte nops is fairly limited if you use jmp, and also if they are expected to be executed and used to align blocks to say 16B boundaries there are a limited number of them by definition... FWIW though by "closely" I meant close in simplicity, not close in efficiency (although I think they are probably close in efficiency in practice too).Maggee
I went to clarify it, but upon reading it again, I don't think it needs clarification: it should be clear the "followed closely" is referring to simplicity since I don't even mention efficiency in that sentence. I only bring it up later in the question.Maggee
There could be cases where getting the real instructions into the OoO core sooner could be a win, even if the front-end has to spend the same amount of total decode cycles because of a later NOP. And re: single-byte NOPs. Yes, in real life you'd jmp over them instead of running them, but that's not what your question says. But yes, I did misread it, I was thinking the "followed closely" was referring to efficiency.Bursiform
I don't think my question takes a stance on jmp either way. It mentions single-byte nops and multi-byte nops, with the general idea being there are only a few of either of them. If you have a lot (enough to break the uop cache) then you'd probably be looking at jumping over them. At that point it doesn't matter what "filler" you use since you are jumping over it, right? Do I need to clarify that?Maggee
I can't really imagine a scenario where the "getting the instructions into the core sooner" would apply here. Of course you could come up with a scenario where you "dumbly" aligned something like PIII.I instead of IIII.N where P is a prefix (or whatever other lengthening method) byte, I are instructions bytes for 1 instruction, N is a nop, and . is a boundary (say a 16B boundary) that makes the instruction move into the next fetch in the former case, but the whole point here is to align things properly. Do you have a real example?Maggee
There are plenty of real examples, like say 1111 1222 2233 333N vs 1111 1222 22P3 3333 where 1, 2, 3 are bytes belonging to three instructions, and N is a nop and P is a prefix: both of the methods align the 15 bytes of instructions to a 16 byte boundary (the assumption being that's the goal) and the former results in 4 instructions in the block and the latter 3. The former will decode more efficiently on Haswell, because 4, 4, 4 is better than 4, 1, 1, but the exact opposite would happen if the instruction counts were 4 and 5. That's why I mean it's a wash.Maggee
(note that 4, 1, 1 doesn't actually decode at 2.5 instructions/cycle, but something like 3, so there is something else going on, but in general it still seems better to have 4 (or 8) instructions in a block)Maggee
Let us continue this discussion in chat.Maggee
For one byte of padding, I think adding a ds prefix to any memory access instruction is completely free, and probably multiple ds prefixes are, too.Optative
I think segment prefixes are safe to repeat, but I have no idea about performance implications. I.e. this is hint for further research or question for people knowing the x86 inside out, not answer. EDIT: almost what prl saidCosgrove
I have got an indication from a producer of an RTOS that Intel is weakening support for segmentation, as the majority of OSs use it in a very limited and quite standardized way. This also means that on some CPUs (the Atom series in particular) changes to segment registers are becoming more expensive. I don't know if this applies also to segment prefix decoding (though I think it shouldn't, since the expensive part is the load of descriptors from the system table, not the usage of an already-loaded descriptor)Flak
@prl: Some CPUs (like AMD) decode slowly when instructions have more than 3 prefixes. On some CPUs, this includes the mandatory prefixes in SSE2 and especially SSSE3 / SSE4.1 instructions. In Silvermont, even the 0F escape byte counts. But yes, that's a good suggestion. It would be cool if any assemblers could do this for you automatically. e.g. give it a region to expand to end at a certain alignment. (@ Bee: auto-generated padding is what I was thinking of earlier when I mentioned delaying decode/issue of the critical path. Re-doing padding by hand is time consuming.)Bursiform
uop decoding doesn't matter if the uops are already in the LSD or the uop cache, in Skylake and up...Lechner

Consider mild code-golfing to shrink your code instead of expanding it, especially before a loop. e.g. xor eax,eax / cdq if you need two zeroed registers, or mov eax, 1 / lea ecx, [rax+1] to set registers to 1 and 2 in only 8 total bytes instead of 10. See Set all bits in CPU register to 1 efficiently for more about that, and Tips for golfing in x86/x64 machine code for more general ideas. Probably you still want to avoid false dependencies, though.
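A minimal NASM sketch of those two tricks (byte counts are the standard encodings):

    xor  eax, eax        ; 2 bytes: EAX = 0
    cdq                  ; 1 byte:  EDX = 0 (copies the sign bit of the now-zero EAX)

    mov  eax, 1          ; 5 bytes: B8 imm32
    lea  ecx, [rax+1]    ; 3 bytes: ECX = 2, reusing RAX instead of a second imm32

That's 3 bytes for two zeroed registers instead of 4 for two xor-zeroing idioms, and 8 bytes for the 1-and-2 pair instead of 10.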

Or fill extra space by creating a vector constant on the fly instead of loading it from memory. (Adding more uop-cache pressure could be worse, though, for the larger loop that contains your setup + inner loop. But it avoids d-cache misses for constants, so it has an upside to compensate for running more uops.)

If you weren't already using them to load "compressed" constants, pmovsxbd, movddup, or vpbroadcastd are longer than movaps. dword / qword broadcast loads are free (no ALU uop, just a load).
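A sketch of the tradeoff with RIP-relative operands (the label names are placeholders):

    movaps   xmm0, [rel vec16]   ; 7 bytes, needs a full 16-byte constant in memory
    movddup  xmm1, [rel vec8]    ; 8 bytes, qword broadcast: constant shrinks to 8 bytes
    pmovsxbd xmm2, [rel vec4]    ; 9 bytes, sign-extends 4 bytes into 4 dwords
                                 ; (needs a shuffle uop as well as the load, unlike the broadcasts)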

If you're worried about code alignment at all, you're probably worried about how it sits in the L1I cache or where the uop-cache boundaries are, so just counting total uops is no longer sufficient, and a few extra uops in the block before the one you care about may not be a problem at all.

But in some situations, you might really want to optimize decode throughput / uop-cache usage / total uops for the instructions before the block you want aligned.


Padding instructions, like the question asked for:

Agner Fog has a whole section on this: "10.6 Making instructions longer for the sake of alignment" in his "Optimizing subroutines in assembly language" guide. (The lea, push r/m64, and SIB ideas are from there, and I copied a sentence / phrase or two, otherwise this answer is my own work, either different ideas or written before checking Agner's guide.)

It hasn't been updated for current CPUs, though: lea eax, [rbx + dword 0] has more downsides than it used to vs mov eax, ebx, because you miss out on zero-latency / no execution unit mov. If it's not on the critical path, go for it though. Simple lea has fairly good throughput, and an LEA with a large addressing mode (and maybe even some segment prefixes) can be better for decode / execute throughput than mov + nop.

Use the general form instead of the short form (no ModR/M) of instructions like push reg or mov reg,imm. e.g. use 2-byte push r/m64 for push rbx. Or use an equivalent instruction that is longer, like add dst, 1 instead of inc dst, in cases where there are no perf downsides to inc so you were already using inc.
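A NASM sketch, hand-encoding the general forms the assembler won't pick on its own (the db bytes are the standard encodings):

    push rbx                   ; 1 byte:  53 (register-in-opcode short form)
    db 0xFF, 0xF3              ; 2 bytes: the same push rbx via push r/m64 (FF /6)

    inc  eax                   ; 2 bytes: FF C0
    add  eax, 1                ; 3 bytes: 83 C0 01; same result, except add also writes CF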

Use SIB byte. You can get NASM to do that by using a single register as an index, like mov eax, [nosplit rbx*1], but that hurts the load-use latency vs. simply encoding mov eax, [rbx] with a SIB byte. Indexed addressing modes have other downsides on SnB-family, like un-lamination and not using port 7 for stores.

So it's best to just encode base=rbx + disp0/8/32=0 using ModR/M + SIB with no index reg. (The SIB encoding for "no index" is the encoding that would otherwise mean idx=RSP). [rsp + x] addressing modes require a SIB already (base=RSP is the escape code that means there's a SIB), and that appears all the time in compiler-generated code. So there's very good reason to expect this to be fully efficient to decode and execute (even for base registers other than RSP) now and in the future. NASM syntax can't express this, so you'd have to encode manually. GNU gas Intel syntax from objdump -d says 8b 04 23 mov eax,DWORD PTR [rbx+riz*1] for Agner Fog's example 10.20. (riz is a fictional index-zero notation that means there's a SIB with no index). I haven't tested if GAS accepts that as input.
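A hand-encoded NASM sketch of those forms (the 3-byte version matches the objdump output above):

    mov eax, [rbx]                     ; 2 bytes: 8B 03, no SIB
    db 0x8B, 0x04, 0x23                ; 3 bytes: mov eax, [rbx] with a no-index SIB
    db 0x8B, 0x44, 0x23, 0x00          ; 4 bytes: same, plus a disp8 = 0
    db 0x8B, 0x84, 0x23, 0, 0, 0, 0    ; 7 bytes: same, plus a disp32 = 0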

Use an imm32 and/or disp32 form of an instruction that only needed imm8 or disp0/disp32. Agner Fog's testing of Sandybridge's uop cache (microarch guide table 9.1) indicates that the actual value of an immediate / displacement is what matters, not the number of bytes used in the instruction encoding. I don't have any info on Ryzen's uop cache.

So NASM imul eax, [dword 4 + rdi], strict dword 13 (10 bytes: opcode + modrm + disp32 + imm32) would use the 32small, 32small category and take 1 entry in the uop cache, unlike if either the immediate or disp32 actually had more than 16 significant bits. (Then it would take 2 entries, and loading it from the uop cache would take an extra cycle.)

According to Agner's table, 8/16/32small are always equivalent for SnB. And addressing modes with a register are the same whether there's no displacement at all, or whether it's 32small, so mov dword [dword 0 + rdi], 123456 takes 2 entries, just like mov dword [rdi], 123456789. I hadn't realized [rdi] + full imm32 took 2 entries, but apparently that is the case on SnB.

Use jmp / jcc rel32 instead of rel8. Ideally try to expand instructions in places that don't require longer jump encodings outside the region you're expanding. Pad after jump targets for earlier forward jumps, pad before jump targets for later backward jumps, if they're close to needing a rel32 somewhere else. i.e. try to avoid padding between a branch and its target, unless you want that branch to use a rel32 anyway.
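In NASM, the near keyword forces the rel32 form; a sketch:

    jz  .target          ; 2 bytes when .target is in range: 74 rel8
    jz  near .target     ; 6 bytes: 0F 84 rel32
    jmp near .target     ; 5 bytes: E9 rel32, vs. the 2-byte EB rel8 short form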


You might be tempted to encode mov eax, [symbol] as 6-byte a32 mov eax, [abs symbol] in 64-bit code, using an address-size prefix to use a 32-bit absolute address. But this does cause a Length-Changing-Prefix stall when it decodes on Intel CPUs. Fortunately, none of NASM/YASM / gas / clang do this code-size optimization by default if you don't explicitly specify a 32-bit address-size, instead using 7-byte mov r32, r/m32 with a ModR/M+SIB+disp32 absolute addressing mode for mov eax, [abs symbol].

In 64-bit position-dependent code, absolute addressing is a cheap way to use 1 extra byte vs. RIP-relative. But note that 32-bit absolute + immediate takes 2 cycles to fetch from uop cache, unlike RIP-relative + imm8/16/32 which takes only 1 cycle even though it still uses 2 entries for the instruction (e.g. for a mov-store or a cmp). So cmp [abs symbol], 123 is slower to fetch from the uop cache than cmp [rel symbol], 123, even though both take 2 entries each. Without an immediate, there's no extra fetch cost for the absolute form.

Note that PIE executables allow ASLR even for the executable, and are the default in many Linux distros, so if you can keep your code PIC without any perf downsides, then that's preferable.


Use a REX prefix when you don't need one, e.g. db 0x40 / add eax, ecx.

It's not in general safe to add prefixes like rep that current CPUs ignore, because they might mean something else in future ISA extensions.

Repeating the same prefix is sometimes possible (not with REX, though). For example, db 0x66, 0x66 / add ax, bx gives the instruction 3 operand-size prefixes, which I think is always strictly equivalent to one copy of the prefix. Up to 3 prefixes is the limit for efficient decoding on some CPUs. But this only works if you have a prefix you can use in the first place; you usually aren't using 16-bit operand-size, and generally don't want 32-bit address-size (although it's safe for accessing static data in position-dependent code).
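A sketch of that in NASM; add ax, bx already carries one 66 prefix, so two db bytes bring it to three:

    db 0x66, 0x66        ; two redundant operand-size prefixes
    add ax, bx           ; assembles to 66 01 D8, so the pair of lines is 66 66 66 01 D8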

A ds or ss prefix on an instruction that accesses memory is a no-op, and probably doesn't cause any slowdown on any current CPUs. (@prl suggested this in comments).

In fact, Agner Fog's microarch guide uses a ds prefix on a movq [esi+ecx],mm0 in Example 7.1 ("Arranging IFETCH blocks") to tune a loop for PII/PIII (no loop buffer or uop cache), speeding it up from 3 clocks per iteration to 2.

Some CPUs (like AMD) decode slowly when instructions have more than 3 prefixes. On some CPUs, this includes the mandatory prefixes in SSE2 and especially SSSE3 / SSE4.1 instructions. In Silvermont, even the 0F escape byte counts.

AVX instructions can use a 2- or 3-byte VEX prefix. Some instructions require a 3-byte VEX prefix (2nd source is x/ymm8-15, or mandatory prefixes for SSSE3 or later). But an instruction that could have used a 2-byte prefix can always be encoded with a 3-byte VEX: in NASM or GAS, {vex3} vxorps xmm0,xmm0,xmm0. If AVX512 is available, you can use 4-byte EVEX as well.


Use 64-bit operand-size for mov even when you don't need it, for example mov rax, strict dword 1 forces the 7-byte sign-extended-imm32 encoding in NASM, which would normally optimize it to 5-byte mov eax, 1.

mov    eax, 1                ; 5 bytes to encode (B8 imm32)
mov    rax, strict dword 1   ; 7 bytes: REX mov r/m64, sign-extended-imm32.
mov    rax, strict qword 1   ; 10 bytes to encode (REX B8 imm64).  movabs mnemonic for AT&T.

You could even use mov reg, 0 instead of xor reg,reg.

mov r64, imm64 fits efficiently in the uop cache when the constant is actually small (fits in 32-bit sign extended.) 1 uop-cache entry, and load-time = 1, the same as for mov r32, imm32. Decoding a giant instruction means there's probably not room in a 16-byte decode block for 3 other instructions to decode in the same cycle, unless they're all 2-byte. Possibly lengthening multiple other instructions slightly can be better than having one long instruction.


Decode penalties for extra prefixes:

  • P5: prefixes prevent pairing, except for address/operand-size on PMMX only.
  • PPro to PIII: There is always a penalty if an instruction has more than one prefix. This penalty is usually one clock per extra prefix. (Agner's microarch guide, end of section 6.3)
  • Silvermont: it's probably the tightest constraint on which prefixes you can use, if you care about it. Decode stalls on more than 3 prefixes, counting mandatory prefixes + 0F escape byte. SSSE3 and SSE4 instructions already have 3 prefixes so even a REX makes them slow to decode.
  • some AMD: maybe a 3-prefix limit, not including escape bytes, and maybe not including mandatory prefixes for SSE instructions.

... TODO: finish this section. Until then, consult Agner Fog's microarch guide.


After hand-encoding stuff, always disassemble your binary to make sure you got it right. It's unfortunate that NASM and other assemblers don't have better support for choosing cheap padding over a region of instructions to reach a given alignment boundary.
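NASM's smartalign macro package at least upgrades align padding from single-byte 90 NOPs to long NOPs; it still pads with NOPs rather than lengthening the preceding instructions, though. A sketch:

    %use smartalign
    alignmode p6, 32     ; pad with long 0F 1F NOPs; jump over padding larger than 32 bytes

    align 16             ; now emits a few multi-byte NOPs instead of up to 15 single-byte ones
    loop_top: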


Assembler syntax

NASM has some encoding override syntax: {vex3} and {evex} prefixes, NOSPLIT, and strict byte / dword, and forcing disp8/disp32 inside addressing modes. Note that [rdi + byte 0] isn't allowed, the byte keyword has to come first. [byte rdi + 0] is allowed, but I think that looks weird.

Listing from nasm -l/dev/stdout -felf64 padding.asm

 line  addr    machine-code bytes      source line
 num

 4 00000000 0F57C0                         xorps  xmm0,xmm0    ; SSE1 *ps instructions are 1-byte shorter
 5 00000003 660FEFC0                       pxor   xmm0,xmm0
 6                                  
 7 00000007 C5F058DA                       vaddps xmm3, xmm1,xmm2
 8 0000000B C4E17058DA              {vex3} vaddps xmm3, xmm1,xmm2
 9 00000010 62F1740858DA            {evex} vaddps xmm3, xmm1,xmm2
10                                  
11                                  
12 00000016 FFC0                        inc  eax
13 00000018 83C001                      add  eax, 1
14 0000001B 4883C001                    add  rax, 1
15 0000001F 678D4001                    lea  eax, [eax+1]     ; runs on fewer ports and doesn't set flags
16 00000023 67488D4001                  lea  rax, [eax+1]     ; address-size and REX.W
17 00000028 0501000000                  add  eax, strict dword 1   ; using the EAX-only encoding with no ModR/M 
18 0000002D 81C001000000                db 0x81, 0xC0, 1,0,0,0     ; add    eax,0x1  using the ModR/M imm32 encoding
19 00000033 81C101000000                add  ecx, strict dword 1   ; non-eax must use the ModR/M encoding
20 00000039 4881C101000000              add  rcx, strict qword 1   ; YASM requires strict dword for the immediate, because it's still 32b
21 00000040 67488D8001000000            lea  rax, [dword eax+1]
22                                  
23                                  
24 00000048 8B07                        mov  eax, [rdi]
25 0000004A 8B4700                      mov  eax, [byte 0 + rdi]
26 0000004D 3E8B4700                    mov  eax, [ds: byte 0 + rdi]
26          ******************       warning: ds segment base generated, but will be ignored in 64-bit mode
27 00000051 8B8700000000                mov  eax, [dword 0 + rdi]
28 00000057 8B043D00000000              mov  eax, [NOSPLIT dword 0 + rdi*1]  ; 1c extra latency on SnB-family for non-simple addressing mode

GAS has encoding-override pseudo-prefixes {vex3}, {evex}, {disp8}, and {disp32}. These replace the now-deprecated .s, .d8 and .d32 suffixes.

GAS doesn't have an override for immediate size, only for displacements.

GAS does let you add an explicit ds prefix, with ds mov src,dst.

gcc -g -c padding.S && objdump -drwC padding.o -S, with hand-editing:

  # no CPUs have separate ps vs. pd domains, so there's no penalty for mixing ps and pd loads/shuffles
  0:   0f 28 07                movaps (%rdi),%xmm0
  3:   66 0f 28 07             movapd (%rdi),%xmm0

  7:   0f 58 c8                addps  %xmm0,%xmm1        # not equivalent for SSE/AVX transitions, but sometimes safe to mix with AVX-128

  a:   c5 e8 58 d9             vaddps %xmm1,%xmm2, %xmm3  # default {vex2}
  e:   c4 e1 68 58 d9          {vex3} vaddps %xmm1,%xmm2, %xmm3
 13:   62 f1 6c 08 58 d9       {evex} vaddps %xmm1,%xmm2, %xmm3

 19:   ff c0                   inc    %eax
 1b:   83 c0 01                add    $0x1,%eax
 1e:   48 83 c0 01             add    $0x1,%rax
 22:   67 8d 40 01             lea  1(%eax), %eax     # runs on fewer ports and doesn't set flags
 26:   67 48 8d 40 01          lea  1(%eax), %rax     # address-size and REX
         # no equivalent for  add  eax, strict dword 1   # no-ModR/M

         .byte 0x81, 0xC0; .long 1    # add    eax,0x1  using the ModR/M imm32 encoding
 2b:   81 c0 01 00 00 00       add    $0x1,%eax     # manually encoded
 31:   81 c1 d2 04 00 00       add    $0x4d2,%ecx   # large immediate, can't get GAS to encode this way with $1 other than doing it manually

 37:   67 8d 80 01 00 00 00      {disp32} lea  1(%eax), %eax
 3e:   67 48 8d 80 01 00 00 00   {disp32} lea  1(%eax), %rax


        mov  0(%rdi), %eax      # the 0 optimizes away
  46:   8b 07                   mov    (%rdi),%eax
{disp8}  mov  (%rdi), %eax      # adds a disp8 even if you omit the 0
  48:   8b 47 00                mov    0x0(%rdi),%eax
{disp8}  ds mov  (%rdi), %eax   # with a DS prefix
  4b:   3e 8b 47 00             mov    %ds:0x0(%rdi),%eax
{disp32} mov  (%rdi), %eax
  4f:   8b 87 00 00 00 00       mov    0x0(%rdi),%eax
{disp32} mov  0(,%rdi,1), %eax    # 1c extra latency on SnB-family for non-simple addressing mode
  55:   8b 04 3d 00 00 00 00    mov    0x0(,%rdi,1),%eax

GAS is strictly less powerful than NASM for expressing longer-than-needed encodings.

Bursiform answered 12/4, 2018 at 15:0 Comment(2)
Obsolete or deprecated?Samanthasamanthia
@MichaelPetch: good point, I hadn't realized how new the {disp32} syntax was. Just deprecated in the latest binutils, not obsolete yet.Bursiform

Let's look at a specific piece of code:

    cmp ebx,123456
    mov al,0xFF
    je .foo

For this code, none of the instructions can be replaced with anything else, so the only options are redundant prefixes and NOPs.

However, what if you change the instruction ordering?

You could convert the code into this:

    mov al,0xFF
    cmp ebx,123456
    je .foo

After re-ordering the instructions, the mov al,0xFF could be replaced with or eax,0x000000FF or or ax,0x00FF.
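Byte counts for those choices (from the standard encodings; as the comments note, the or forms also read the old EAX and write flags, and the 66-prefixed imm16 form risks an LCP stall on Intel):

    mov al, 0xFF          ; 2 bytes: B0 FF
    or  ax, 0x00FF        ; 4 bytes: 66 0D FF 00 (AX short form, imm16)
    or  eax, 0x000000FF   ; 5 bytes: 0D FF 00 00 00 (EAX short form, imm32)

Each or also has a one-byte-longer ModR/M encoding (81 /1), giving still more length options.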

For the first instruction ordering there is only one possibility, and for the second instruction ordering there are 3 possibilities; so there's a total of 4 possible permutations to choose from without using any redundant prefixes or NOPs.

For each of those 4 permutations you can add variations with different amounts of redundant prefixes, and single and multi-byte NOPs, to make it end on a specific alignment/s. I'm too lazy to do the maths, so let's assume that maybe it expands to 100 possible permutations.

What if you gave each of those 100 permutations a score, based on things like how long it would take to execute, how well it aligns the instruction after this piece, and whether size or speed matters? The scoring can include micro-architectural targeting (e.g. maybe for some CPUs the original permutation breaks micro-op fusion and makes the code worse).

You could generate all the possible permutations and give them a score, and choose the permutation with the best score. Note that this may not be the permutation with the best alignment (if alignment is less important than other factors and just makes performance worse).

Of course you can break large programs into many small groups of linear instructions separated by control flow changes; and then do this "exhaustive search for the permutation with the best score" for each small group of linear instructions.

The problem is that instruction order and instruction selection are co-dependent.

For the example above, you couldn't replace mov al,0xFF until after the instructions were re-ordered; and it's easy to find cases where you can't re-order the instructions until after you've replaced (some) instructions. This makes it hard to do an exhaustive search for the best solution, for any definition of "best", even if you only care about alignment and don't care about performance at all.

Jerry answered 11/4, 2018 at 18:0 Comment(2)
or eax,0x000000FF has a "false" dependency on the old value of EAX. Of course, so does mov al, 0xff on many CPUs. or ax,0x00FF also has a length-changing prefix stall on Intel CPUs. Also, since it's (E)AX, you have the choice of 2 encodings for those OR instruction, with or without a ModR/M byte. (Same for the mov-immediate: you could use a 3-byte mov r/m8, imm8 instead of 2-byte mov r8, imm8.) Also, often you could look and see that future use of EAX doesn't care about the high bytes.Bursiform
maybe for some CPUs the original permutation breaks micro-op fusion and makes the code worse). IDK why you said "maybe". It's obviously true that putting a mov between cmp/je is worse on mainstream Intel / AMD CPUs since Core2 / Bulldozer. (But overall nice answer; yeah reordering instructions will often open up opportunities to clobber flags with longer instructions.)Bursiform

I can think of four ways off the top of my head:

First: Use alternate encodings for instructions (Peter Cordes mentioned something similar). There are many ways to encode the ADD operation, for example, and some of them take up more bytes:

http://www.felixcloutier.com/x86/ADD.html

Usually an assembler will try to choose the "best" encoding for the situation whether that is optimizing for speed or length, but you can always use another one and get the same result.
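For instance, add eax, 1 alone has several encodings of different lengths; NASM picks the shortest, so the longer ones here are forced with db (a sketch; the byte values are the standard encodings):

    add eax, 1                   ; 3 bytes: 83 C0 01 (sign-extended imm8)
    db 0x05, 1, 0, 0, 0          ; 5 bytes: add eax, 1 (EAX-only form, 05 id)
    db 0x81, 0xC0, 1, 0, 0, 0    ; 6 bytes: add eax, 1 (ModR/M imm32 form, 81 /0 id)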

Second: Use other instructions that mean the same thing and have different lengths. I'm sure you can think of countless examples where you could drop one instruction into the code to replace an existing one and get the same results. People that hand optimize code do it all the time:

shl  eax, 1        ; doubles EAX
add  eax, eax      ; same result, different encoding
imul eax, eax, 2   ; same result again (mul takes no immediate operand)
etc etc

Third: Use the variety of NOPs available to pad out extra space:

nop
and eax, eax       ; writes flags, so not a true NOP
sub eax, 0         ; likewise writes flags
etc etc

In an ideal world you'd probably have to use all these tricks to get code to be the exact byte length you want.

Fourth: Change your algorithm to get more options using the above methods.

One final note: Obviously targeting more modern processors will give you better results due to the number and complexity of instructions. Having access to MMX, SSE, SSE2 (XMM), floating-point, etc. instructions could make your job easier.

Appleton answered 19/1, 2018 at 21:53 Comment(4)
Yeah, the question was really about the First method, i.e., a general recipe for lengthening instructions, since I don't want to add redundant nops (third method). Second and fourth methods are interesting, but are kind of specific and would be hard to do an automated way (second method could be automated in some cases, but I think it is quite limited).Maggee
and eax,eax isn't a NOP; it writes flags. When you need compat with CPUs that don't support long NOPs, it's common to use lea as a NOP, because you can make the address mode take a variable amount of space while still just copying a register to itself: SIB or not, and disp32/8/0.Bursiform
@Peter Cordes that's true, and eax, eax does affect flags, but it doesn't necessarily matter. Optimizing is always a trade off.Appleton
@Sparafusile: Right, but if you want a 2-byte NOP, 66 90 is strictly better than and eax,eax (unless it's actually useful to break a dependency on flags at that point, e.g. before a variable-count shift). A true NOP only uses up a uop slot, but and also writes a physical register (which can limit the out-of-order window instead of the ROB size).Bursiform

Depends on the nature of the code.

Floating-point-heavy code

AVX prefix

One can resort to the longer VEX-encoded (AVX) forms for most SSE instructions. Note that there is a fixed penalty when switching between SSE and AVX on Intel CPUs [1][2]. Avoiding it requires vzeroupper, which can itself be treated as another NOP for SSE code, or for AVX code that doesn't need the upper 128 bits.
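A sketch of the length difference (byte counts from the standard encodings):

    addps  xmm1, xmm0         ; 3 bytes: 0F 58 C8 (legacy SSE)
    vaddps xmm1, xmm1, xmm0   ; 4 bytes: C5 F0 58 C8 (2-byte VEX); a {vex3} encoding adds one more byte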

SSE/AVX NOPs

Typical NOPs I can think of are:

  • XORPS a register with itself (use the SSE/AVX integer variants for integer code)
  • ANDPS a register with itself (likewise)
Lechner answered 11/4, 2018 at 16:30 Comment(4)
x86 already has long NOPs which you'd use instead of a useless andps that will still tie up an ALU. This question is about making existing instructions longer so you can avoid NOPs. Mixing VEX-128 and non-VEX is viable for 128-bit-only code, which is sometimes what you want. (e.g. SIMD integer with AVX1 but not AVX2)Bursiform
As if blocking the SIMD ALU's for one cycle would matter if they are/were unused... it all depends on the code and architecture.Lechner
Ok, but 3-byte andps has no advantage over 66 67 90 nop on any x86 CPU I'm aware of. P5 Pentium took extra cycles to decode any prefixes at all (I think), but it didn't support SSE1, only MMX. Moreover, any CPU that supports SSE1 also supports long-NOPs 0F 1F /0 felixcloutier.com/x86/NOP.html, which is always going to be strictly better than andps: consuming fewer microarchitectural resources like physical registers or whatever until it retires. Also note that xorps xmm0,xmm0 is a zeroing idiom, not a NOP. Sure you can redo it if a register already needs to be zeroed...Bursiform
Your answer spurred me to write a proper one, so... thanks, I think :PBursiform
