Why are there no NAND, NOR and XNOR instructions in X86?
  • They're one of the simplest "instructions" you could perform on a computer (they're the first ones I'd personally implement)
  • Performing NOT(AND(x, y)) doubles execution time AND dependency chain length AND code size
  • BMI1 introduced andn, which is a meaningful addition and a genuinely distinct operation - so why not the ones in the title of this question?
  • You usually read answers along the lines of "they take up valuable opcode space", but then I look at all of the kmask operations introduced with AVX512, which, by the way, include NAND and XNOR...
  • Optimizing compilers could generate better code
  • It gets way worse with SIMD => there is no NOT instruction, which requires tripling of execution time, dependency chain length (EDIT: not actually true; thanks @Peter Cordes) and code size, instead of doubling:
vpcmpeqd  xmm15, xmm15, xmm15
vpor      xmm0,  xmm0,  xmm1
vpandn    xmm0,  xmm0,  xmm15
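(For reference, here is a minimal Python model of what that 3-instruction sequence computes, for a single 32-bit lane. The helper functions mirror the instruction semantics; the names and the harness are illustrative only.)

```python
MASK = 0xFFFFFFFF  # model one 32-bit SIMD lane

def vpcmpeqd(x, y):
    # equal lanes produce all-ones, unequal lanes produce zero
    return MASK if x == y else 0

def vpor(x, y):
    return (x | y) & MASK

def vpandn(src1, src2):
    # note the operand roles: dst = ~src1 & src2
    return (~src1 & src2) & MASK

def simd_nor(a, b):
    ones = vpcmpeqd(0, 0)   # vpcmpeqd xmm15, xmm15, xmm15 -> all-ones
    t = vpor(a, b)          # vpor     xmm0,  xmm0,  xmm1
    return vpandn(t, ones)  # vpandn   xmm0,  xmm0,  xmm15 -> ~(a|b)
```

So the whole sequence computes NOR(a, b) = ~(a | b), with the all-ones generation sitting off the main dependency chain.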
Sulfonation answered 6/1, 2021 at 13:13 Comment(11)
You can do that NOT-operation with vpxor, by the way. Also, what about vpternlogd (a single instruction that implements 256 logic operations)? Its existence surely makes some sort of argument. – Calyx
You can (for all 3), but you still need a mask with all bits set to 1 - or am I missing something? Throughput maybe? vpternlogd is AVX512-only, isn't it? As long as AMD doesn't implement it I don't see it as a real instruction set :D And it doesn't apply to 64-bit registers anyway. – Sulfonation
Yes, you still need that all-ones vector, though I don't think that's as bad as you think: that vpcmpeqd is considered to be independent of its input (that goes back to Core 2 .. well, the non-VEX version anyway, obviously there was no AVX at that point), and in any case it's not inside the dependency chain of the actual NOT-operation, it's only a side-chain. – Calyx
You can often arrange your code to not need an inversion, e.g. checking the opposite FLAG condition. Not always; of course when you're doing a chain of bitwise things it can come up. The real speedup from adding more such instructions to BMI1 probably would have been quite small for most general workloads like SPECint. And yeah, it would have made sense for some SIMD version before AVX-512, like AVX2 or SSE4, but since they didn't, there's little point adding them now that vpternlogd exists. Unless Intel is going to create new 256-bit-only extensions that AMD might want to implement... – Jotunheim
What's the trick using vpandn and a zeroed register? – Calyx
@harold: never mind, 0 & ~x = 0, not ~x. Brain fart on my part; you'd still need all-ones, like andn(x, ones). Like the OP is doing in their last code block, which they claim triples the critical-path length for NOR when it's actually only 2 cycles on the critical path from xmm0 or xmm1 to the result. – Jotunheim
@harold: replace vpandn xmm0, xmm0, xmm15 with vpandn xmm0, xmm15, xmm0, where xmm15 is now a zero register instead of a register full of 1 bits, combined with the vpxor you brought up. LLVM does not generate such code in my tests. – Sulfonation
Questions like "Why didn't they add this perfectly reasonable instruction?" are usually unanswerable, since the only people who can answer it are the CPU designers. Everybody else is just speculating. Questions like "What alternatives are available?" are more suited to this site. – Metallography
@MrUnbelievable92: vpandn is dst = ~src1 & src2. Using it with src1 = 0 just does dst = src2, so it's a slow vmovdqa. – Jotunheim
@RaymondChen: Sometimes there are very plausible reasons why architects wouldn't have added them, and in x86's case we can talk about 8086 and the evolution of the ISA, and what forces usually drive adding new instructions to existing ISAs, vs. being present from the start. e.g. MIPS had nor from the start as a more powerful way to enable a NOT in one instruction. So I put my amateur CPU-architect hat on and wrote an answer. I 100% agree it's not a programming question, but I personally like this kind of question when the actual question is interesting. – Jotunheim
Retrocomputing has a lot of these, but in this case the design history isn't really "retro" yet. So unless a CPU-architecture stack site is created, the [cpu-architecture] tag seems to be where these questions belong on the stackexchange network, if/when we choose to allow them instead of closing them. Like I said, I enjoy seeing this kind of question. I know that the [cpu-architecture] tag isn't my personal sandbox (even though I'm the only user with a gold badge in it), but similar "why does this design make sense" type of questions have upvotes and answers. – Jotunheim

Those instructions wouldn't be as valuable as you imagine, and once a base ISA has been created, architects typically don't add new instructions unless there's a big win for some important use-case. (e.g. MMX isn't a big win overall for most code, but was a huge speedup for video/audio codecs as one of the early use-cases.)

Remember, most code isn't doing branchless bithacks. That only became much more common with SIMD, decades after 8086. I doubt most programmers would rather have nor than or (8086 had no space left for more standard ALU instruction encodings that follow its normal patterns¹). A lot of code spends a lot of its time comparing-and-branching, looping over data structures (and stalling for memory), or doing "normal" math. Certainly bit-manipulation code exists, but a lot of code doesn't involve much of that.

Saving an instruction or two all over the place will help, but only if you can compile your whole application with these new instructions. (Although most of BMI1 and BMI2 are actually like that, e.g. SHLX/SHRX for 1-uop copy-and-shift-by-variable, but Intel still added them to patch over the really crappy 3-uop shift-by-cl.) That's fine if you're targeting a specific server (so you can build with -march=native), but a lot of x86 code is ahead-of-time compiled for use on random consumer machines. Extensions like SSE can greatly speed up single loops, so it's usually viable to dispatch to different versions of one single function to take advantage, while keeping the baseline requirement low.

But it wouldn't work that way for newly-added versions of the instructions you're suggesting, so the benefit of adding them is significantly lower. And they weren't already present because 8086 is super cramped.

But most ISAs don't have these: not ARM, not even PowerPC, which chooses to use the coding space in its 32-bit instruction words to have a lot of opcodes. (Including neat stuff like rlwinm, rotate-and-mask with a bit-range, and other bitfield insert/extract-to-arbitrary-position stuff.) So it's not just a matter of 8086 legacy screwing x86-64 yet again; it's that most CPU architects haven't considered it worth adding opcodes for these, even in a RISC with lots of space.

Although MIPS does have a nor instead of a not. (MIPS xori zero-extends the immediate, so it couldn't be used to NOT a full register.)


SIMD code:

Note that once you've created an all-ones vector once, you can reuse it in a loop. Most SIMD code is in loops, although careful use of SIMD for a single struct can be good.

SIMD NOT only adds 1 cycle to the critical path, for a total of 2 cycle latency for your NOR implementation. In your example, pcmpeqd is off the critical path and has no dependency on the old value of the reg on almost all CPUs. (Still needs a SIMD execution unit to write the ones, though). It costs throughput but not latency. Execution time might depend on either throughput or latency, for a given block of code. (How many CPU cycles are needed for each assembly instruction? (it's not that simple) / What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?)

BTW, compilers often use vpxor with all-ones instead of vpandn; the only advantage is with a memory source operand, where you can NOT-and-load with xor, unlike vpandn where the optionally-memory operand (src2) is the one that's not inverted: dst = ~src1 & src2.
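To spell out why the two forms are interchangeable here, a quick Python check on one 32-bit lane (the helper names just mirror the instruction semantics; the specific value is arbitrary):

```python
MASK = 0xFFFFFFFF  # one 32-bit lane

def vpxor(x, y):
    return (x ^ y) & MASK

def vpandn(src1, src2):
    # dst = ~src1 & src2
    return (~src1 & src2) & MASK

ones = MASK
x = 0xDEADBEEF

# XOR against all-ones and ANDN with all-ones as src2 both compute NOT:
not_via_xor = vpxor(x, ones)
not_via_andn = vpandn(x, ones)
```

Both give ~x, which is why a compiler is free to pick whichever encoding suits the operand it has in memory.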


Scalar code

You can often arrange your code to not need an inversion, e.g. checking the opposite FLAG condition after an OR. Not always; of course when you're doing a chain of bitwise things it can come up, probably moreso with SIMD.

The real speedup from adding more such instructions to BMI1 or a future extension probably would have been quite small for most general workloads like SPECint.

More valuable than integer xnor etc. would probably be non-destructive VEX versions of common integer instructions like sub that can't be done with LEA. So lots of mov/sub sequences could become vsub. Also maybe imul, or, maybe and, and perhaps shl/shr/sar-immediate. But sure, if you're adding stuff, might as well have nand, nor, and xnor. And maybe scalar abs, and setcc r/m32 to avoid the stupid xor-zeroing or movzx you need to booleanize into a 32-bit integer. (While you're at it, mov r/m32, sign_extended_imm8 would also be good for code-density if you could find a one-byte opcode for it, e.g. one of the ones that 64-bit mode freed up.)

There's a whole laundry list of bad or short-sighted design decisions it would be nice to reverse (or that it would have been nice if AVX had fixed), e.g. that cvtsi2sd xmm0, eax merges into XMM0 so it has a false dependency, leading GCC to spend an extra insn on xor-zeroing the destination. AVX was a chance to change that behaviour for the VEX version, and maybe it could have been handled internally by giving the existing execution unit the physical zero-reg as the merge target. (Which exists in the physical register file on SnB-family; that's why xor-zeroing can be fully eliminated in rename, like mov-elimination.) But nope, Intel kept everything as much like the legacy-SSE versions as they possibly could, preserving that short-sighted Pentium III design decision. :( (PIII split xmm regs into two 64-bit halves: only writing the low half was good for SSE1 cvtsi2ss. Intel continued with the merging for SSE2 cvtsi2sd in P4, for consistency I guess.)


It might have made sense to add negated-boolean instructions in some SIMD version before AVX-512, like SSE4.1 (which added a bunch of miscellaneous integer stuff and made things more orthogonal, and which only arrived in 45nm Core 2, so transistor budgets were a lot higher than in MMX or SSE1/2 days), or AVX (which opened up a lot of coding space with VEX).

But since they didn't, there's little point adding them now that vpternlogd exists. Unless Intel is going to create new legacy-SSE or 256-bit-only VEX extensions that AMD might want to implement...
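For completeness, the vpternlogd immediate for NAND/NOR/XNOR is trivial to derive: per Intel's documented convention, bit (a << 2) | (b << 1) | c of the imm8 is the result for that input combination. A Python sketch of that encoding (the generator function is mine, not an intrinsic):

```python
def ternlog_imm(f):
    """Build the vpternlog imm8 truth table for a 3-input boolean f.
    Bit (a << 2) | (b << 1) | c of the immediate holds f(a, b, c)."""
    imm = 0
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                if f(a, b, c) & 1:
                    imm |= 1 << ((a << 2) | (b << 1) | c)
    return imm

# Two-input functions of the first two operands (third operand ignored):
NOR_IMM = ternlog_imm(lambda a, b, c: ~(a | b))    # 0x03
NAND_IMM = ternlog_imm(lambda a, b, c: ~(a & b))   # 0x3F
XNOR_IMM = ternlog_imm(lambda a, b, c: ~(a ^ b))   # 0xC3
```

So e.g. vpternlogd zmm0, zmm1, zmm2, 0x03 is a one-instruction NOR of zmm0 and zmm1, which is exactly why there's little pressure to add dedicated mnemonics once this exists.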

(Legacy-SSE would make it usable even in their Silvermont-family CPUs, and in Pentium/Celeron CPUs, none of which decode VEX prefixes. That's why unfortunately even Skylake Pentiums disable BMI1/2 support along with AVX1/2/FMA. This is really dumb and means we're no closer to being able to use BMI1/2 as a baseline for ahead-of-time compiled stuff that should run on "modern desktops".)


Opcode coding space

VEX has lots of coding space, and mask instructions use that. Also, AVX-512 is only implemented by high-end CPUs; it will be a long time if ever before Intel's low-power Silvermont family CPUs implement it. So needing to decode all those different VEX-coded mask instructions is something AVX-512 CPUs just have to deal with.

AVX-512 (or a predecessor) was originally designed for Larrabee, a GPU project which turned into Xeon Phi compute cards. So AVX-512 ISA-design choices don't fully reflect what you might design with general-purpose usage in mind. Although having lots of relatively small cores would mean you'd want to avoid anything that inflated decoder die-area or power too much, so it's not unreasonable.

But without VEX, x86 opcode space is very crowded (literally no 1-byte opcodes left in 32-bit mode, and few 0f xx left: http://ref.x86asm.net/coder32.html). Intel (unlike AMD) still for some reason likes to make some CPUs that can't decode VEX prefixes. Of course they could change that and add VEX decoding to Silvermont so they could have VEX-coded integer instructions without supporting AVX (or all of BMI2). (BMI2 includes pext/pdep, which are expensive to implement fast in a dedicated execution unit. AMD chooses to micro-code them, so they're very slow, but that lets code use other BMI2 instructions usefully.)

(Unfortunately there's no way for a CPU to advertise (via CPUID) that it supports only 128-bit-vector-size AVX instructions, which would have allowed narrower CPUs to still get non-destructive instructions. OTOH, without some forward-compatible way for code to use wider instructions on CPUs that do support them, making 128-bit AVX code to optimize for current CPUs might end up being called "good enough", with nobody bothering to make 256-bit versions for CPUs that can support them.)

Footnote 1: opcodes for original-8086 instructions

Just getting every different opcode decoded was a challenge for 8086, and each ALU instruction has about 8 different opcodes: memory dest, memory source, immediate source, and special-case no-modrm AL/AX forms. And times two for 8 and 16-bit versions of each of those. Plus xnor r/m16, sign_extended_imm8. Of course the immediate forms can use the /r field in ModRM as extra opcode bits, but xnor r/m8, r and xnor r, r/m8 and the 16-bit forms would need 4 separate opcode bytes, and so would xnor al, imm8 and xnor ax, imm16, so that's 6 whole opcode bytes per instruction, plus /r slots in the overloaded immediate-group opcodes (the way 80 /7 is cmp r/m8, imm8).

(semi-related: https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code/160739#160739 re: short-form AL,imm8 encodings.)

Part of the pattern you can see in the original-8086 opcodes is that one bit selects between r/m destination vs. r/m source, and another bit selects between 8 and 16-bit operand-size (Is there a pattern to x86 op codes? (other than direction and size bits) / Are x86 opcodes arbitrary?). So doing it differently for a few rarer instructions (by leaving out memory-dst or 8-bit forms, for example) might have broken the pattern, and if so would have needed more extra transistors than the standard patterns for feeding the ALU after a load or register fetch, or load/alu/store.
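Concretely, the regular ALU block decodes very mechanically. Here's a Python sketch of that d/w bit pattern (the decoder function is mine for illustration; the opcode layout itself is the documented 8086 one):

```python
# The eight ALU operations in the regular 0x00-0x3D opcode block,
# in order of their bits-5..3 selector.
ALU_OPS = ("add", "or", "adc", "sbb", "and", "sub", "xor", "cmp")

def decode_alu(opcode):
    """Decode one of the regular 8086 ALU opcodes:
    bits 5-3 select the operation, bit 1 is d (direction), bit 0 is w (width).
    Forms 0-3 take a ModRM byte; forms 4-5 are the short AL/AX,imm encodings."""
    form = opcode & 7
    if opcode >= 0x40 or form >= 6:
        raise ValueError("not one of the regular ALU forms")
    op = ALU_OPS[(opcode >> 3) & 7]
    if form < 4:
        d = (opcode >> 1) & 1  # 1: register is destination; 0: r/m is destination
        w = opcode & 1         # 0: byte operands, 1: word operands
        return (op, "r,r/m" if d else "r/m,r", 16 if w else 8)
    return (op, "AL/AX,imm", 16 if form == 5 else 8)
```

Each operation occupies six consecutive opcode bytes following the same template, which is why a hypothetical xnor would have needed another full six-byte slot that 8086 simply didn't have.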

In fact, I don't think 8086 left enough room for even one more ALU instruction that supported all the standard forms like add or or. And 8086 didn't decode any 0f xx opcodes; that came later for extensions.

Jotunheim answered 6/1, 2021 at 16:30 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.