Is there some benefit in the following assembly commands?

In our systems programming class, we're being taught assembly language. In most of the sample programs our professor has shown in class, he uses:

XOR CX, CX

instead of

MOV CX, 0

or

OR AX, AX
JNE SOME_LABEL

instead of

CMP AX, 0
JNE SOME_LABEL

or

AND AL, 0FH        ; To convert input ASCII value to numeral
; The value in AL has already been checked to lie b/w '0' and '9'

instead of

SUB AL, '0'

My question is: is there some kind of performance benefit to using AND/OR or XOR instead of the alternative (easier to read and understand) method?

Since these programs are generally shown to us during theory lecture hours, most of the class is unable to actually evaluate them verbally. Why spend 40 minutes of lecture explaining these trivial statements?

Ladin answered 12/8, 2013 at 17:9 Comment(2)
The instructions may be shorter, and they don't produce null bytes. - Manille
...and there are special optimizations like register renaming that recognize xor eax,eax. - Satori
A
6
XOR CX, CX  ;0x31 0xC9

Uses only two bytes: opcode 0x31 and ModR/M byte that stores source and destination register (in this case these two are same).

MOV CX, 0  ;0xB9 0x00 0x00

Needs more bytes: opcode 0xB9 (the 0xB8+reg form, which encodes the destination CX in the opcode itself, so there is no ModR/M byte) followed by a two-byte immediate filled with zeroes. There is no difference from a clocking perspective (both take only one clock), but mov needs 3 bytes while xor uses only two.

OR AX, AX  ;0x09 0xC0

again uses only opcode byte and ModRM byte, while

CMP AX, 0  ;0x3D 0x00 0x00 <-- or 0x83 0xF8 0x00

uses three bytes. Here it can use opcode 0x3D plus a word immediate representing zero, because x86 has special opcodes for some operations with the Accumulator register. For any 16-bit register, the generic form (opcode 0x83, ModR/M, sign-extended byte immediate) is also three bytes. It's again the same when talking about CPU clocks.

There's no difference to processor when executing

AND AL, 0x0F  ;0x24 0x0F  <-- again special opcode for Accumulator

and

SUB AL, '0'  ;0x2C 0x30  <-- again special opcode for Accumulator

(both are two bytes), but when you subtract ASCII zero, you can't be sure that the value left in the Accumulator is not greater than 9, whereas AND at least bounds the result to 0..15. Also, AND sets OF and CF to zero, while SUB sets them according to the result. ANDing can be safer, but my personal opinion is that this usage depends on context.
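One way to check the byte counts discussed above is to assemble the instructions and read the listing. The following is a minimal sketch assuming NASM and 16-bit code; the file name `encodings.asm` is only a placeholder, and the bytes in the comments are the standard encodings an assembler would normally choose (for `cmp ax, 0` it may pick either three-byte form).

```nasm
; Assemble as a flat 16-bit binary and inspect the listing, for example:
;   nasm -f bin encodings.asm -o encodings.bin -l encodings.lst
; (encodings.asm is a hypothetical file name)
bits 16

    xor  cx, cx      ; 31 C9      2 bytes: opcode + ModR/M
    mov  cx, 0       ; B9 00 00   3 bytes: register is encoded in the opcode
    or   ax, ax      ; 09 C0      2 bytes: opcode + ModR/M
    test ax, ax      ; 85 C0      2 bytes: opcode + ModR/M
    cmp  ax, 0       ; 83 F8 00 or 3D 00 00, 3 bytes either way
    and  al, 0x0F    ; 24 0F      2 bytes: accumulator short form
    sub  al, '0'     ; 2C 30      2 bytes: accumulator short form
```

In 32-bit or 64-bit code the 16-bit forms additionally need an operand-size prefix, so the counts here apply to 16-bit mode only.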

Aesop answered 12/8, 2013 at 17:49 Comment(2)
CMP AX, 0 won't use 4 bytes; even with a different register cmp si, 0 would use cmp r/m16, sign_extended_imm8. Only mov and test lack sign-extended-imm8 forms for wider operand-size. Unless you told your assembler to be dumb and not use the shortest encodings. Also, and al, imm8 and sub al, imm8 are both 2 bytes. (felixcloutier.com/x86/sub shows the 2C ib encoding. You picked the 2D imm16 encoding for sub ax, '0'). See also Tips for golfing in x86/x64 machine code. - Demurral
mov cx, 0 does not use a ModR/M byte. - Demythologize
S
4

Apart from code size savings mentioned in the other answers, I thought I'd mention a few more things which you can read more about in Intel's optimization manual and Agner Fog's x86 optimization guide:

XOR REG,REG and SUB REG,REG (with REG being the same for both operands) are recognized by modern x86 processors as dependency breakers, meaning that they also serve a purpose in breaking false dependencies on previous register/flag values. Note that this doesn't necessarily apply if you clear an 8- or 16-bit register, but it will if you clear a 32-bit register.


OR AX, AX
JNE SOME_LABEL

I believe the preferred instruction would be TEST AX,AX. TEST can be macro-fused with any conditional jump (basically combined with the jump instruction into a single micro-op by the decoders) on modern x86 processors. CMP can only be fused with unsigned conditional jumps, at least prior to the Nehalem architecture. Again, I'm not sure if this is the case for 16-bit operands.
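As a small sketch of the TEST form suggested above (NASM syntax, 16-bit; the label names are made up for illustration):

```nasm
bits 16

; Hypothetical routine: branch on whether AX is zero.
check_ax:
    test ax, ax      ; flags from AX AND AX; AX itself is not written
    jnz  .nonzero    ; JNZ and JNE are the same instruction
    ; ... AX == 0 path ...
    ret
.nonzero:
    ; ... AX != 0 path ...
    ret
```

Unlike OR AX,AX, TEST does not write its destination register, so it leaves AX untouched as far as later instructions are concerned.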

Shores answered 12/8, 2013 at 19:31 Comment(1)
mov breaks dependencies on the previous value of a register, too. It only gets mentioned for xor and so on because in the general case the output does depend on the previous value, and so it needs special support to recognize that case. movzx, movd and so on all zero the rest of the dest reg, and thus break dep chains (as opposed to pinsrw, or movlhps). - Demurral
D
2
  1. Duplicate of What is the best way to set a register to zero in x86 assembly: xor, mov or and? - xor. Although most of those advantages don't apply for a register smaller than 32-bit, at least on modern CPUs. Maybe earlier P6-family CPUs would still special-case xor cx,cx if they rename CX separately from ECX, and CL and CH separately from CX. e.g. to avoid partial-register stalls if writing CL and then reading CX.

    But the code-size advantage always applies.

  2. Duplicate of Test whether a register is zero with CMP reg,0 vs OR reg,reg? - or ax,ax is less efficient on some CPUs than test ax,ax which is designed for this purpose. The use of or seems to be a holdover from 8080. Both save a byte of code-size over cmp ax, 0, but all of those set FLAGS the same way (see my linked answer for that and the 8080 ora a idiom.)

  3. No advantage to AND here. Both are the same code-size (2 bytes). AND reminds you that the low 4 bits of an ASCII digit are the integer value.

    Generally sub al, '0' is more useful because you can do it as part of checking if a character is a digit or not. e.g. sub al, '0' / cmp al, 9 / ja non-digit, otherwise you have the integer value in a register. Using and as the first step there would always create a result in the 0..15 range, thus giving many false positives. See NASM Assembly convert input to integer? for a use-case: a loop that stops at the first non-digit character.

    See also What is the idea behind ^= 32, that converts lowercase letters to upper and vice versa? re: range checks on ASCII.
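The sub / cmp / ja pattern from point 3 can be sketched as a digit-parsing loop. This is an illustrative routine, not code from the linked answers: it assumes 16-bit NASM, a string pointed to by DS:SI that ends in some non-digit byte, and it does no overflow checking. The name `parse_uint` is hypothetical.

```nasm
bits 16

; Hypothetical routine: parse unsigned decimal digits at DS:SI into AX.
; Stops at the first non-digit byte. Clobbers BX, CX, DX; advances SI.
parse_uint:
    xor  ax, ax          ; result = 0
    xor  bx, bx          ; BH stays 0 so BX can be added as a 16-bit value
    mov  cx, 10          ; decimal base
.next:
    mov  bl, [si]        ; load next character
    sub  bl, '0'         ; '0'..'9' -> 0..9; anything else becomes > 9 unsigned
    cmp  bl, 9
    ja   .done           ; one unsigned compare rejects both < '0' and > '9'
    mul  cx              ; DX:AX = AX * 10
    add  ax, bx          ; append the new digit
    inc  si
    jmp  .next
.done:
    ret
```

Replacing the `sub bl, '0'` with `and bl, 0x0F` would leave every input in the 0..15 range, so the range check would accept many non-digit characters.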

Demurral answered 11/6, 2022 at 2:6 Comment(0)
F
1

An important difference is whether they affect the CPU status flags. When you use logical operations such as xor, or, etc., the flags are affected. So:

XOR  CX, CX

Will not only zero out CX, but, for example, the zero flag of the CPU will be set. The mov instruction does not affect flags. So:

MOV  CX, 0

Will not, for example, set the zero flag.
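A short sketch of why this matters when a flag test follows the zeroing instruction (NASM syntax, 16-bit; the labels are made up for illustration):

```nasm
bits 16

; Hypothetical example: zero CX between a compare and its branch.
with_mov:
    cmp  ax, bx      ; set flags from AX - BX
    mov  cx, 0       ; CX = 0, flags from the CMP are preserved
    je   .equal      ; still tests whether AX == BX
    ret
.equal:
    ret

with_xor:
    cmp  ax, bx      ; set flags from AX - BX
    xor  cx, cx      ; CX = 0, but ZF is now 1 and the CMP result is lost
    je   .equal      ; always taken
    ret
.equal:
    ret
```

So the two are interchangeable only where the flags are dead at that point.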

Flogging answered 12/8, 2013 at 18:12 Comment(2)
When is ZF needed after this xor usage? - Aesop
@user35443, it could be needed if you are checking flags at a point which may have been arrived at from more than one place in the code. So the place where the check occurs may not have knowledge that the prior flag-affecting instruction was xor. - Flogging
T
1

In addition to the instruction scheduling mentioned previously, which instruction is faster may also depend on the actual instruction sequence being executed.

For an example of a seemingly innocent instruction having a great impact, see page 8 in this paper by Torbjörn Granlund of GMP fame. In example three on the top right of the page, a very fast division loop begins with the "nop" instruction. According to footnote 4 on the same page, the absence of the nop instruction causes the loop to execute 1 clock cycle slower. Granlund suggests experimenting by placing other nops inside the loop to achieve further speed-ups.

My initial, gut reaction to this was more instructions = more time. However, there is clearly much more to instruction scheduling and execution than can be gleaned from manuals.

Thai answered 16/8, 2013 at 8:16 Comment(1)
That probably aligns later instructions better for the complex/simple decoders. Core2 predates the loop cache (Nehalem) and uop cache (Sandybridge), so decoder throughput was a factor even for short loops. - Demurral
F
-2

The XOR operation works faster than MOV since it is a bitwise operation; all bitwise operations are performed faster by the CPU.

Ferne answered 12/8, 2013 at 19:29 Comment(3)
Huh? Why would you use a shifter to implement XOR? - Shores
I meant to write bitwise, sorry, my bad. - Ferne
That's not true. Both mov reg, imm and xor reg, reg take only one clock. - Aesop
