Increasing Efficiency of binary -> gray code for 8086

(This answer was written based on the first version of the question, where there was some not-already-optimal hex-print code, and it was complete source for a .exe program. The updates to the question have removed the only parts that had room to be optimized except for ILP which is irrelevant on 8086, so I don't intend to remove those parts of the answer.)

Code size optimizations (which correlate with speed on 8086 and especially 8088, see this retrocomputing answer):

bin2Gray part: no change, unless you count reloading from mem or making a constant. Just re-ordering instructions for ILP on Pentium and later. Or maybe a table for xlat.
byte->hex digits: 21 bytes, down from 32 (and fewer bytes of code fetch)
exit: 1 byte (ret) down from 4 (mov ah / int). Works for at least .com executables, which are also smaller.

I probably should have counted code size in total bytes that need to get fetched (i.e. bytes of instructions executed), rather than static code size, although that's also useful to optimize for.

Dropping the looping from the 21-byte bin2hex part would cost a couple bytes of static code size, but reduce dynamic byte count by about 5, IIRC. And avoid any prefetch buffer discarding on taken branch, except for the int 10h of course.

int 20h can also exit (without needing any AH setup), but only from a .com executable. The interesting part to me is computing the ASCII digits in the desired register with compact code, but if you want a tiny overall program, a .com is the way to go. That also avoids setup of DS. (Although you wouldn't need to set up DS if you make a and EQU or = constant.)

Not attempted: taking advantage of initial register values, which apparently are reliable across some DOS versions. If you're pretending like you're writing a block that might be useful as part of a larger program, that's not viable.

Your program basically has two separate parts; calculating a Gray code, and bin->hex for a 1-byte value in a register. Extracting the nibbles separately doesn't usefully optimize backwards into the Gray-code calculation, so I think we can just keep those fully separated.

There are multiple different ways to make a Gray code (only one bit flips between consecutive values). AFAIK, x ^ (x>>1) is the cheapest to compute from binary, but it's not impossible there could be something you could do with only 2 instructions, given input in a register.

Also related: Gray code algorithm (32 bit or less) for gray->binary points out that the standard x ^ (x>>1) is multiplication in GF(2^k). So on very recent CPUs with Galois-Field instructions, you can do 16 bytes at a time with gf2p8affineqb, I think. (gf2p8mulb uses a fixed polynomial which I think isn't the one we want for this.)

8088 performance is mostly about memory access (including code fetch)

https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html shows instruction timings, but those timings are only execution, not code-fetch. 8088 has a 4-byte prefetch buffer (and only an 8-bit data bus), 6-byte on 8086 with its 16-bit bus. Supercat suggested in an answer there:

On the original 8088 processor, the easiest way to estimate execution speed is generally to ignore cycle counts and instead count memory accesses, including instruction fetches, and multiply by four.

I think it's about the same for 8086, except each access can be a whole 2-byte word. So straight-line code (no branches) can get fetched 2 bytes at a time.

For simplicity, I just annotated asm source with instruction size and cycle counts from the table without trying to model how the prefetch buffer would behave.

xlat (like al = ds:[bx+al]) is only 1 byte, and can be worth using if you don't mind having a 256-byte table. It takes 11 bytes to execute, but that includes the data access it makes. Not counting code-fetch, mov bl,al / shr al,1 / xor al,bl is 2+2+3 cycles, but 3 words of code size will cost 12 cycles to fetch. xlat takes almost that long, but when it finishes the prefetch buffer will have had some time to fetch later instructions, so I think it's even more of a win.

Still, it does require that table to come from somewhere, either from disk when your executable was loaded, or you have to pre-compute it. And you need to get a pointer into BX, so it may only be a win if you can do this in a loop.

But if you're using a table, you can combine both parts of the problem and look up both ASCII hex digit characters for the gray-code for a given binary, e.g. with mov dx, [bx + si] with table pointer in SI, binary byte in BL, and BH=0. (DX sets you up for outputting DL with a DOS call.) This of course would require your table to be 256 words (512 bytes). Having a tiny executable may be more valuable than saving a few cycles here; the actual I/O to the screen or a file is probably slow enough for it to not matter much. If you're doing this for multiple bytes, though, copying pairs of ASCII bytes into a buffer can be good.

There's one optimization that will help on more modern CPUs (starting with Pentium) that can run more than 1 instruction in parallel: copy the register, then shift the original so that can happen in the same cycle as the copy.

; optimized for Instruction-level Parallelism
;; input: AL    output: AL = bin_to_gray(AL)
;; clobbers: DL
   mov  dl, al         ; 2B  2 cycles (not counting code-fetch bottlenecks)
   shr  al, 1          ; 2B  2c
   xor  al, dl         ; 2B  3c

(For more about modern CPUs, see https://agner.org/optimize/. And also Can x86's MOV really be "free"? Why can't I reproduce this at all? - mov-elimination doesn't work on byte or word registers because that's merging into the low part of EDX. So even on CPUs with mov-elimination in general, it can't apply here so this optimization saves latency.)

I'm pretty sure there's no further room for improvement in bin -> gray. Even modern x86 doesn't have a copy-and-right-shift (except with the count in another register, BMI2 shrx, or for SIMD registers with AVX, but only for word/dword/qword element sizes). Also no right-shift-and-xor, so there's no avoiding a mov, and the shr and xor are obviously also necessary. XOR is add-without-carry, but I don't think that helps. Unless you had carryless multiply (pclmulqdq) and a multiplier constant to get two copies of the input at the right offsets from each other into the high half of a multiply result, you're going to need to do those operations separately. Or with Galois-Field New Instructions (GFNI): What are the AVX-512 Galois-field-related instructions for?

Still, if you want to exhaustively check, https://en.wikipedia.org/wiki/Superoptimization - ask a superoptimizer to look for sequences that produce the same AL result as the mov/shr/xor sequence.

In real use-cases, you'd normally want code that took data in a register, because that's how you should pass data to functions. After mov al, a, that's what your code is doing.

But if it was a global in memory, you could save a byte of code size by loading it twice instead of copying a register with mov, at the expense of speed. Or even better, make it an assemble-time constant. (Although if you do that, the next step is mov al, a ^ (a>>1) to do the calculation at assemble time.)

; a equ 0ACh         ; makes the following instructions 2 bytes each
;;; otherwise, with a being static storage, loading from memory twice sucks
    mov  al, a
    shr  al, 1    ; 2B, 2 cycles
    xor  al, a    ; reg,imm: 2B, 4 cycles on 8088.    reg,mem: 3B, 13+6 cycles

Byte -> 2 ASCII hex digits

This is the more interesting part.

Looping for only 2 iterations is sometimes not worth it, especially when you can save some work if you do separate things to each half. (e.g. low nibble is x & 0xf, high is x >> 4. Using rol/mov/and is not optimal.)

Tricks:

Prefer insn al, imm - x86 has short-form special cases for immediate operands with AL. (Also AX,imm16).
Wanting to do stuff in AL means it's more efficient to print with BIOS int 10h / AH=0Eh teletype output which takes its input in AL, and doesn't destroy any other registers. I think BIOS output will ignore DOS I/O redirection like foo > outfile.txt and always print to the screen.
There's an evil hack that abuses DAS to turn a 0..15 integer into an ASCII hex digit '0'..'9' or 'A'..'F' without branching. On 8086 (unlike modern x86) DAS is just as fast as typical integer instruction. See this codegolf.SE answer for a breakdown on exactly why it works; it's highly non-obvious, but it avoid branching so it's actually a big speedup on 8086.
BIOS / DOS calls generally don't modify AH, so setting it can be done outside the loop.
Then for code-size, instead of just unrolling, use the cl=4 as a loop counter to loop back and re-run some of the earlier code once (not including the shift). sub cl, 2 / jnz would work, but using the parity flag is a way to use dec cx (1B) / jpe to jump backwards once, then fall through the next time.
DOS programs (or at least .com programs) have SP pointing at the address of some code that exits cleanly. So you can exit via ret.

(I didn't look at improving your loop while keeping the overall strategy. Using AL for as many instructions as possible is prob. worth it, but running rol twice instead of shifting once costs a lot of cycles on 8086: 8 + 4*n for shift-by-CL.)

;; input: byte in AL.   output: print 2 ASCII hex digits with BIOS int 10h
;; clobbers: CX, DX
hexprint_byte:
     mov   ah, 0Eh                       ; BIOS teletype call #
;     push  ax            ; 1B 15c
     mov   dx, ax        ; 2B 2c         ; save number, and AH=call number
     mov   cl, 4         ; 2B 4c
     shr   al, cl        ; 2B 8+4*4 cycles   isolate the high nibble
.loop:
     cmp   al, 10        ; 2B 4c  set CF according to digit <= 9
     sbb   al, 69h       ; 2B 4c  read CF, set CF and conditionally set AF
     das                 ; 1B 4c   magic, which happens to work
     int   10h           ; 2B        BIOS teletype output (AL), no return value
;     pop   ax            ; 1B 12c    ; would do one extra pop if you used this instead of mov/xchg, so you'd need jmp ax instead of ret.  But AND destroys AX
     xchg  ax, dx        ; 1B 3c     ; retrieve the original again (with AH=0Eh call number)
     and   al, 0Fh       ; 2B 4c     ; isolate the low nibble this time

     dec   cx            ; 1B 3c     ; PF is set from the low byte only, CH garbage isn't a problem.  
     jpe   .loop         ; 2B 4c-not-taken, 16c-taken
               ;  4-1 = 3, (0b11) which has even parity
               ; so JPE is taken the first time, falls through the 2nd 

;; size = 21 bytes

Then you can exit the program with ret, or with int 20h.

This is NASM syntax; if your assembler doesn't like .loop then change it to something else. (NASM doesn't allow 2: as a local label so I had to pick different names anyway.) I tested this single-stepping on Linux to make sure the loop branch was taken once, and that I got the right values in AH/AL when int 10h was reached. (I replaced it with a NOP since I actually built this into a 32-bit static executable so I could easily single-step it in GDB, without messing around with an obsolete 16-bit dev setup. Byte counts are from assembling as 16-bit, of course.)

For speed, it would cost only a few more bytes to just duplicate the cmp/sbb/das/int 10h, saving the dec/jpe. (Like 7 bytes instead of 3 for the dec/jpe). The xchg / AND after the first print are necessary either way.

Taken branches cost 16 cycles, and that would avoid a 2nd redundant/useless execution of xchg / and (3 bytes / 7 cycles), and of the loop overhead.

You asked for small (and fast on 8086), so that's what I did. This sacrifices everything else, very much including readability, to save bytes. But that's the fun of code-golfing in assembly!

Unfortunately it's also definitely not simpler, like you asked for in the title. Simpler might use a lookup table, perhaps with xlatb. That might also be faster on 8086, especially if you want to avoid the DAS hack.

Another trick that might help for code size (but very bad for performance) is aam 16 to set AH= quotient = leading digit, AL = remainder = trailing digit (low). (Note that's opposite of div bl) Displaying Time in Assembly shows an example of using it with BIOS int 10h output for a 2-digit decimal number. (Normally AAM is used with an immediate 10, and apparently NEC V20 CPUs ignore the immediate and always divide by 10. Intel CPUs just do immediate division of AL). On 8088/8086, AAM takes 83 cycles, similar to div, which is basically what it does. Using HW division with a power of 2 is generally horrible.

A version using AAM 16 came in at 23 bytes, not using any looping (I didn't have any constants in registers to exploit, so mov cx, 1 / loop would be 5 bytes, while cmp/sbb/das/int 10h is 7 total)

Slower and larger than the looping version, but "simpler"

     aam   16       ; 83 cycles, 2 bytes   AH= quotient = leading digit   AL = remainder = trailing digit (low)
  ; normally never use div or aam by a power of 2, only for code-size over speed.
     cmp   al, 10        ; 2B 4c  set CF according to digit <= 9
     sbb   al, 69h       ; 2B 4c  read CF, set CF and conditionally set AF
     das                 ; 1B 4c   magic, which happens to work
     xchg  dx, ax        ; 1B 3c    stash low digit in DL

     mov   al, dh        ; 2B 2c    get leading digit
     cmp   al, 10        ; 2B 4c
     sbb   al, 69h       ; 2B 4c    most significant (high) nibble as ASCII hex
     das                 ; 1B 4c
     
     mov   ah, 0Eh       ; 2B 3c BIOS teletype output (of AL), advancing cursor
     int   10h           ; 2B ?
     mov   al, dl        ; 2B 2c  ; get low digit back from DL   xchg  ax, dx breaks AH callnum
     int   10h           ; 2B
; size=23B

I wonder if I could use int 21h / AH=2 with input from DL for one of the outputs? That would require changing AH, but could perhaps be done for the 2nd output. Unfortunately that DOS call steps on AL, setting it to the character printed. (e.g. from using this int 10h call).

How to convert a binary integer number to a hex string?

8088 performance is mostly about memory access (including code fetch)

Byte -> 2 ASCII hex digits

Slower and larger than the looping version, but "simpler"

Recommended topics

Hot tags