Increasing Efficiency of binary -> gray code for 8086
Asked Answered
R

1

1

I am a beginner in assembly and This is a code that I designed to convert from binary and to gray, and prints the resulting bit-pattern in hex.

      mov      al, a               
      mov      bl, al
      shr      bl, 1
      xor      al, bl                         

Although the program is working, I want to learn about other simpler methods to increase the efficiency, I tried many other ways but it is affecting the output.

Rendering answered 5/5, 2021 at 11:10 Comment(8)
Are you limited to only 8086 instructions, or can you use 186 instructions like rol r/m, imm8 (e.g. shl ax, 4 or shr al, 4 to get the high half into the bottom of AH or AL respectively)? Or 386 instructions like movzx ax, byte ptr [a]. And are you purely optimizing for code-size, or also speed, in case it matters? (And speed on 8086 only, or speed on 386 for example, or also modern x86 like Skylake?)Alonaalone
only 8086 instructions till nowRendering
Till now? So are you interested in 186 or 386 instructions? I assume you've looked at How to convert a binary integer number to a hex string?, specifically the link to Little Endian Number to String Conversion for a trick abusing DAS on codegolf.SE?Alonaalone
I have already looked at it, that is what helped me write this code, but now I was wondering can I reduce few instructions to make it faster on 8086?Rendering
Ok, faster on 8086 specifically, which usually means smaller machine-code size, but you're not interested in making it slower if that would save more size. But using more instructions if it saves code-size could be a win. (Especially on 8088, half the bus width). You should edit your question to say exactly what parameters you're optimizing for; it makes a difference. Probably Little Endian Number to String Conversion is useful, then, if DAS and SBB aren't too slow on 8086.Alonaalone
Your last edit made no sense so I rolled it back. (Removed most of the printing code, leaving some useless instructions, with the text still claiming the program printed output and worked). The binary->gray part is so simple that there's no room for optimization, assuming you start from the point where you have the value in a register.Alonaalone
Oh great, glad I spent multiple hours optimizing the hex-print part you included but didn't care about. And you even removed the part that defines a as a byte in memory, so even that part of my answer is orphaned now. My answer does already address the x ^ (x>>1) part, though, as much as is possible; there isn't any more to say now that you removed the interesting part.Alonaalone
If you want to exhaustively look for other instruction sequences that could give the same result, use a superoptimizer. en.wikipedia.org/wiki/Superoptimization. I don't expect there to be anything better than what you're doing, but unlike looking at compiler output (godbolt.org/z/crWjhrzhn) you know it considered all possibilities.Alonaalone
A
2

(This answer was written based on the first version of the question, where there was some not-already-optimal hex-print code, and it was complete source for a .exe program. The updates to the question have removed the only parts that had room to be optimized except for ILP which is irrelevant on 8086, so I don't intend to remove those parts of the answer.)

Code size optimizations (which correlate with speed on 8086 and especially 8088, see this retrocomputing answer):

  • bin2Gray part: no change, unless you count reloading from mem or making a constant. Just re-ordering instructions for ILP on Pentium and later. Or maybe a table for xlat.
  • byte->hex digits: 21 bytes, down from 32 (and fewer bytes of code fetch)
  • exit: 1 byte (ret) down from 4 (mov ah / int). Works for at least .com executables, which are also smaller.

I probably should have counted code size in total bytes that need to get fetched (i.e. bytes of instructions executed), rather than static code size, although that's also useful to optimize for.

Dropping the looping from the 21-byte bin2hex part would cost a couple bytes of static code size, but reduce dynamic byte count by about 5, IIRC. And avoid any prefetch buffer discarding on taken branch, except for the int 10h of course.


int 20h can also exit (without needing any AH setup), but only from a .com executable. The interesting part to me is computing the ASCII digits in the desired register with compact code, but if you want a tiny overall program, a .com is the way to go. That also avoids setup of DS. (Although you wouldn't need to set up DS if you make a and EQU or = constant.)

Not attempted: taking advantage of initial register values, which apparently are reliable across some DOS versions. If you're pretending like you're writing a block that might be useful as part of a larger program, that's not viable.


Your program basically has two separate parts; calculating a Gray code, and bin->hex for a 1-byte value in a register. Extracting the nibbles separately doesn't usefully optimize backwards into the Gray-code calculation, so I think we can just keep those fully separated.


There are multiple different ways to make a Gray code (only one bit flips between consecutive values). AFAIK, x ^ (x>>1) is the cheapest to compute from binary, but it's not impossible there could be something you could do with only 2 instructions, given input in a register.

Also related: Gray code algorithm (32 bit or less) for gray->binary points out that the standard x ^ (x>>1) is multiplication in GF(2k). So on very recent CPUs with Galois-Field instructions, you can do 16 bytes at a time with gf2p8affineqb, I think. (gf2p8mulb uses a fixed polynomial which I think isn't the one we want for this.)


8088 performance is mostly about memory access (including code fetch)

https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html shows instruction timings, but those timings are only execution, not code-fetch. 8088 has a 4-byte prefetch buffer (and only an 8-bit data bus), 6-byte on 8086 with its 16-bit bus. Supercat suggested in an answer there:

On the original 8088 processor, the easiest way to estimate execution speed is generally to ignore cycle counts and instead count memory accesses, including instruction fetches, and multiply by four.

I think it's about the same for 8086, except each access can be a whole 2-byte word. So straight-line code (no branches) can get fetched 2 bytes at a time.

For simplicity, I just annotated asm source with instruction size and cycle counts from the table without trying to model how the prefetch buffer would behave.

xlat (like al = ds:[bx+al]) is only 1 byte, and can be worth using if you don't mind having a 256-byte table. It takes 11 bytes to execute, but that includes the data access it makes. Not counting code-fetch, mov bl,al / shr al,1 / xor al,bl is 2+2+3 cycles, but 3 words of code size will cost 12 cycles to fetch. xlat takes almost that long, but when it finishes the prefetch buffer will have had some time to fetch later instructions, so I think it's even more of a win.

Still, it does require that table to come from somewhere, either from disk when your executable was loaded, or you have to pre-compute it. And you need to get a pointer into BX, so it may only be a win if you can do this in a loop.

But if you're using a table, you can combine both parts of the problem and look up both ASCII hex digit characters for the gray-code for a given binary, e.g. with mov dx, [bx + si] with table pointer in SI, binary byte in BL, and BH=0. (DX sets you up for outputting DL with a DOS call.) This of course would require your table to be 256 words (512 bytes). Having a tiny executable may be more valuable than saving a few cycles here; the actual I/O to the screen or a file is probably slow enough for it to not matter much. If you're doing this for multiple bytes, though, copying pairs of ASCII bytes into a buffer can be good.

There's one optimization that will help on more modern CPUs (starting with Pentium) that can run more than 1 instruction in parallel: copy the register, then shift the original so that can happen in the same cycle as the copy.

; optimized for Instruction-level Parallelism
;; input: AL    output: AL = bin_to_gray(AL)
;; clobbers: DL
   mov  dl, al         ; 2B  2 cycles (not counting code-fetch bottlenecks)
   shr  al, 1          ; 2B  2c
   xor  al, dl         ; 2B  3c

(For more about modern CPUs, see https://agner.org/optimize/. And also Can x86's MOV really be "free"? Why can't I reproduce this at all? - mov-elimination doesn't work on byte or word registers because that's merging into the low part of EDX. So even on CPUs with mov-elimination in general, it can't apply here so this optimization saves latency.)

I'm pretty sure there's no further room for improvement in bin -> gray. Even modern x86 doesn't have a copy-and-right-shift (except with the count in another register, BMI2 shrx, or for SIMD registers with AVX, but only for word/dword/qword element sizes). Also no right-shift-and-xor, so there's no avoiding a mov, and the shr and xor are obviously also necessary. XOR is add-without-carry, but I don't think that helps. Unless you had carryless multiply (pclmulqdq) and a multiplier constant to get two copies of the input at the right offsets from each other into the high half of a multiply result, you're going to need to do those operations separately. Or with Galois-Field New Instructions (GFNI): What are the AVX-512 Galois-field-related instructions for?

Still, if you want to exhaustively check, https://en.wikipedia.org/wiki/Superoptimization - ask a superoptimizer to look for sequences that produce the same AL result as the mov/shr/xor sequence.


In real use-cases, you'd normally want code that took data in a register, because that's how you should pass data to functions. After mov al, a, that's what your code is doing.

But if it was a global in memory, you could save a byte of code size by loading it twice instead of copying a register with mov, at the expense of speed. Or even better, make it an assemble-time constant. (Although if you do that, the next step is mov al, a ^ (a>>1) to do the calculation at assemble time.)

; a equ 0ACh         ; makes the following instructions 2 bytes each
;;; otherwise, with a being static storage, loading from memory twice sucks
    mov  al, a
    shr  al, 1    ; 2B, 2 cycles
    xor  al, a    ; reg,imm: 2B, 4 cycles on 8088.    reg,mem: 3B, 13+6 cycles

Byte -> 2 ASCII hex digits

This is the more interesting part.

Looping for only 2 iterations is sometimes not worth it, especially when you can save some work if you do separate things to each half. (e.g. low nibble is x & 0xf, high is x >> 4. Using rol/mov/and is not optimal.)

Tricks:

  • Prefer insn al, imm - x86 has short-form special cases for immediate operands with AL. (Also AX,imm16).

  • Wanting to do stuff in AL means it's more efficient to print with BIOS int 10h / AH=0Eh teletype output which takes its input in AL, and doesn't destroy any other registers. I think BIOS output will ignore DOS I/O redirection like foo > outfile.txt and always print to the screen.

  • There's an evil hack that abuses DAS to turn a 0..15 integer into an ASCII hex digit '0'..'9' or 'A'..'F' without branching. On 8086 (unlike modern x86) DAS is just as fast as typical integer instruction. See this codegolf.SE answer for a breakdown on exactly why it works; it's highly non-obvious, but it avoid branching so it's actually a big speedup on 8086.

  • BIOS / DOS calls generally don't modify AH, so setting it can be done outside the loop.

  • Then for code-size, instead of just unrolling, use the cl=4 as a loop counter to loop back and re-run some of the earlier code once (not including the shift). sub cl, 2 / jnz would work, but using the parity flag is a way to use dec cx (1B) / jpe to jump backwards once, then fall through the next time.

  • DOS programs (or at least .com programs) have SP pointing at the address of some code that exits cleanly. So you can exit via ret.

(I didn't look at improving your loop while keeping the overall strategy. Using AL for as many instructions as possible is prob. worth it, but running rol twice instead of shifting once costs a lot of cycles on 8086: 8 + 4*n for shift-by-CL.)

;; input: byte in AL.   output: print 2 ASCII hex digits with BIOS int 10h
;; clobbers: CX, DX
hexprint_byte:
     mov   ah, 0Eh                       ; BIOS teletype call #
;     push  ax            ; 1B 15c
     mov   dx, ax        ; 2B 2c         ; save number, and AH=call number
     mov   cl, 4         ; 2B 4c
     shr   al, cl        ; 2B 8+4*4 cycles   isolate the high nibble
.loop:
     cmp   al, 10        ; 2B 4c  set CF according to digit <= 9
     sbb   al, 69h       ; 2B 4c  read CF, set CF and conditionally set AF
     das                 ; 1B 4c   magic, which happens to work
     int   10h           ; 2B        BIOS teletype output (AL), no return value
;     pop   ax            ; 1B 12c    ; would do one extra pop if you used this instead of mov/xchg, so you'd need jmp ax instead of ret.  But AND destroys AX
     xchg  ax, dx        ; 1B 3c     ; retrieve the original again (with AH=0Eh call number)
     and   al, 0Fh       ; 2B 4c     ; isolate the low nibble this time

     dec   cx            ; 1B 3c     ; PF is set from the low byte only, CH garbage isn't a problem.  
     jpe   .loop         ; 2B 4c-not-taken, 16c-taken
               ;  4-1 = 3, (0b11) which has even parity
               ; so JPE is taken the first time, falls through the 2nd 

;; size = 21 bytes

Then you can exit the program with ret, or with int 20h.

This is NASM syntax; if your assembler doesn't like .loop then change it to something else. (NASM doesn't allow 2: as a local label so I had to pick different names anyway.) I tested this single-stepping on Linux to make sure the loop branch was taken once, and that I got the right values in AH/AL when int 10h was reached. (I replaced it with a NOP since I actually built this into a 32-bit static executable so I could easily single-step it in GDB, without messing around with an obsolete 16-bit dev setup. Byte counts are from assembling as 16-bit, of course.)

For speed, it would cost only a few more bytes to just duplicate the cmp/sbb/das/int 10h, saving the dec/jpe. (Like 7 bytes instead of 3 for the dec/jpe). The xchg / AND after the first print are necessary either way.

Taken branches cost 16 cycles, and that would avoid a 2nd redundant/useless execution of xchg / and (3 bytes / 7 cycles), and of the loop overhead.


You asked for small (and fast on 8086), so that's what I did. This sacrifices everything else, very much including readability, to save bytes. But that's the fun of code-golfing in assembly!

Unfortunately it's also definitely not simpler, like you asked for in the title. Simpler might use a lookup table, perhaps with xlatb. That might also be faster on 8086, especially if you want to avoid the DAS hack.


Another trick that might help for code size (but very bad for performance) is aam 16 to set AH= quotient = leading digit, AL = remainder = trailing digit (low). (Note that's opposite of div bl) Displaying Time in Assembly shows an example of using it with BIOS int 10h output for a 2-digit decimal number. (Normally AAM is used with an immediate 10, and apparently NEC V20 CPUs ignore the immediate and always divide by 10. Intel CPUs just do immediate division of AL). On 8088/8086, AAM takes 83 cycles, similar to div, which is basically what it does. Using HW division with a power of 2 is generally horrible.

A version using AAM 16 came in at 23 bytes, not using any looping (I didn't have any constants in registers to exploit, so mov cx, 1 / loop would be 5 bytes, while cmp/sbb/das/int 10h is 7 total)

Slower and larger than the looping version, but "simpler"

     aam   16       ; 83 cycles, 2 bytes   AH= quotient = leading digit   AL = remainder = trailing digit (low)
  ; normally never use div or aam by a power of 2, only for code-size over speed.
     cmp   al, 10        ; 2B 4c  set CF according to digit <= 9
     sbb   al, 69h       ; 2B 4c  read CF, set CF and conditionally set AF
     das                 ; 1B 4c   magic, which happens to work
     xchg  dx, ax        ; 1B 3c    stash low digit in DL

     mov   al, dh        ; 2B 2c    get leading digit
     cmp   al, 10        ; 2B 4c
     sbb   al, 69h       ; 2B 4c    most significant (high) nibble as ASCII hex
     das                 ; 1B 4c
     
     mov   ah, 0Eh       ; 2B 3c BIOS teletype output (of AL), advancing cursor
     int   10h           ; 2B ?
     mov   al, dl        ; 2B 2c  ; get low digit back from DL   xchg  ax, dx breaks AH callnum
     int   10h           ; 2B
; size=23B

I wonder if I could use int 21h / AH=2 with input from DL for one of the outputs? That would require changing AH, but could perhaps be done for the 2nd output. Unfortunately that DOS call steps on AL, setting it to the character printed. (e.g. from using this int 10h call).

Related:

Alonaalone answered 5/5, 2021 at 15:7 Comment(12)
Using int 20h to terminate a DOS process may be equivalent to calling int 21h with ax = 4C00h, but the interrupt list states that cs should equal the PSP segment: ctyme.com/intr/rb-2471.htm (.COM files generally start with cs == PSP but can do far branches to change the cs. MZ .EXE files can execute code with cs == PSP but often do not meet this requirement.)Foreplay
The free MS-DOS 2.00 sources do show that int 20h is the same as int 21h service 00h. And both end up in a handler that states cs should point to the PSP. I haven't found the part where function 4Ch is handled yet.Foreplay
@ecm: Thanks. I didn't end up posting source for a complete program. If I had, I was going to make it a .com to simplify away the setup of ds.Alonaalone
Found it, the $EXIT procedure. On line 123 this sets the stack frame's user cs to equal the current process. So this appears to be needed by the common code that this branches to.Foreplay
what about xlat?Grigri
@AkiSuihkonen: That's better for code size, but according to www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html it takes 11 cycles on 8088. That's more than 2 + 2 + 3 for mov reg,reg/shr/xor reg,reg. Like I said, I didn't try to fully model code prefetch, and that might depend on surrounding code. Still worth mentioning I guess. Especially if tuning for 8086, not 8088, (so code-fetch happens 2 bytes at a time) 3 fast register instructions might be better.Alonaalone
"Even modern x86 doesn't have a copy-and-right-shift". Well, there's shrx. But it's limited to 32- or 64-bit operands so maybe not that helpful here. For actual modern processors I presume one should look toward SIMD anyway.Ales
@NateEldredge: It needs the count in a register, so that only helps if you were doing this in a loop. In which case yeah, you'd want SIMD, even though that means emulating a byte right-shift with a wider shift and masking away the bit that crossed the word boundary.Alonaalone
@PeterCordes: Huh, x86 has no SIMD packed byte shift? That's interesting. ARM64 supports it: ushr v1.16b, v0.16b, #1.Ales
@NateEldredge: That's right, just psrlw/d/q :( Of course left shift by 1 can be done with paddb xmm0,xmm0, but not even AVX-512 has a byte-element right shift. (despite adding variable-count word shifts, and fixed/variable rotates). If it wasn't randomly missing some useful element-size for some operation, it just wouldn't be x86 anymore >.<Alonaalone
Too bad the question no longer matches the answer! Nevertheless +1 for the many interesting tips and tricks.Miner
@AkiSuihkonen: I did some more research, and that table of instruction timings didn't include code fetch bottlenecks at all. So xlat is probably a minor win, especially getting later instructions prefetched. I rewrote that section. (And added some stuff about Galois-Field new instructions; I think they're usable here on CPUs that have them, although I don't grok the math well enough to write an example)Alonaalone

© 2022 - 2024 — McMap. All rights reserved.