Fast signed 16-bit divide by 7 for 6502

Asked 18/7, 2018 at 21:30 Answered 28/12, 2018 at 17:28

Solved assembly division micro-optimization integer-division 6502

I am working on an assembly language program for a 6502 cpu, and am finding that I need a fast-as-possible divide-by-seven routine, in particular one which could take a 16-bit dividend.

I am familiar with the routines found here, but generalizing the divide-by-seven routine found there is quite complicated, and a cursory examination of the general algorithm (using integer division)

x/7 ~= (x + x/8 + x/64 ... )/8

indicates that to handle a 16-bit range, it would likely take over 100 cycles to complete because of the 6502's single accumulator register and the relative slowness of individual memory bit shifts on the 6502.

I thought that a lookup table might help, but on the 6502, I am of course restricted to lookup tables that are 256 bytes or fewer. To that end one could assume the existence of two 256-byte lookup tables, xdiv7 and xmod7, which, when using an unsigned one-byte value as an index into the table, can quickly obtain the result of a byte divided by 7 or modulo 7, respectively. I am not sure how I could utilize these to find the values for the full 16-bit range, however.

In parallel, I also need a modulo 7 algorithm, although ideally whatever solution that can be worked out with the division will produce a mod7 result as well. If additional precomputed tables are needed, I am amenable to adding those, as long as the total memory requirements for all tables does not exceed about 3k.

Although I ultimately need a signed division algorithm, an unsigned one would suffice, because I can generalize that to a signed routine as needed.

Any assistance would be greatly appreciated.

Tanga answered 18/7, 2018 at 21:30 Comment(7)

I'm unfamiliar with 6502. But at a quick glance, it looks like there's no multiplier either? If you can efficiently do a 16-bit x 16-bit -> 32-bit multiply, then a divide-by-7 is both easy and fast. – Accession 18/7, 2018 at 22:46

Other than that, the "add groups of 3 bits" algorithm that you've already mentioned is probably the fastest non-LUT approach. Not having an efficient shifter is going to be a huge handicap. – Accession 18/7, 2018 at 22:53

100 cycles doesn't actually sound that bad for a divide. An 8-bit divide was over 100 cycles on an 8086 CPU. – Janice 19/7, 2018 at 0:36

Yes, although I'm holding out hope that I can use table lookup and some addition. 6502 bit shifting has a cost per bit shifted, and itself is nearly twice as expensive as adding unless the data is already in the accumulator. – Tanga 19/7, 2018 at 0:41

Did you try multiplying by 022222 like this. I think you can also use a LUT for the shift 3 in the above approach because you don't have a barrel shifter. This way you can also do a modulo 7 in parallel because it's just the modulo 7 of the sum of digits in base 8 – Sydelle 19/7, 2018 at 1:43

You can do it with lookup tables, but it requires 5 tables of 256 bytes, and 2 tables of 13 bytes. The resulting assembly consists of 7 table lookups and 4 add instructions. If you're interested, I'll post C code that generates the lookup tables, and demonstrates the division by 7. – Extrauterine 19/7, 2018 at 5:26

Please do, and feel free to post it as an answer. Thank you. – Tanga 19/7, 2018 at 5:27

Note: As @Damien_The_Unbeliever pointed out in the comments, the upperHigh and lowerLow tables are identical. So they can be combined into a single table. However, that optimization will make the code harder to read, and the explanation harder to write, so combining the tables is left as an exercise for the reader.

The code below shows how to generate the quotient and remainder when dividing a 16-bit unsigned value by 7. The simplest way to explain the code (IMO) is with an example, so let's consider dividing 0xa732 by 7. The expected result is:

quotient = 0x17e2
remainder = 4

We start by considering the input as two 8-bit values, the upper byte and the lower byte. The upper byte is 0xa7 and the lower byte is 0x32.

We compute a quotient and remainder from the upper byte:

0xa700 / 7 = 0x17db
0xa700 % 7 = 3

So we need three tables:

upperHigh stores the high byte of the quotient: upperHigh[0xa7] = 0x17
upperLow stores the low byte of the quotient: upperLow[0xa7] = 0xdb
upperRem stores the remainder: upperRem[0xa7] = 3

And we compute the quotient and remainder from the lower byte:

0x32 / 7 = 0x07
0x32 % 7 = 1

So we need two tables:

lowerLow stores the low byte of the quotient: lowerLow[0x32] = 0x07
lowerRem stores the remainder: lowerRem[0x32] = 1

Now we need to assemble the final answer. The remainder is the sum of two remainders. Since each remainder is in the range [0,6] the sum is in the range [0,12]. So we can use two 13 byte lookups to convert the sum into a final remainder and a carry.

The low byte of the quotient is the sum of that carry and the values from the lowerLow and upperLow tables. Note that the sum may generate a carry into the high byte.

The high byte of the quotient is the sum of that carry and the value from the upperHigh table.

So, to complete the example:

remainder = 1 + 3 = 4              // simple add (no carry in)
lowResult = 0x07 + 0xdb = 0xe2     // add with carry from remainder
highResult = 0x17                  // add with carry from lowResult

The assembly code to implement this consists of 7 table lookups, an add-without-carry instruction and two add-with-carry instructions.

#include <stdio.h>
#include <stdint.h>

uint8_t upperHigh[256];  // index:(upper 8 bits of the number)  value:(high 8 bits of the quotient)
uint8_t upperLow[256];   // index:(upper 8 bits of the number)  value:(low  8 bits of the quotient)
uint8_t upperRem[256];   // index:(upper 8 bits of the number)  value:(remainder when dividing the upper bits by 7)
uint8_t lowerLow[256];   // index:(lower 8 bits of the number)  value:(low  8 bits of the quotient)
uint8_t lowerRem[256];   // index:(lower 8 bits of the number)  value:(remainder when dividing the lower bits by 7)
uint8_t carryRem[13]    = { 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1 };
uint8_t combinedRem[13] = { 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5 };

void populateLookupTables(void)
{
    for (uint16_t i = 0; i < 256; i++)
    {
        uint16_t upper = i << 8;
        upperHigh[i] = (upper / 7) >> 8;
        upperLow[i] = (upper / 7) & 0xff;
        upperRem[i] = upper % 7;

        uint16_t lower = i;
        lowerLow[i] = lower / 7;
        lowerRem[i] = lower % 7;
    }
}

void divideBy7(uint8_t upperValue, uint8_t lowerValue, uint8_t *highResult, uint8_t *lowResult, uint8_t *remainder)
{
    uint8_t temp = upperRem[upperValue] + lowerRem[lowerValue];
    *remainder = combinedRem[temp];
    *lowResult = upperLow[upperValue] + lowerLow[lowerValue] + carryRem[temp];
    uint8_t carry = (upperLow[upperValue] + lowerLow[lowerValue] + carryRem[temp]) >> 8;  // Note this is just the carry flag from the 'lowResult' calcaluation
    *highResult = upperHigh[upperValue] + carry;
}

int main(void)
{
    populateLookupTables();

    uint16_t n = 0;
    while (1)
    {
        uint8_t upper = n >> 8;
        uint8_t lower = n & 0xff;

        uint16_t quotient1  = n / 7;
        uint16_t remainder1 = n % 7;

        uint8_t high, low, rem;
        divideBy7(upper, lower, &high, &low, &rem);
        uint16_t quotient2 = (high << 8) | low;
        uint16_t remainder2 = rem;

        printf("n=%u q1=%u r1=%u q2=%u r2=%u", n, quotient1, remainder1, quotient2, remainder2);
        if (quotient1 != quotient2 || remainder1 != remainder2)
            printf(" **** failed ****");
        printf("\n");

        n++;
        if (n == 0)
            break;
    }
}

Extrauterine answered 19/7, 2018 at 5:36 Comment(4)

May be a silly question, but are the contents of upperHigh and lowerLow different? – Elis 19/7, 2018 at 6:27

@Elis Oops, they are indeed the same. I'll update the answer, thanks! – Extrauterine 19/7, 2018 at 6:33

If they are the same, I am confused how it adds any clarity to leave them as separate... the combined table could simply have a name that reflects its two different usages, rather than waste an additional 256 bytes that offers no performance benefit and negligible readability benefit that cannot be equivalently obtained simply by using descriptive naming. – Tanga 19/7, 2018 at 14:15

@markt1964: In asm, just put two labels in the same place, don't actually duplicate the memory. In C, a #define or static const uint8_t *upperHigh = lowerLow; – Robinrobina 20/7, 2018 at 1:28

At Unsigned Integer Division Routines for 8-bit division by 7:

;Divide by 7 (From December '84 Apple Assembly Line)
;15 bytes, 27 cycles
  sta  temp
  lsr
  lsr
  lsr
  adc  temp
  ror
  lsr
  lsr
  adc  temp
  ror
  lsr
  lsr

The estimate of about 100 cycles with shifts was pretty accurate: 104 cycles to the last ror, 106 cycles total not including rts, 112 cycles for the whole function.
NOTE: after assembly for C64 and using VICE emulator for C64 I found the algorithm fails, for example 65535 gives 9343 and the correct answer is 9362.

   ; for 16 bit division  by 7
   ; input:
  ;   register A is low byte
  ;   register X is high byte
  ; output 
  ;   register A is low byte
  ;   register X is high byte
  ;
  ; memory on page zero
  ; temp     is on page zero, 2 bytes
  ; aHigh    is on page zero, 1 byte
  --
  sta temp
  stx temp+1
  stx aHigh
  --
  lsr aHigh
  ror a
  lsr aHigh
  ror a
  lsr aHigh
  ror a
  ---
  adc temp
  tax
  lda aHigh
  adc temp+1
  sta aHigh
  txa
  --
  ror aHigh
  ror a
  lsr aHigh
  ror a
  lsr aHigh
  ror a
  --
  adc temp
  tax
  lda aHigh
  adc temp+1
  sta aHigh
  txa
  --
  ror aHigh
  ror a
  lsr aHigh
  ror a
  lsr aHigh
  ror a     -- 104 cycles
  ;-------
  ldx aHigh  ; -- 106
  rts        ; -- 112 cycles

Melanism answered 19/7, 2018 at 22:14 Comment(8)

Your code calculates ((x / 8 + x) / 8 + x) / 8 where the OP's formula reads (x + x / 8 + x / 64) / 8. Which one is correct for dividing by 7? – Queer 22/7, 2018 at 19:13

I get it that you generalized the algorithm for dividing a byte by 7, but shouldn't this entail that apart from x/8 and x/64 also x/512, x/4096 and x/32768 are required? This would bring the cycles up to 250! – Queer 22/7, 2018 at 19:19

I did visit that link before, and indeed yours is the correct translation of the code found there. Still not what the OP wrote. Perhaps you need to make this clear in your answer. – Queer 22/7, 2018 at 19:23

Sep, I don't understand how the original 8-bit division works. I only extend for 16-bits values. – Melanism 22/7, 2018 at 19:27

Alvalongo, use @SepRoland to notify someone when you reply to their comment. (You get notified because we're commenting under your answer.) – Robinrobina 22/7, 2018 at 19:29

Basically what is needed is repeating the second step (the one that does x/64 plus the addition) 4 times. – Queer 22/7, 2018 at 19:31

Bad news, after compile for C64 and using VICE emulator for C64 I found the algorithm fails, for 65535 gives 9343 and the correct answer is 9362. – Melanism 22/7, 2018 at 19:31

Not surprising if you just did the same steps as 8-bit division. You need more steps for wider division (to get 16 bits of precision), as well as each step being wider. – Robinrobina 23/7, 2018 at 16:40

A different way to do this would be to convert the division into a multiplication.

To work out the multiplication factor, we are interested in taking the reciprocal. Essentially we do:

d = n*(1/7)

To make things more accurate we multiply up by by a convenient power of 2. 2^16 works well:

d = floor(n*floor(65536/7)/65536)

The multiplication factor is: floor(65536/7) which is 9362. The size of the result will be:

ceiling(log2(65535*9362)) = 30 bits (4 bytes rounded up)

We can then discard the lower two bytes to divide by 65536, or simply just use the upper 2 bytes for the end result.

To work out the actual rotates and adds we need, we examine the binary representation of the factor 9362:

10010010010010

Note the repetition of the bit pattern. So an efficient scheme would be to calculate:

((n*9*256/4 + n*9)*8 + n)*2 = 9362*n

Calculating n*9 only needs ceiling(log2(65535*9)) = 20 bits (3 bytes).

In pseudo assembly this is:

LDA number          ; lo byte
STA multiply_nine
LDA number+1        ; high byte
STA multiply_nine+1
LDA #0              
STA multiply_nine+2 ; 3 byte result

ASL multiply_nine   ; multiply by 2
ROL multiply_nine+1
ROL mulitply_nine+2
ASL multiply_nine   ; multiply by 2 (4)
ROL multiply_nine+1
ROL mulitply_nine+2
ASL multiply_nine   ; multiply by 2 (8)
ROL multiply_nine+1
ROL mulitply_nine+2

CLC                 ; not really needed as result is only 20 bits, carry always zero
LDA multiply_nine
ADC number
STA multiply_nine
LDA multiply_nine+1
ADC number+1
STA multiply_nine+1
LDA multiply_nine+2
ADC #0
STA multiply_nine+2 ; n*9

The rest of the exercise I leave to the OP. Note it's not necessary to multiply by 256, as this is just a whole byte shift up.

Conias answered 28/12, 2018 at 17:28 Comment(2)

This is similar to the same fixed-point multiplicative inverse that compilers use on machines with fast multiply instructions, for exact division by a constant over all possible dividends, I think. But for that you usually have to right-shift the high half of the full multiply result. Why does GCC use multiplication by a strange number in implementing integer division?. Yeah, and for tricky divisors like 7, there are some more steps on x86: godbolt.org/z/e5ut7S. Is this fixed-point version exact for all 16-bit dividends? – Robinrobina 30/12, 2018 at 6:44

A quick check in Excel shows that it isn't exact. But at worst will only be 1 out, this may or may not be ok. Accuracy can always be increased however at the expense of code size. I expect that using say 2^24 scaling factor, would eliminate the discrepancy. Code size could be decreased by using loops for repetitive shifts, but this will take more cycles. – Conias 30/12, 2018 at 11:7

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags