How does one do integer (signed or unsigned) division on ARM?

Asked 1/12, 2011 at 20:45 Answered 14/6, 2015 at 17:38

Solved assembly arm integer-division instruction-set cortex-a8

I'm working on Cortex-A8 and Cortex-A9 in particular. I know that some architectures don't come with integer division, but what is the best way to do it other than converting to float, dividing, and converting back to integer? Or is that indeed the best solution?

Cheers! = )

Roos answered 1/12, 2011 at 20:45 Comment(7)

Of course the compiler will support integer division in software mode even if not present in the hardware. I doubt those high spec chips do NOT have integer division. I think the ATMega (like Arduino) lacks it. – Quadrivial 1/12, 2011 at 20:48

An Assembly instruction for integer division on ARM does not exist. – Roos 1/12, 2011 at 20:49

Either convert to float or do a manual divide with an unrolled 3 opcode pattern. – Pikestaff 1/12, 2011 at 20:52

@Phonon: Does it not accept SDIV or UDIV? Cortex-A8 is ARMv7, but from infocenter.arm.com/help/topic/com.arm.doc.qrc0001m/… it looks like only some processors support them. – Quadrivial 1/12, 2011 at 21:2

See stackoverflow.com/questions/938038/… – Starlight 1/12, 2011 at 21:16

ARMv7-R, ARMv7VE, otherwise optional in ARMv7-A is what is listed for SDIV and UDIV. You have to look at the options purchased for that core and/or look at the TRM for the specific core you are using. Or just encode the instruction, execute it and see if you get an undefined instruction fault... – Benumb 1/12, 2011 at 21:35

ARMv7-A does not support SDIV or UDIV (unless it's a vendor-designed core extending it). ARMv7-R does. This is a peculiar oversight by ARM. – Extremism 10/12, 2011 at 4:1

The compiler normally includes a divide routine in its runtime library (libgcc, in the case of gcc). I have extracted those routines from gcc and use them directly:

https://github.com/dwelch67/stm32vld/ then stm32f4d/adventure/gcclib

Going to float and back is probably not the best solution. You can try it and see how fast it is. This is a multiply, but it could just as easily be made a divide:

https://github.com/dwelch67/stm32vld/ then stm32f4d/float01/vectors.s

I didn't time it, though, to see how fast or slow it is. Understood that I am using a Cortex-M above and you are talking about a Cortex-A: these are different ends of the spectrum, but the float instructions are similar and the gcc library code is similar. For the Cortex-M I have to build for Thumb, but you can just as easily build for ARM. Actually, with gcc it should all just work automagically; you should not need to do it the way I did it in the adventure game above. The same goes for other compilers.

Benumb answered 1/12, 2011 at 21:16 Comment(0)

Division by a constant value is done quickly by doing a 64bit-multiply and shift-right, for example, like this:

LDR     R3, =0xA151C331
UMULL   R3, R2, R1, R3
MOV     R0, R2,LSR#10

Here R1 is divided by 1625. The calculation works like this: the 64-bit register pair R2:R3 = R1*0xA151C331, and the result is the upper 32 bits (R2) shifted right by 10:

R1*0xA151C331/2^(32+10) = R1*0.00061538461545751488 = R1/1624.99999980

You can calculate your own constants from this formula:

x / N == (x*A) / 2^(32+n)  -->  A = 2^(32+n) / N

Select the largest n for which A < 2^32, and round A to an integer (0xA151C331 above is 2^42/1625 rounded up).
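As a rough way to check a constant before hard-coding it, here is a small Python sketch of the formula above. The function names are made up for illustration, and it assumes A is rounded up, which is what the 1625 example does. The exactness test is only a sufficient condition: with e = A*N - 2^(32+n), the result matches x / N for every 32-bit x as long as e*x stays below 2^(32+n).

```python
WORD_BITS = 32


def ceil_div(a, b):
    """Integer division rounded up, for positive integers."""
    return -(-a // b)


def magic_constant(divisor):
    """Return (A, n): A = 2**(32+n) / divisor rounded up, largest n with A < 2**32.

    Assumes 2 <= divisor < 2**32.
    """
    n = 0
    while ceil_div(1 << (WORD_BITS + n + 1), divisor) < (1 << WORD_BITS):
        n += 1
    return ceil_div(1 << (WORD_BITS + n), divisor), n


def divide_by_constant(x, a, n):
    """Model of UMULL followed by LSR #n: high word of x*a, shifted right by n."""
    return ((x * a) >> WORD_BITS) >> n


def is_exact_for_all_inputs(divisor):
    """Sufficient check that the rounded-up constant never overshoots."""
    a, n = magic_constant(divisor)
    error = a * divisor - (1 << (WORD_BITS + n))
    return error * ((1 << WORD_BITS) - 1) < (1 << (WORD_BITS + n))


if __name__ == "__main__":
    a, n = magic_constant(1625)
    print(hex(a), n)  # prints: 0xa151c331 10
    for divisor in (3, 5, 7, 1625):
        print(divisor, is_exact_for_all_inputs(divisor))
```

For N = 7 the check fails, which is the case described in the comment below: the rounded-up constant gives a result one too large for 4294967291 / 7.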

Mulligan answered 16/10, 2012 at 7:50 Comment(1)

There are rounding errors here. For unsigned 32-bit division by N = 7, we have n = 2 and A = 2454267026.28... If we round down the value of A we pick, then it gives a result too small for "4294967292 / 7". If we round it up, then it gives a result too large for "4294967291 / 7". This can only occur if the fractional part of the exact value of A is smaller than 0.5, so it works fine for about half the values of N (like 3, 5 or 1625). – Khiva 23/11, 2014 at 17:59

Some copy-pasta from elsewhere for an integer divide. Basically, it takes 3 instructions per bit. It is from this website, though I've seen it in many other places as well. This site also has a nice version, which may be faster in general.


@ Entry  r0: numerator (lo) must be signed positive
@        r2: denominator (den) must be non-zero and signed negative
idiv:
        lo .req r0; hi .req r1; den .req r2
        mov hi, #0 @ hi = 0
        adds lo, lo, lo
        .rept 32 @ repeat 32 times
          adcs hi, den, hi, lsl #1
          subcc hi, hi, den
          adcs lo, lo, lo
        .endr
        mov pc, lr @ return
@ Exit   r0: quotient (lo)
@        r1: remainder (hi)
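To see why three instructions per bit are enough, the loop above can be modelled in Python with explicit 32-bit wraparound and a carry flag. This is only a sketch for checking the logic, not a replacement for the assembly; the function name is made up, and it takes the denominator as a positive number and negates it internally, where the assembly expects the caller to pass it already negated.

```python
MASK32 = 0xFFFFFFFF


def idiv_model(numerator, denominator):
    """Mirror the idiv loop.

    Assumes 0 <= numerator < 2**31 and 1 <= denominator <= 2**31.
    Returns (quotient, remainder).
    """
    den = (-denominator) & MASK32          # r2 holds the negated denominator
    lo, hi = numerator, 0                  # mov hi, #0
    carry = lo >> 31                       # adds lo, lo, lo
    lo = (lo << 1) & MASK32
    for _ in range(32):
        # adcs hi, den, hi, lsl #1: trial subtraction, carry set if it fits
        total = den + ((hi << 1) & MASK32) + carry
        carry, hi = total >> 32, total & MASK32
        if not carry:
            # subcc hi, hi, den: undo the subtraction when it did not fit
            hi = (hi - den) & MASK32
        # adcs lo, lo, lo: shift the quotient bit in, next numerator bit out
        total = lo + lo + carry
        carry, lo = total >> 32, total & MASK32
    return lo, hi


if __name__ == "__main__":
    print(idiv_model(7, 2))  # prints: (3, 1)
```

The carry does double duty: coming out of `adcs hi` it is the new quotient bit, and coming out of `adcs lo` it is the next numerator bit, which is why no separate compare is needed.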

Pikestaff answered 1/12, 2011 at 22:45 Comment(2)

This is 3 instructions per bit but not 3 cycles per bit. All of the instructions in each step are immediately dependent on the flag setting of the previous, which means a result delay of 3-4 cycles depending on the core. This will likely take 9-12 cycles per step, for a total of ~360 cycles. – Extremism 10/12, 2011 at 4:3

Sounds about right. Inverse multiply fixed point is always a better option if you can swing it. – Pikestaff 10/12, 2011 at 17:2

I wrote my own routine to perform an unsigned division, as I could not find an unsigned version on the web. I needed to divide a 64-bit value by a 32-bit value to get a 32-bit result.

The inner loop is not as efficient as the signed solution provided above, but this does support unsigned arithmetic. This routine performs a 32 bit division if the high part of the numerator (hi) is smaller than the denominator (den), otherwise a full 64 bit division is performed (hi:lo/den). The result is in lo.

  cmp     hi, den                   // if hi < den do 32 bits, else 64 bits
  bhs     do64bits                  // unsigned compare: branch if hi >= den
  REPT    32
    adds    lo, lo, lo              // shift numerator through carry
    adcs    hi, hi, hi
    subscc  work, hi, den           // if carry not set, compare        
    subcs   hi, hi, den             // if carry set, subtract
    addcs   lo, lo, #1              // if carry set, add 1 to quotient
  ENDR

  mov     r0, lo                    // move result into R0
  mov     pc, lr                    // return

do64bits:
  mov     top, #0
  REPT    64
    adds    lo, lo, lo              // shift numerator through carry
    adcs    hi, hi, hi
    adcs    top, top, top
    subscc  work, top, den          // if carry not set, compare        
    subcs   top, top, den           // if carry set, subtract
    addcs   lo, lo, #1              // if carry set, add 1 to quotient
  ENDR
  mov     r0, lo                    // move result into R0
  mov     pc, lr                    // return

Extra checking for boundary conditions and power of 2 can be added. Full details can be found at http://www.idwiz.co.za/Tips%20and%20Tricks/Divide.htm
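The 32-step path can be checked against Python's own divmod with a small model like the one below. It is a sketch under the stated precondition (hi < den, so the quotient fits in 32 bits); the function name is hypothetical. The point it illustrates is the carry handling: when a bit is shifted out of hi, the partial remainder is already larger than any 32-bit denominator, so the subtraction happens without a compare.

```python
MASK32 = 0xFFFFFFFF


def udiv64_32(hi, lo, den):
    """Model of the 32-iteration loop for unsigned hi:lo / den.

    Requires 0 <= hi < den <= 2**32 - 1. Returns (quotient, remainder).
    """
    if not (0 <= hi < den <= MASK32 and 0 <= lo <= MASK32):
        raise ValueError("need hi < den and 32-bit operands")
    for _ in range(32):
        # adds lo, lo, lo / adcs hi, hi, hi: shift hi:lo left by one bit
        carry = hi >> 31
        hi = ((hi << 1) | (lo >> 31)) & MASK32
        lo = (lo << 1) & MASK32
        # subscc / subcs / addcs: subtract if a bit was shifted out of hi,
        # or if the 32-bit compare shows hi >= den
        if carry or hi >= den:
            hi = (hi - den) & MASK32
            lo += 1  # bit 0 of lo is zero after the shift, so this sets it
    return lo, hi


if __name__ == "__main__":
    print(udiv64_32(0, 100, 7))  # prints: (14, 2)
```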

Horseweed answered 24/8, 2012 at 9:48 Comment(0)

I wrote the following functions for the ARM GNU assembler. If you don't have a CPU with udiv/sdiv machine support, just cut out the first few lines up to the "0:" label in either function.

.arm
.cpu    cortex-a7
.syntax unified

.type   udiv,%function
.globl  udiv
udiv:   tst     r1,r1
        bne     0f
        udiv    r3,r0,r2
        mls     r1,r2,r3,r0
        mov     r0,r3
        bx      lr
0:      cmp     r1,r2
        movhs   r1,r2
        bxhs    lr
        mvn     r3,0
1:      adds    r0,r0
        adcs    r1,r1
        cmpcc   r1,r2
        subcs   r1,r2
        orrcs   r0,1
        lsls    r3,1
        bne     1b
        bx      lr
.size   udiv,.-udiv

.type   sdiv,%function
.globl  sdiv
sdiv:   teq     r1,r0,ASR 31
        bne     0f
        sdiv    r3,r0,r2
        mls     r1,r2,r3,r0
        mov     r0,r3
        bx      lr
0:      mov     r3,2
        adds    r0,r0
        and     r3,r3,r1,LSR 30
        adcs    r1,r1
        orr     r3,r3,r2,LSR 31
        movvs   r1,r2
        ldrvc   pc,[pc,r3,LSL 2]
        bx      lr
        .int    1f
        .int    3f
        .int    5f
        .int    11f
1:      cmp     r1,r2
        movge   r1,r2
        bxge    lr
        mvn     r3,1
2:      adds    r0,r0
        adcs    r1,r1
        cmpvc   r1,r2
        subge   r1,r2
        orrge   r0,1
        lsls    r3,1
        bne     2b
        bx      lr
3:      cmn     r1,r2
        movge   r1,r2
        bxge    lr
        mvn     r3,1
4:      adds    r0,r0
        adcs    r1,r1
        cmnvc   r1,r2
        addge   r1,r2
        orrge   r0,1
        lsls    r3,1
        bne     4b
        rsb     r0,0
        bx      lr
5:      cmn     r1,r2
        blt     6f
        tsteq   r0,r0
        bne     7f
6:      mov     r1,r2
        bx      lr
7:      mvn     r3,1
8:      adds    r0,r0
        adcs    r1,r1
        cmnvc   r1,r2
        blt     9f
        tsteq   r0,r3
        bne     10f
9:      add     r1,r2
        orr     r0,1
10:     lsls    r3,1
        bne     8b
        rsb     r0,0
        bx      lr
11:     cmp     r1,r2
        blt     12f
        tsteq   r0,r0
        bne     13f
12:     mov     r1,r2
        bx      lr
13:     mvn     r3,1
14:     adds    r0,r0
        adcs    r1,r1
        cmpvc   r1,r2
        blt     15f
        tsteq   r0,r3
        bne     16f
15:     sub     r1,r2
        orr     r0,1
16:     lsls    r3,1
        bne     14b
        bx      lr

There are two functions, udiv for unsigned integer division and sdiv for signed integer division. They both expect a 64-bit dividend (either signed or unsigned) in r1 (high word) and r0 (low word), and a 32-bit divisor in r2. They return the quotient in r0 and the remainder in r1, so you can declare them in a C header as extern functions returning a 64-bit integer and mask out the quotient and remainder afterwards. An error (division by 0 or overflow) is indicated by a remainder whose absolute value is greater than or equal to the absolute value of the divisor. The signed division algorithm uses case distinction by the signs of both dividend and divisor; it does not convert to positive integers first, since that wouldn't detect all overflow conditions properly.
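The return convention for the unsigned routine can be sketched as follows. This is a Python illustration of the masking and the error test only; the helper name is made up, and it assumes the 64-bit return value carries r0 in its low word and r1 in its high word, as on a little-endian ARM target.

```python
MASK32 = 0xFFFFFFFF


def unpack_udiv_result(packed, divisor):
    """Split a packed 64-bit udiv result into (quotient, remainder).

    Assumption: low word = r0 (quotient), high word = r1 (remainder).
    Raises ArithmeticError when the remainder signals an error, i.e. when
    it is not smaller than the divisor (division by zero or overflow).
    """
    quotient = packed & MASK32
    remainder = (packed >> 32) & MASK32
    if remainder >= divisor:
        raise ArithmeticError("division by zero or quotient overflow")
    return quotient, remainder


if __name__ == "__main__":
    # 100 / 7: quotient 14 in the low word, remainder 2 in the high word
    print(unpack_udiv_result((2 << 32) | 14, 7))  # prints: (14, 2)
```

The signed routine would need the same split with both words reinterpreted as signed values and the comparison done on absolute values.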

Leotie answered 14/6, 2015 at 17:38 Comment(0)
