I implemented the Sven's code as UINT16, to intensively test it:
uint16_t muldiv16(uint16_t a, uint16_t b, uint16_t c);
int main(int argc, char *argv[]){
uint32_t a;
uint32_t b;
uint32_t c;
uint16_t r1, r2;
// ~167 days, estimated on i7 6700k, single thread.
// Split the 'a' range, to run several instances of this code on multi-cores processor
// ~1s, with an UINT8 implementation
for(a=0; a<=UINT16_MAX; a++){
for(b=0; b<=UINT16_MAX; b++){
for(c=1; c<=UINT16_MAX; c++){
r1 = uint16_t( a*b/c );
r2 = muldiv16(uint16_t(a), uint16_t(b), uint16_t(c));
if( r1 != r2 ){
std::cout << "Err: " << a << " * " << b << " / " << c << ", result: " << r2 << ", exected: " << r1 << std::endl;
return -1;
}
}
}
std::cout << a << std::endl
}
std::cout << "Done." << std::endl;
return 0;
}
Unfortunately, it seems that it is limited to UINT31 for 'b' (0-2147483647).
Here is my correction, that seems to work (not completed the test on UINT16, but run a lot. Completed on UINT8).
uint32_t muldiv32(uint32_t a, uint32_t b, uint32_t c)
{
uint32_t q = 0; // the quotient
uint32_t r = 0; // the remainder
uint32_t qn = b / c;
uint32_t rn = b % c;
uint32_t r_carry;
uint32_t rn_carry;
while(a)
{
if (a & 1)
{
q += qn;
r_carry = (r > UINT32_MAX-rn);
r += rn;
if (r >= c || r_carry)
{
q++;
r -= c;
}
}
a >>= 1;
qn <<= 1;
rn_carry = rn & 0x80000000UL;
rn <<= 1;
if (rn >= c || rn_carry)
{
qn++;
rn -= c;
}
}
return q;
}
Edit: an improvement, that returns the remainder, manages the round, warns about overflow and, of course, manages the full range of UINT32 for a, b and c:
typedef enum{
ROUND_DOWNWARD=0,
ROUND_TONEAREST,
ROUND_UPWARD
}ROUND;
//remainder is always positive for ROUND_DOWN ( a * b = c * q + remainder )
//remainder is always negative for ROUND_UPWARD ( a * b = c * q - remainder )
//remainder is signed for ROUND_CLOSEST ( a * b = c * q + sint32_t(remainder) )
uint32_t muldiv32(uint32_t a, uint32_t b, uint32_t c, uint32_t *remainder, ROUND round, uint8_t *ovf)
{
uint32_t q = 0; // the quotient
uint32_t r = 0; // the remainder
uint32_t qn = b / c;
uint32_t rn = b % c;
uint32_t r_carry;
uint32_t rn_carry;
uint8_t o = 0;
uint8_t rup;
while(a)
{
if (a & 1)
{
o |= (q > UINT32_MAX-qn);
q += qn;
r_carry = (r > UINT32_MAX-rn);
r += rn;
if (r >= c || r_carry)
{
o |= (q == UINT32_MAX);
q++;
r -= c;
}
}
a >>= 1;
qn <<= 1;
rn_carry = rn & 0x80000000;
rn <<= 1;
if (rn >= c || rn_carry)
{
qn++;
rn -= c;
}
}
rup = (round == ROUND_UPWARD && r);
rup |= (round == ROUND_TONEAREST && ((r<<1) >= c || r & 0x80000000));
if(rup)
{ //round
o |= (q == UINT32_MAX);
q++;
r = (round == ROUND_UPWARD) ? c-r : r-c;
}
if(remainder)
*remainder = r;
if(ovf)
*ovf = o;
return q;
}
Maybe there could exist another approach, perhaps even more efficient:
8-bits, 16-bits and 32-bits MCU are able to compute 64-bits calculations (long long int).
Anyone known how the compilers emulate it?
Edit 2:
Here is some interresting timings, on 8-bits MCU:
UINT8 x UINT8 / UINT8: 3.5µs
UINT16 x UINT16 / UINT16: 22.5µs, muldiv8: 29.9 to 45.3µs
UINT32 x UINT32 / UINT32: 84µs, muldiv16: 120 to 189µs
FLOAT32 * FLOAT32 / FLOAT32: 40.2 ot 135.5µs, muldiv32: 1.193 to 1.764ms
And on 32-bits MCU:
Type - optimized code - without optimization
UINT32: 521ns - 604ns
UINT64: 2958ns - 3313ns
FLOAT32: 2563ns - 2688ns
muldiv32: 6791ns - 25375ns
So, the compilers are clever than this C algorithm.
And it is always better to work with float variables (even without FPU) than whith integer bigger than the native registers (even though float32 has worst precision than uint32, starting 16777217).
Edit3: Ok, so: my N-bits MCU are using a N-bits MUL N-bits
native instruction, that produce a 2N-bits result, stored into two N-Bits registers.
Here, you can found a C implementation (prefer the EasyasPi's solution)
But they don't have 2N-bits DIV N-bits
native instruction. Instead, they are using the __udivdi3 function from gcc, with loops and 2N-bits variables (here, UINT64). So, this cannot be a solution for the original question.
UINT_MAX * UINT_MAX / 1
for example) – Horizon