In Hacker's Delight there is an algorithm to calculate the double-word product of two (signed) words. The function muldws1 uses four multiplications and five additions to calculate the double word from two words.
Towards the end of that code there is a line commented out
/* w[1] = u*v; // Alternative. */
This alternative uses five multiplications and four additions, i.e. it exchanges an addition for a multiplication.
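For reference, the half-word decomposition behind this operation count can be sketched as follows. This is a reconstruction from memory, not a verbatim copy of the book's code: the name `muldws_sketch`, the pointer outputs, and the fixed-width types are assumptions made for illustration, and the code assumes two's complement with an arithmetic right shift of negative values.

```c
#include <stdint.h>

/* Sketch of a signed 32x32 -> 64 multiply built from 16-bit halves.
   Assumes >> on a negative int32_t is an arithmetic shift. */
void muldws_sketch(int32_t *hi, uint32_t *lo, int32_t u, int32_t v) {
    uint32_t u0 = (uint32_t)u & 0xFFFF;   /* low half, unsigned */
    int32_t  u1 = u >> 16;                /* high half, signed  */
    uint32_t v0 = (uint32_t)v & 0xFFFF;
    int32_t  v1 = v >> 16;

    uint32_t w0 = u0 * v0;                                  /* multiply 1 */
    int32_t  t  = u1 * (int32_t)v0 + (int32_t)(w0 >> 16);   /* multiply 2, add 1 */
    int32_t  w1 = t & 0xFFFF;
    int32_t  w2 = t >> 16;
    w1 = (int32_t)u0 * v1 + w1;                             /* multiply 3, add 2 */

    *hi = u1 * v1 + w2 + (w1 >> 16);                        /* multiply 4, adds 3 and 4 */
    *lo = ((uint32_t)w1 << 16) + (w0 & 0xFFFF);             /* add 5 */
    /* Alternative for the low word: *lo = (uint32_t)u * (uint32_t)v;
       this replaces add 5 with a fifth multiply. */
}
```

None of the intermediate sums can overflow a signed 32-bit word here, because each partial product has at least one operand limited to 16 bits.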
But I think this alternative method can be improved on certain hardware. Assume a hypothetical CPU that can calculate the lower word of the product of two words but not the upper word (e.g. for 32-bit words, 32x32 to lower 32). Here is what I have come up with, assuming 32-bit words (the same concept would work for 64-bit words).
void muldws1_improved(int w[], int32_t x, int32_t y) {
    /* Split each operand into an unsigned low half and a signed high half. */
    uint16_t xl = x; int16_t xh = x >> 16;
    uint16_t yl = y; int16_t yh = y >> 16;
    uint32_t lo = x*y;                  /* lower word of the full product */
    int32_t t = xl*yh + xh*yl;          /* sum of the two cross products */
    uint16_t tl = t; int16_t th = t >> 16;
    uint16_t loh = lo >> 16;
    int32_t cy = loh < tl;              /* carry */
    int32_t hi = xh*yh + th + cy;
    w[0] = hi; w[1] = lo;
}
This uses four multiplications, three additions, and one comparison. This is a smaller improvement than I had hoped for.
Can this be improved? Is there a better way to determine the carry flag? I should point out that I am also assuming the hardware has no carry flag (e.g. no ADDC instruction) but words can be compared (e.g. word1<word).
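The comparison-based carry that this question relies on can be sketched in isolation. This is a minimal illustration, assuming unsigned 32-bit words that wrap on overflow; the helper name `add_with_carry` is hypothetical and not part of the original code.

```c
#include <stdint.h>

/* Adds two unsigned words and reports the carry-out without a carry flag.
   Unsigned addition wraps modulo 2^32, so the sum is smaller than an
   operand exactly when a carry occurred. */
uint32_t add_with_carry(uint32_t a, uint32_t b, uint32_t *carry) {
    uint32_t sum = a + b;
    *carry = sum < a;
    return sum;
}
```

The same test works on 16-bit halves, provided the sum is truncated to 16 bits before the comparison.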
Edit: as Sander De Dycker pointed out, my function fails the unit tests. Here is a version which passes the unit tests, but it is less efficient. I think it can be improved.
void muldws1_improved_v2(int w[], int32_t x, int32_t y) {
    /* Split each operand into an unsigned low half and a signed high half. */
    uint16_t xl = x; int16_t xh = x >> 16;
    uint16_t yl = y; int16_t yh = y >> 16;
    uint32_t lo = x*y;                  /* lower word of the full product */
    int32_t t2 = xl*yh;                 /* cross products kept separate */
    int32_t t3 = xh*yl;
    int32_t t4 = xh*yh;
    uint16_t t2l = t2; int16_t t2h = t2 >> 16;
    uint16_t t3l = t3; int16_t t3h = t3 >> 16;
    uint16_t loh = lo >> 16;
    uint16_t t = t2l + t3l;             /* truncated to 16 bits */
    int32_t carry = (t < t2l) + (loh < t);
    int32_t hi = t4 + t2h + t3h + carry;
    w[0] = hi; w[1] = lo;
}
This uses four multiplications, five additions, and two comparisons, which is worse than the original function.
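One way to check a candidate such as the second version is to compare it against a 64-bit reference product. The sketch below assumes `int64_t` is available on the machine running the test (the hypothetical target CPU need not support it) and that right shifts of negative values are arithmetic. The names `muldws_v2`, `muldws_ref`, and `matches_ref` are illustrative; the v2 logic is restated with pointer outputs and an unsigned low-word multiply so that the test does not depend on signed overflow.

```c
#include <stdint.h>

/* Restatement of the second version with explicit output pointers. */
void muldws_v2(int32_t *hi, uint32_t *lo, int32_t x, int32_t y) {
    uint16_t xl = (uint16_t)x;  int16_t xh = (int16_t)(x >> 16);
    uint16_t yl = (uint16_t)y;  int16_t yh = (int16_t)(y >> 16);
    uint32_t l  = (uint32_t)x * (uint32_t)y;   /* low word only */
    int32_t t2 = (int32_t)xl * yh;
    int32_t t3 = (int32_t)xh * yl;
    int32_t t4 = (int32_t)xh * yh;
    uint16_t t2l = (uint16_t)t2;  int16_t t2h = (int16_t)(t2 >> 16);
    uint16_t t3l = (uint16_t)t3;  int16_t t3h = (int16_t)(t3 >> 16);
    uint16_t loh = (uint16_t)(l >> 16);
    uint16_t t   = (uint16_t)(t2l + t3l);      /* wraps at 16 bits */
    int32_t carry = (t < t2l) + (loh < t);     /* 0, 1, or 2 */
    *hi = t4 + t2h + t3h + carry;
    *lo = l;
}

/* Reference result using a native 64-bit multiply. */
void muldws_ref(int32_t *hi, uint32_t *lo, int32_t x, int32_t y) {
    int64_t p = (int64_t)x * (int64_t)y;
    *hi = (int32_t)(p >> 32);
    *lo = (uint32_t)p;
}

/* Returns 1 if the candidate agrees with the reference for (x, y). */
int matches_ref(int32_t x, int32_t y) {
    int32_t h1, h2;
    uint32_t l1, l2;
    muldws_v2(&h1, &l1, x, y);
    muldws_ref(&h2, &l2, x, y);
    return h1 == h2 && l1 == l2;
}
```

Calling `matches_ref` over edge cases (both operands negative, `INT32_MIN`, operands whose cross products sum past 32 bits) is a cheap way to catch a lost carry before counting operations.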
s. Of course, it may be that 16-bit shifts are free or extremely cheap, depending on CPU architecture. – Robichaud
{0x7fffffff, 0x7eeeeeee, 0x3f777776, 0x81111112} unit test from the original code. – Wishful
int64_t. In fact it returns nothing. – Cosma
int w[] should be int32_t w[] or better yet: int32_t *whi, uint32_t *wlo. – Aseptic