Speed up large modular multiplication in base 2^8 without multiplier
Asked Answered
E

1

6

I am currenty converting the nacl library to risc-v. I already have poly1305 working. I am trying to do this using the risc-v core instruction set, so I don't have a multiplier. The algorithm for Pol1305 is using at the moment ceil(m/16)*17*17 8-bit multiplications where m is the message length in bytes(multiplying two 2^130 integers in base 2^8 modulo 2^130-5). So I want to use a fast multiplication algorithm to keep it fast.

At the moment I have the shift-and-add algorithm working for the multiplication. However that takes 63 cycles for 8-bit values, since I need to avoid branching (timing side-channel) and thus some masking is involved which takes a few more cycles.

    andi  t2, t0, 1     //t0 is the multiplier
    sub   t2, zero, t2  //creating a mask
    and   t3, t1, t2    //applying the mask to the multiplicand
    add   a0, a0, t3    //doing the add
    srli  t0, t0, 1     //shifting the multiplier
    slli  t1, t1, 1     //shifting the multiplicand

This is giving me valid results at 63 cycles per multiplication. The issue is that the total execution time of the program is 175219 cycles for a message of 131 bytes. Of this time 9*17*17*63 = 163863 cycles are used for multiplication. Which I would like to improve.

Emmer answered 1/11, 2019 at 14:4 Comment(17)
The ban on branches sounds like crypto. Are data-dependent memory accesses allowed? For Quarter Square multiplication or similarFlaminius
Do you mean this code block of 6 instructions takes 63 cycles on the core you care about? Or 32 iterations of this inside a loop takes 63 cycles? You should show at least that inner loop. (And what do you mean you're "not allowed" to use branches? Do you actually mean that branchy would be slower for random data so you choose to avoid it?) Also why use base 2^8, instead of e.g. base 2^32 to use the full width of a register? Or at least base 2^16, or maybe 2^30 if you need to leave room for manual carry-propagation. (IIRC RISC-V doesn't have flags so add-with-carry needs emulation.)Periwig
Is your core superscalar at all? Can you expose ILP by interleaving two chunks of the full multiply?Periwig
possible micro-optimization: use SAR by 31 to broadcast a bit to all positions, instead of and/sub. (To get low bits into the top position, maybe rotate right? Or start with shifts outside the loop to set up.) If you bottleneck on shift ALU throughput, you can replace left-shift with add same,samePeriwig
@harold yeah it is crypto, data-dependent memory accesses are not allowed, but I will look into quarter square multiplication anyways to check.Emmer
@PeterCordes This code block of 6 is a part of the unrolled loop, so this is repeated 8 times. I am not allowed to use branches due to the possiblity of a timing attack. Where information about the data leaks depending on whether a branch is taken or not, so it is not a choice it is mandatory. About using a different base I was trying to get it to work on base 2^16, but I can't get the math to work out. I made a post about it [here]( math.stackexchange.com/questions/3417910/…) if you are interested.Emmer
So, just to be clear (referencing the math. question), cb x 39 takes 8 of these sequences to do one 8-bitx8-bit=>16-bit result, then ab x 39 takes another (set of 8), cb x 25 another, and so on for a total of 8 of these (8x8=>16) multiplies?Veliz
Can you answer whether you can use a lookup table, say maybe 512 bytes in size?Veliz
And how about a case statement (indirect branch) where each case is exactly the same instruction sequence?Veliz
@ErikEidt: The OP already said data-dependent memory accesses are not allowed (presumably because the machine has cache which could miss). If the machine has indirect branch prediction at all, even a computed jump (not jump table) would be ruled out as well. But yeah it might not.Periwig
@StefanvdBerg: I edited the phrasing of your question to say you "need to avoid" branches. In English, saying you're "not allowed" to use branches makes it sound like some arbitrary requirement that you don't understand or agree with, e.g. in a homework assignment. "need to avoid for timing side channels" is much clearer.Periwig
Sorry folks, what is a "data dependent" memory access (vs. a non-...). does that include any table lookup no matter the size of the table??Veliz
If you always access all 512 bytes of a table (and discard what you don't want) then it's not a data dependent memory access (because what you access didn't depend on what you wanted to access).Nonobjective
@ErikEidt: A data dependent memory access is something like "y = array[x]" where an attacker can determine which cache line you accessed (and infer the value of x) from measuring the time differences between "cache hit or miss" of other (possibly only related by cache aliasing) accesses.Nonobjective
@ErikEidt A switch case is not allowed due to the branch prediction as PeterCordes noted. It is possible to transform an if-statement into a mask as shown in the example, however this takes 3 instructions, so for an switch case it would take 3*the amount of cases to create all the masks.Emmer
@StefanvdBerg, so a timing attack would be able to tell if not all the cases of the switch statement were executed, even if there were only 16 cases, and one of them was sure to run, and, each case was the same count and same instructions (same opcodes but different operands)?Veliz
@ErikEidt to go to a case a branch statement would be used, which can have a timing attack due to the branch predictor, so the only way to create something similar is to create 16 masks and apply that to the data, but that would take more too many cycles to work.Emmer
C
3

Here is a little improved code. This is based on the algorithm shown in Patterson and Hennessy's textbook.

    // initialization
    add   a0, zero, t0  //copy multiplier to the product
    slli  t1, t1, 8     //shift the multiplicand to the left half of a 16bit register

    // repeat 8 times from here
    andi  t2, a0, 1     //the right half of a0 is the multiplier
    sub   t2, zero, t2  //creating a mask
    and   t3, t1, t2    //applying the mask to the multiplicand
    add   a0, a0, t3    //doing the add to the left half of the product
    srli  a0, a0, 1     //shifting the product
    // to here

Also, you can apply the loop-unfolding method by repeating the above code 8 times instead of looping by branch/jump.

On the algorithm level, Karatsuba algorithm can reduce the number of 8bit multiplications in multi-precision arithmetic.

Chide answered 7/11, 2019 at 4:33 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.