I'm working on arithmetic for multiplication of very long integers (some 100,000 decimal digits). As part of my library I to add two long numbers.
Profiling shows that my code runs up to 25% of it's time in the add() and sub() routines, so it's important they are as fast as possible. But I don't see much potential, yet. Maybe you can give me some help, advice, insight or ideas. I'll test them and get back to you.
So far my add routine does some setup and then uses a 8-times unrolled loop:
mov rax, QWORD PTR [rdx+r11*8-64]
mov r10, QWORD PTR [r8+r11*8-64]
adc rax, r10
mov QWORD PTR [rcx+r11*8-64], rax
7 more blocks with different offsets follow and then it loops.
I tried loading the values from memory earlier, but that didn't help. I guess that is because of good prefetching. I use an Intel i7-3770 Ivy Bridge 4-core CPU. But I'd like to write code that works good on any modern CPU.
Edit: I did some timings: It adds 1k words in about 2.25 cycles/word. If I remove the ADC, so only the MOVs remain, it still takes about 1.95 cycles/word. So the main bottleneck seems to be the memory access. A library memcpy()
works in about 0.65 cycles/word, but has only one input, not two. Still, it's much faster because of its use of SSE registers, I guess.
Some questions:
- Is it useful to use "load, load, add, store" structure or would a "load, add-to-memory" help? So far my tests didn't show any advantages.
- As usual, there is no help from SSE(2,3,4) to be expected?
- Does the addressing (scaled index plus base plus offset) impact badly? I could use
ADD r11, 8
instead. - What about the loop unrolling? I read unrolling was bad for Sandy Bridge architecture (Agner Fog http://www.agner.org/optimize/). Is it to be preferred or avoided?
- (Edit) Can I use SSE registers to load and store words in larger chunks from memory and efficiently exchange words with general purpose registers and SSE registers?
I highly appreciate any comments.