Expression templates vs. hand-written code
I am currently writing a C++ expression template library and comparing some instantiations with hand-written code at the assembly level. The hand-written function is the following:

spinor multiply(vector const& a, vector const& b)
{
        spinor result = {
                a.at<1>() * b.at<1>() - a.at<2>() * b.at<2>()
                          - a.at<4>() * b.at<4>() - a.at<8>() * b.at<8>(),
                a.at<1>() * b.at<2>() - a.at<2>() * b.at<1>(),
                a.at<1>() * b.at<4>() - a.at<4>() * b.at<1>(),
                a.at<1>() * b.at<8>() - a.at<8>() * b.at<1>(),
                a.at<2>() * b.at<4>() - a.at<4>() * b.at<2>(),
                a.at<2>() * b.at<8>() - a.at<8>() * b.at<2>(),
                a.at<4>() * b.at<8>() - a.at<8>() * b.at<4>()
        };

        return result;
}

The vector class is just a wrapper around four doubles, which can be read through the at<index>() member function. Because of design decisions, the four components are indexed with 1, 2, 4, 8 instead of the usual 0, 1, 2, 3.

The purpose of this function is to return the result of a multiplication of two vectors (in Minkowski-space). If you are familiar with Geometric Algebra, you will see the dot-product (first component of result, symmetric under exchange of a and b) and wedge-product (rest of the components, antisymmetric under exchange of a and b). If you are not familiar with Geometric Algebra, just take this function as a prescription for multiplying vectors.

If I compile the function above with GCC 4.7 and look at the disassembly given by objdump -SC a.out, I get the following output:

400bc0: movsd  0x8(%rsi),%xmm6
400bc5: mov    %rdi,%rax
400bc8: movsd  (%rsi),%xmm8
400bcd: movsd  0x8(%rdx),%xmm5
400bd2: movapd %xmm6,%xmm9
400bd7: movsd  (%rdx),%xmm7
400bdb: movapd %xmm8,%xmm0
400be0: mulsd  %xmm5,%xmm9
400be5: movsd  0x10(%rsi),%xmm4
400bea: mulsd  %xmm7,%xmm0
400bee: movsd  0x10(%rdx),%xmm1
400bf3: movsd  0x18(%rdx),%xmm3
400bf8: movsd  0x18(%rsi),%xmm2
400bfd: subsd  %xmm9,%xmm0
400c02: movapd %xmm4,%xmm9
400c07: mulsd  %xmm1,%xmm9
400c0c: subsd  %xmm9,%xmm0
400c11: movapd %xmm3,%xmm9
400c16: mulsd  %xmm2,%xmm9
400c1b: subsd  %xmm9,%xmm0
400c20: movapd %xmm6,%xmm9
400c25: mulsd  %xmm7,%xmm9
400c2a: movsd  %xmm0,(%rdi)
400c2e: movapd %xmm5,%xmm0
400c32: mulsd  %xmm8,%xmm0
400c37: subsd  %xmm9,%xmm0
400c3c: movapd %xmm4,%xmm9
400c41: mulsd  %xmm7,%xmm9
400c46: mulsd  %xmm2,%xmm7
400c4a: movsd  %xmm0,0x8(%rdi)
400c4f: movapd %xmm1,%xmm0
400c53: mulsd  %xmm8,%xmm0
400c58: mulsd  %xmm3,%xmm8
400c5d: subsd  %xmm9,%xmm0
400c62: subsd  %xmm7,%xmm8
400c67: movapd %xmm4,%xmm7
400c6b: mulsd  %xmm5,%xmm7
400c6f: movsd  %xmm0,0x10(%rdi)
400c74: mulsd  %xmm2,%xmm5
400c78: movapd %xmm1,%xmm0
400c7c: mulsd  %xmm6,%xmm0
400c80: movsd  %xmm8,0x18(%rdi)
400c86: mulsd  %xmm3,%xmm6
400c8a: mulsd  %xmm2,%xmm1
400c8e: mulsd  %xmm4,%xmm3
400c92: subsd  %xmm7,%xmm0
400c96: subsd  %xmm5,%xmm6
400c9a: subsd  %xmm1,%xmm3
400c9e: movsd  %xmm0,0x20(%rdi)
400ca3: movsd  %xmm6,0x28(%rdi)
400ca8: movsd  %xmm3,0x30(%rdi)
400cad: retq   
400cae: nop
400caf: nop

This looks pretty good to me: the components of the first (%rsi) and second (%rdx) vector are each read only once, and the actual computations are done entirely in registers. At the end, the result is written to the address in the register %rdi. Since this is the first argument register, I think return value optimization is employed here.

Compare this with following listing for the expression template version of the function above:

400cb0: mov    (%rsi),%rdx
400cb3: mov    0x8(%rsi),%rax
400cb7: movsd  0x1f1(%rip),%xmm4        # 400eb0 <_IO_stdin_used+0x10>
400cbf: movsd  0x10(%rdx),%xmm3
400cc4: movsd  0x18(%rdx),%xmm0
400cc9: mulsd  0x10(%rax),%xmm3
400cce: xorpd  %xmm4,%xmm0
400cd2: mulsd  0x18(%rax),%xmm0
400cd7: movsd  0x8(%rdx),%xmm2
400cdc: movsd  (%rdx),%xmm1
400ce0: mulsd  0x8(%rax),%xmm2
400ce5: mulsd  (%rax),%xmm1
400ce9: subsd  %xmm3,%xmm0
400ced: subsd  %xmm2,%xmm0
400cf1: addsd  %xmm0,%xmm1
400cf5: movsd  %xmm1,(%rdi)
400cf9: movsd  (%rdx),%xmm0
400cfd: movsd  0x8(%rdx),%xmm1
400d02: mulsd  0x8(%rax),%xmm0
400d07: mulsd  (%rax),%xmm1
400d0b: subsd  %xmm1,%xmm0
400d0f: movsd  %xmm0,0x8(%rdi)
400d14: movsd  (%rdx),%xmm0
400d18: movsd  0x10(%rdx),%xmm1
400d1d: mulsd  0x10(%rax),%xmm0
400d22: mulsd  (%rax),%xmm1
400d26: subsd  %xmm1,%xmm0
400d2a: movsd  %xmm0,0x10(%rdi)
400d2f: movsd  0x8(%rdx),%xmm0
400d34: movsd  0x10(%rdx),%xmm1
400d39: mulsd  0x10(%rax),%xmm0
400d3e: mulsd  0x8(%rax),%xmm1
400d43: subsd  %xmm1,%xmm0
400d47: movsd  %xmm0,0x18(%rdi)
400d4c: movsd  (%rdx),%xmm0
400d50: movsd  0x18(%rdx),%xmm1
400d55: mulsd  0x18(%rax),%xmm0
400d5a: mulsd  (%rax),%xmm1
400d5e: subsd  %xmm1,%xmm0
400d62: movsd  %xmm0,0x20(%rdi)
400d67: movsd  0x8(%rdx),%xmm0
400d6c: movsd  0x18(%rdx),%xmm1
400d71: mulsd  0x18(%rax),%xmm0
400d76: mulsd  0x8(%rax),%xmm1
400d7b: subsd  %xmm1,%xmm0
400d7f: movsd  %xmm0,0x28(%rdi)
400d84: movsd  0x10(%rdx),%xmm0
400d89: movsd  0x18(%rdx),%xmm1
400d8e: mulsd  0x18(%rax),%xmm0
400d93: mulsd  0x10(%rax),%xmm1
400d98: subsd  %xmm1,%xmm0
400d9c: movsd  %xmm0,0x30(%rdi)
400da1: retq   

The signature of this function is

spinor<product<vector, vector>>(product<vector, vector> const&)

I hope you trust me that both versions give the same result. The first two lines load the first and second vector, which are stored as references in product. I wondered about the following things:

  • What does movsd 0x1f1(%rip),%xmm4 in combination with xorpd %xmm4,%xmm0 do? I already found out that this is called "RIP-relative addressing", see http://www.x86-64.org/documentation/assembly.html
  • Why does GCC not make use of more registers, e.g. to cache 0x10(%rax) which is read four times?

I also benchmarked both functions by generating 100000000 random vectors and taking the time both functions needed:

ET: 7.5 sec
HW: 6.8 sec

The hand-written function is about 10% faster. Does anyone have experience with expression templates and knows how to make them perform closer to their hand-written counterpart?

Obstructionist answered 22/3, 2012 at 11:33 Comment(7)
Nothing to do with your problem, but even with scoping rules I would try not to name new classes the same as some of the standard container classes. If you use using namespace std; things might become very confusing. – Thereinto
This is going to be tough without the actual ET code. Care to share? – Cassiecassil
@JoachimPileborg: if you use using namespace std; I don't care if you get issues (you should be more specific)... but I agree with the general idea about trying to avoid name clashes anyway. – Alysa
That's a perfect reason not to use using namespace std ;-) – Achieve
Could you please show us how you implemented the expression template solution? – Cutcherry
Everything's in a separate namespace, so there is no problem with namespace std. – Obstructionist
I will publish the code eventually on GitHub, but for now it's just too many functions in too many files. – Obstructionist
It would be clearer if we knew for sure the contents of address 0x400eb0, but I suspect it is 0x8000 0000 0000 0000 8000 0000 0000 0000 or similar (possibly with leading zeros, because the code is not vectorized), written as a 128-bit integer.

In that case, the xorpd flips the sign of its second (destination) operand, %xmm0.

As to why the value isn't cached in a register: better ask that on the gcc-help mailing list. Possibly the compiler can't prove that the two vectors, or an intermediate result, do not alias.

But contrary to the general opinion, compilers do not always optimize perfectly; they merely do better than 90% (or 99%?) of all programmers who try to write assembly, and sometimes (rarely) they produce really slow code.

But your approach is very good: benchmarking and looking at the generated object code is the right thing to do if you want to optimize.

PS: The code could possibly be accelerated by using vector instructions (mulpd instead of mulsd, which multiplies two or four doubles in one go), aka SSE or AVX. But some instructions are needed to shuffle the values into the right places in the registers, so the gain is always less than a factor of two or four.

Achieve answered 22/3, 2012 at 13:0 Comment(8)
I also do not know what is at this address; it is pointing beyond the end of my code (see the comment generated by objdump above). Here is a full printout: – Obstructionist
students.uni-mainz.de/cschwan/asm.out (older version, but the situation remains; see line 373) – Obstructionist
I also thought about using SSE, but I think it is very hard to combine expression templates with SSE intrinsics, especially because the computations may contain signs and are possibly very different (compare the first and second components in multiply). – Obstructionist
You can try objdump -s a.out and look at the contents of .data, which is usually relocated to the end of the .text section. You can find the exact relocation information using objdump -r; look for the value 0x400cba, which is the location of the offset in the movsd 0x1f1(%rip)... instruction. – Achieve
Thanks! I finally found the value: it's just 16 bytes of zeros in a .rodata section. Now I wonder why GCC does not emit xorpd %xmm4,%xmm4 instead of loading a zero from a read-only section into that register. – Obstructionist
I had a similar issue in my code (using gcc-4.7 too) with a zero 128-bit constant. The point was that the variable was declared using intrinsics as static const __m128i zero = _mm_set1_epi8(0), which is the recommended way. But the compiler generated a load from a memory block containing zeros. After declaring it static const __m128i zero = {0}, it got correctly folded to a xor instruction. But another question in your case is: why isn't the xor optimized away completely? After all, if the source operand is 0, it does nothing. – Achieve
That's a good question. Maybe it is some kind of remnant from the optimization step. I have to compile with -fno-signed-zeros in order to get rid of + 0.0 expressions, which are added when certain sub-expressions must be left out. – Obstructionist
@Obstructionist If you don't care about NaNs, infinities, and strict C-standard/IEEE 754 conformance, you can always use -ffast-math, which allows reordering of instructions and some other "mathematical" optimizations. This doesn't mean that the results are generally less precise; they may only differ across architectures. – Achieve
