Expression templates vs. hand-written code
I am currently writing a C++ expression template library and comparing some instantiations with hand-written code at the assembly level. The hand-written function is the following:

spinor multiply(vector const& a, vector const& b)
{
        spinor result = {
                a.at<1>() * b.at<1>() - a.at<2>() * b.at<2>()
                          - a.at<4>() * b.at<4>() - a.at<8>() * b.at<8>(),
                a.at<1>() * b.at<2>() - a.at<2>() * b.at<1>(),
                a.at<1>() * b.at<4>() - a.at<4>() * b.at<1>(),
                a.at<1>() * b.at<8>() - a.at<8>() * b.at<1>(),
                a.at<2>() * b.at<4>() - a.at<4>() * b.at<2>(),
                a.at<2>() * b.at<8>() - a.at<8>() * b.at<2>(),
                a.at<4>() * b.at<8>() - a.at<8>() * b.at<4>()
        };

        return result;
}

The vector class is just a wrapper around four doubles, which can be read through the at<index>() member function. Because of design decisions, the four components are indexed with 1, 2, 4, 8 instead of the usual 0, 1, 2, 3.

The purpose of this function is to return the result of a multiplication of two vectors (in Minkowski-space). If you are familiar with Geometric Algebra, you will see the dot-product (first component of result, symmetric under exchange of a and b) and wedge-product (rest of the components, antisymmetric under exchange of a and b). If you are not familiar with Geometric Algebra, just take this function as a prescription for multiplying vectors.

If I compile the function above with GCC 4.7 and look at the disassembly given by objdump -SC a.out, I get the following output:

400bc0: movsd  0x8(%rsi),%xmm6
400bc5: mov    %rdi,%rax
400bc8: movsd  (%rsi),%xmm8
400bcd: movsd  0x8(%rdx),%xmm5
400bd2: movapd %xmm6,%xmm9
400bd7: movsd  (%rdx),%xmm7
400bdb: movapd %xmm8,%xmm0
400be0: mulsd  %xmm5,%xmm9
400be5: movsd  0x10(%rsi),%xmm4
400bea: mulsd  %xmm7,%xmm0
400bee: movsd  0x10(%rdx),%xmm1
400bf3: movsd  0x18(%rdx),%xmm3
400bf8: movsd  0x18(%rsi),%xmm2
400bfd: subsd  %xmm9,%xmm0
400c02: movapd %xmm4,%xmm9
400c07: mulsd  %xmm1,%xmm9
400c0c: subsd  %xmm9,%xmm0
400c11: movapd %xmm3,%xmm9
400c16: mulsd  %xmm2,%xmm9
400c1b: subsd  %xmm9,%xmm0
400c20: movapd %xmm6,%xmm9
400c25: mulsd  %xmm7,%xmm9
400c2a: movsd  %xmm0,(%rdi)
400c2e: movapd %xmm5,%xmm0
400c32: mulsd  %xmm8,%xmm0
400c37: subsd  %xmm9,%xmm0
400c3c: movapd %xmm4,%xmm9
400c41: mulsd  %xmm7,%xmm9
400c46: mulsd  %xmm2,%xmm7
400c4a: movsd  %xmm0,0x8(%rdi)
400c4f: movapd %xmm1,%xmm0
400c53: mulsd  %xmm8,%xmm0
400c58: mulsd  %xmm3,%xmm8
400c5d: subsd  %xmm9,%xmm0
400c62: subsd  %xmm7,%xmm8
400c67: movapd %xmm4,%xmm7
400c6b: mulsd  %xmm5,%xmm7
400c6f: movsd  %xmm0,0x10(%rdi)
400c74: mulsd  %xmm2,%xmm5
400c78: movapd %xmm1,%xmm0
400c7c: mulsd  %xmm6,%xmm0
400c80: movsd  %xmm8,0x18(%rdi)
400c86: mulsd  %xmm3,%xmm6
400c8a: mulsd  %xmm2,%xmm1
400c8e: mulsd  %xmm4,%xmm3
400c92: subsd  %xmm7,%xmm0
400c96: subsd  %xmm5,%xmm6
400c9a: subsd  %xmm1,%xmm3
400c9e: movsd  %xmm0,0x20(%rdi)
400ca3: movsd  %xmm6,0x28(%rdi)
400ca8: movsd  %xmm3,0x30(%rdi)
400cad: retq   
400cae: nop
400caf: nop

This looks pretty good to me: the components of the first (%rsi) and second (%rdx) vector are each read only once, and the actual computations are done entirely in registers. At the end, the result is written to the address in the register %rdi. Since this is the first argument register, I think return value optimization is employed here.

Compare this with following listing for the expression template version of the function above:

400cb0: mov    (%rsi),%rdx
400cb3: mov    0x8(%rsi),%rax
400cb7: movsd  0x1f1(%rip),%xmm4        # 400eb0 <_IO_stdin_used+0x10>
400cbf: movsd  0x10(%rdx),%xmm3
400cc4: movsd  0x18(%rdx),%xmm0
400cc9: mulsd  0x10(%rax),%xmm3
400cce: xorpd  %xmm4,%xmm0
400cd2: mulsd  0x18(%rax),%xmm0
400cd7: movsd  0x8(%rdx),%xmm2
400cdc: movsd  (%rdx),%xmm1
400ce0: mulsd  0x8(%rax),%xmm2
400ce5: mulsd  (%rax),%xmm1
400ce9: subsd  %xmm3,%xmm0
400ced: subsd  %xmm2,%xmm0
400cf1: addsd  %xmm0,%xmm1
400cf5: movsd  %xmm1,(%rdi)
400cf9: movsd  (%rdx),%xmm0
400cfd: movsd  0x8(%rdx),%xmm1
400d02: mulsd  0x8(%rax),%xmm0
400d07: mulsd  (%rax),%xmm1
400d0b: subsd  %xmm1,%xmm0
400d0f: movsd  %xmm0,0x8(%rdi)
400d14: movsd  (%rdx),%xmm0
400d18: movsd  0x10(%rdx),%xmm1
400d1d: mulsd  0x10(%rax),%xmm0
400d22: mulsd  (%rax),%xmm1
400d26: subsd  %xmm1,%xmm0
400d2a: movsd  %xmm0,0x10(%rdi)
400d2f: movsd  0x8(%rdx),%xmm0
400d34: movsd  0x10(%rdx),%xmm1
400d39: mulsd  0x10(%rax),%xmm0
400d3e: mulsd  0x8(%rax),%xmm1
400d43: subsd  %xmm1,%xmm0
400d47: movsd  %xmm0,0x18(%rdi)
400d4c: movsd  (%rdx),%xmm0
400d50: movsd  0x18(%rdx),%xmm1
400d55: mulsd  0x18(%rax),%xmm0
400d5a: mulsd  (%rax),%xmm1
400d5e: subsd  %xmm1,%xmm0
400d62: movsd  %xmm0,0x20(%rdi)
400d67: movsd  0x8(%rdx),%xmm0
400d6c: movsd  0x18(%rdx),%xmm1
400d71: mulsd  0x18(%rax),%xmm0
400d76: mulsd  0x8(%rax),%xmm1
400d7b: subsd  %xmm1,%xmm0
400d7f: movsd  %xmm0,0x28(%rdi)
400d84: movsd  0x10(%rdx),%xmm0
400d89: movsd  0x18(%rdx),%xmm1
400d8e: mulsd  0x18(%rax),%xmm0
400d93: mulsd  0x10(%rax),%xmm1
400d98: subsd  %xmm1,%xmm0
400d9c: movsd  %xmm0,0x30(%rdi)
400da1: retq   

The signature of this function is

spinor<product<vector, vector>>(product<vector, vector> const&)

I hope you trust me that both versions give the same result. The first two lines load the first and second vector, which are stored as references in product. I wondered about the following things:

  • What does movsd 0x1f1(%rip),%xmm4 in combination with xorpd %xmm4,%xmm0 do? I already found out that this is called "RIP-relative addressing", see http://www.x86-64.org/documentation/assembly.html
  • Why does GCC not make use of more registers, e.g. to cache 0x10(%rax) which is read four times?

I also benchmarked both functions by generating 100000000 random vectors and taking the time both functions needed:

ET: 7.5 sec
HW: 6.8 sec

The hand-written function is about 10% faster. Does anyone have experience with expression templates and knows how to make them perform closer to their hand-written counterpart?

Obstructionist answered 22/3, 2012 at 11:33 Comment(7)
Nothing to do with your problem, but even with scoping rules I would try not to name new classes the same as some of the standard container classes. If you use using namespace std; things might become very confusing. – Thereinto
This is going to be tough without the actual ET code. Care to share? – Cassiecassil
@JoachimPileborg: if you use using namespace std; I don't care if you get issues (you should be more specific)... but I agree with the general idea about trying to avoid name clashes anyway. – Alysa
That's a perfect reason not to use using namespace std ;-) – Achieve
Could you please show us how you implemented the expression template solution? – Cutcherry
Everything's in a separate namespace, so there is no problem with namespace std. – Obstructionist
I will publish the code eventually on GitHub, but for now it's just too many functions in too many files. – Obstructionist
It would be clearer if we knew for sure the contents of address 0x400eb0, but I suspect it is 0x8000 0000 0000 0000 8000 0000 0000 0000 or similar (possibly with leading zeros, because the code is not vectorized), written as a 128-bit integer.

In that case, the xorpd flips the sign of its second (destination) operand, %xmm0.

As to why the value isn't cached in a register: better ask that on the gcc-help mailing list. Possibly the compiler can't prove that the two vectors, or an intermediate result, do not alias.

But contrary to the general opinion, compilers do not always optimize perfectly; they merely do better than 90% (or 99%?) of all programmers who try to write assembly, and sometimes (rarely) they produce really slow code.

But your approach is very good: benchmarking and looking at the generated object code is the right thing to do if you want to optimize.

PS: The code could possibly be accelerated by using vector instructions (mulpd instead of mulsd, which multiplies two or four doubles in one go), aka SSE or AVX. But some instructions are needed to shuffle the values into the right places in the registers, so the gain is always less than a factor of two or four.

Achieve answered 22/3, 2012 at 13:0 Comment(8)
I also do not know what is at this address; it is pointing beyond the end of my code (see the comment generated by objdump above). Here is a full printout: – Obstructionist
students.uni-mainz.de/cschwan/asm.out (older version, but the situation remains; see line 373) – Obstructionist
I also thought about using SSE, but I think it is very hard to combine expression templates with SSE intrinsics, especially because the computations may contain signs and are possibly very different (compare the first and second components in multiply). – Obstructionist
You can try objdump -s a.out and look at the contents of .data, which is usually relocated to the end of the .text section. You can find the exact relocation information using objdump -r; look for the value 0x400cba, which is the location of the offset in the movsd 0x1f1(%rip)... instruction. – Achieve
Thanks! I finally found the value: it's just 16 bytes of zeros in a .rodata section. Now I wonder why GCC does not emit xorpd %xmm4,%xmm4 instead of loading a zero from a read-only section into that register. – Obstructionist
I had a similar issue in my code (using gcc-4.7 too) with a zero 128-bit constant. The point was that the variable was declared using intrinsics as static const __m128i zero = _mm_set1_epi8(0), which is the recommended way. But the compiler generated a load from a memory block containing zeros. After declaring it static const __m128i zero = {0}, it got correctly folded to a xor instruction. But another question in your case is: why isn't the xor optimized away completely? After all, if the source operand is 0, it does nothing. – Achieve
That's a good question. Maybe it is some kind of remnant from the optimization step. I have to compile with -fno-signed-zeros in order to get rid of + 0.0 expressions, which are added when certain sub-expressions must be left out. – Obstructionist
@Obstructionist If you don't care about NaNs, infinities, and strict C-standard/IEEE 754 conformance, you can always use -ffast-math, which allows reordering of instructions and some other "mathematical" optimizations. This doesn't mean that the results are generally less precise; they may only differ across architectures. – Achieve
