I'm investigating ways to speed up a large section of C++ code that uses automatic differentiation to compute Jacobians. This involves doing some amount of work in the actual residuals, but the majority of the work (based on profiled execution time) is in calculating the Jacobians.
This surprised me, since most of the Jacobians are propagated forward from 0s and 1s, so the amount of work should be 2-4x that of the function, not 10-12x. To model what a large amount of the Jacobian work looks like, I made a super minimal example with just a dot product (instead of the sin, cos, sqrt and more that would appear in a real situation) that the compiler should be able to optimize down to a single return value:
#include <Eigen/Core>
#include <Eigen/Geometry>
using Array12d = Eigen::Matrix<double,12,1>;
double testReturnFirstDot(const Array12d& b)
{
    Array12d a;
    a.array() = 0.;
    a(0) = 1.;
    return a.dot(b);
}
Which should be the same as
double testReturnFirst(const Array12d& b)
{
    return b(0);
}
I was disappointed to find that, without fast-math enabled, none of GCC 8.2, Clang 6, or MSVC 19 was able to make any optimizations at all over the naive dot product with a matrix full of 0s. Even with fast-math (https://godbolt.org/z/GvPXFy) the optimizations are very poor in GCC and Clang (they still involve multiplications and additions), and MSVC doesn't do any optimizations at all.
I don't have a background in compilers, but is there a reason for this? I'm fairly sure that in a large proportion of scientific computations being able to do better constant propagation/folding would make more optimizations apparent, even if the constant-fold itself didn't result in a speedup.
While I'm interested in explanations of why this isn't done on the compiler side, I'm also interested in what I can do on the practical side to make my own code faster when facing these kinds of patterns.
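For example, one manual workaround along these lines (just a sketch, not taken from the real code: sparseDot is a made-up helper, and it assumes the positions of the nonzero seed entries are known at compile time and that those entries are 1.0, as in the example above) would be to skip the known-zero terms explicitly:

#include <Eigen/Core>

using Array12d = Eigen::Matrix<double,12,1>;

// Sketch: sum only the terms whose seed entries are known to be nonzero,
// instead of calling the generic dot product. NonZero... is the
// compile-time list of nonzero positions (C++17 fold expression).
template <int... NonZero>
double sparseDot(const Array12d& b)
{
    double sum = 0.0;
    ((sum += b(NonZero)), ...);
    return sum;
}

// Equivalent to testReturnFirstDot above, but with the zero terms removed
// by hand rather than relying on the compiler to fold them away.
double testReturnFirstSparse(const Array12d& b)
{
    return sparseDot<0>(b);
}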
Comments:

You don't have any constants, only locals and code to initialise them, which may have visible side effects. Also, with floating point the compiler would have to fully emulate the production environment to guarantee the same results. – Sasha

(1.0 / 3.0) * 3.0 != (1.0 * 3.0) / 3.0 because rounding behaviour is fully specified, so you cannot simply cancel the 3. – Caravaggio

It depends on how dot is implemented. Probably it is not just a for loop with accumulation, but involves rescaling. No wonder that compilers can't optimize it. – Elevated

-ffast-math is to say "it's not necessary to comply with the standard". The MSVC equivalent of fast-math is /fp:fast; you may find that it does some optimisation if you specify that. – Caravaggio

((1.0 * 1343.0 + 100)/1343.0 - 100/1343.0) == 0.99999999999999989, not 1.0 – Caravaggio

Array12d::dot(): you'll have to provide code for that to get help optimizing it. – Enhanced

-ffast-math still yields unsatisfactory results. Why are you elaborating so much on it? – Tyranny

Would constexpr functions help? – Curnin

With -ffast-math the remaining "problem" is explicit vectorization, see my answer. – Dewittdewlap

-O3 and the MS equivalent? – Georg
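To make the rounding argument in these comments concrete, here is a small standalone snippet (not part of the original discussion; the exact printed values depend on the platform's IEEE-754 arithmetic):

#include <cstdio>
#include <limits>

int main()
{
    // The mathematically exact value of this expression is 1.0, but each
    // division is rounded separately, so the floating-point result is not
    // guaranteed to be exactly 1.0 -- which is why the compiler cannot
    // cancel the 1343.0 the way ordinary algebra would.
    const double x = (1.0 * 1343.0 + 100) / 1343.0 - 100 / 1343.0;
    std::printf("%.17g (equal to 1.0? %d)\n", x, x == 1.0);

    // Similarly, 0.0 * b is not always 0.0 (it can be NaN or -0.0), so the
    // zero terms of the dot product cannot simply be dropped without
    // -ffast-math (or /fp:fast on MSVC).
    const double inf = std::numeric_limits<double>::infinity();
    std::printf("0.0 * inf  = %g\n", 0.0 * inf);   // prints nan
    std::printf("0.0 * -1.0 = %g\n", 0.0 * -1.0);  // prints -0
    return 0;
}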