efficient index computation using meta programming
Given a multi-dimensional array with shape [A][B][C][D] that is stored as a 1-dimensional array of length A*B*C*D, I want to use template metaprogramming to simplify the index computation. The index (a,b,c,d) should map to position

a*B*C*D + b*C*D + c*D + d

I currently use

#include <iostream>
#include <cstdlib>
#include <array>


// Computes arr[start] * arr[start+1] * ... * arr[AXES-1] recursively.
template<size_t start, size_t AXES>
struct prod_func
{
  constexpr inline size_t operator()(const std::array<const size_t, AXES> arr) const
  {
    return arr[start] * prod_func<start + 1, AXES>()(arr);
  }
};

// Base case: the empty product.
template<size_t AXES>
struct prod_func<AXES, AXES>
{
  constexpr inline size_t operator()(const std::array<const size_t, AXES> arr) const
  {
    return 1;
  }
};


template<int AXES>
class index
{
  const std::array<const size_t, AXES> shapes;

public:

  index(std::array<const size_t, AXES> s) : shapes(s) {}

  // Peel off one index per recursion step: the leading index is scaled
  // by the product of all remaining extents.
  template <typename... Dims>
  constexpr inline size_t operator()(int off, Dims... dims) const {
    return off * (prod_func<AXES - sizeof...(Dims), AXES>()(shapes)) + operator()(dims...);
  }

  // Innermost index has pitch 1.
  constexpr inline size_t operator()(int t) const {
    return t;
  }


};


int main()
{
    size_t A=2, B=3, C=6, D=7;
    auto idx = index<4>({A,B,C,D});

    int a=1, b=1, c=1, d=1;
    std::cin >> a;
    std::cin >> b;
    std::cin >> c;
    std::cin >> d;

    asm ("nop");
    size_t result =  idx(a,b,c,d);
    asm ("nop"); 
    std::cout << result << std::endl;
    asm ("nop"); 
    result = (a*B*C*D + b*C*D + c*D + d);
    asm ("nop");
    std::cout << result << std::endl;

    return 0;

}

The cin calls are just there to ensure run-time values. Inspecting the assembly from g++ -O2 -S ../main.cpp -std=c++11 gives

imull   $105, 8(%rsp), %edx
imull   $35, 12(%rsp), %eax
movl    $_ZSt4cout, %edi
addl    %edx, %eax
movl    16(%rsp), %edx
leal    (%rax,%rdx,8), %esi
subl    %edx, %esi
addl    20(%rsp), %esi

for the a*B*C*D + b*C*D + c*D + d part. This is what I was expecting from the compiler. But for the index class it produces more operations:

movslq  8(%rsp), %rax
movl    $_ZSt4cout, %edi
leaq    (%rax,%rax,2), %rdx
leaq    (%rax,%rdx,4), %rdx
leaq    (%rax,%rdx,8), %rcx
movslq  12(%rsp), %rax
leaq    (%rax,%rax,4), %rdx
leaq    (%rcx,%rdx,8), %rax
subq    %rdx, %rax
movslq  20(%rsp), %rdx
addq    %rdx, %rax
movslq  16(%rsp), %rdx
leaq    (%rax,%rdx,8), %rsi
subq    %rdx, %rsi

and does not apply the optimization B*C*D = 105. Is there any way to get similar assembly? I would like to wrap some CUDA code, so it really needs to be identical code (in C++11). To be clear, only the number of axes is known at compile time. Or is there another way to write this?

edit: Although I am now convinced that it has the same efficiency, I would still like to get the same assembly: https://godbolt.org/g/RHwBV6

Thromboplastic answered 23/7, 2017 at 19:52 Comment(6)
Did you try nested std::arrays? – Kampong
I just want to call it like index<4>({A,B,C,D})(a,b,c,d) or index<4>(A,B,C,D)(a,b,c,d). I'm not sure how nested std::arrays can help. – Thromboplastic
It does get the optimization, but in a different form. For example, the first three lea instructions actually perform a multiply by 105 (they calculate 1+8*(1+4*3) = 105). But if A-D are not compile-time constants, you'll get different assembly anyway. You can simulate it by initializing A-D from volatile variables, like this: volatile size_t vA = 3; A = vA; – Pyrology
Even when using const size_t I do not get the same assembly. – Thromboplastic
I recently did a test of computing the index into a 1-d "carray" at runtime vs. nested std::array, and the array won. But I did not try template metaprogramming – I just used * and +. – Phyllous
Have you tried making the index constructor constexpr? – Upstart
Yes, it is possible to obtain identical assembly (proof). I got there by "computing" the pitches for each dimension in the constructor of the index object and "initializing" a non-static array data member.

template<size_t Nd>
struct Index {
  static_assert(Nd >= 1, "");
  size_t extents_[Nd];
  size_t pitches_[Nd];  // pitches_[d] = product of all extents after dimension d
 public:
  template<class... Ts>
  constexpr Index(size_t e0, Ts... es) noexcept
    : Index{MakeIndSeq<Nd>{}, e0, size_t(es)...}
  {}
 private:
  template<size_t... ds, class... Ts>
  constexpr Index(IndSeq<ds...>, size_t e0, Ts... es) noexcept
    : extents_{e0, es...}
    , pitches_{extents2pitch<ds>(e0, es...)...}
  {}
 public:
  template<class... Ts>
  constexpr size_t operator()(size_t i0, Ts... is) const {
    return operator()(MakeIndSeq<Nd>{}, i0, is...);
  }
 private:
  template<size_t... ds, class... Ts>
  constexpr size_t operator()(IndSeq<ds...>, Ts... is) const {
    return sum((is*pitches_[ds])...);
  }
};

where extents2pitch could look like

template<size_t d, size_t... ds, class... Ts>
constexpr size_t extents2pitch_impl(IndSeq<ds...>, size_t N0, Ts... Ns) {
  return product<size_t>(
    Array<size_t, size_t(1)+sizeof...(Ns)>{N0, Ns...}[sizeof...(Ns)-ds]...
  );
}

template<size_t d, class... Ts>
constexpr size_t extents2pitch(size_t N0, Ts... Ns) {
  return extents2pitch_impl<d>(MakeIndSeq<sizeof...(Ns)-d>{}, N0, Ns...);
}
U answered 11/10, 2017 at 19:29 Comment(3)
BTW: Should I paste all the other boilerplate code (sum, product, IndSeq, ...) from the working example into this answer? – U
Thanks! I don't know if the link expires at some point. I think we need C++14 as soon as possible. I would like to upvote this answer twice. – Thromboplastic
Just a note for everybody who thinks this could be used in a CUDA kernel: nvcc is dumb and uses more registers with this solution compared to the usual nightmare of index computations. Nothing wrong with this answer; on the CPU everything is fine. – Thromboplastic

© 2022 - 2024 — McMap. All rights reserved.