Fast array permutation (generalised tensor transpose) in Armadillo (C++)
Asked Answered
H

1

6

I have a project which involves a lot of permutations on 3D arrays ( arma::Cube<cx_double>). In particular, the required permutation is the interchange of columns by slices. In Matlab this is efficiently calculated by permute(cube,[1,3,2]) and in Python by numpy.transpose(cube,axis=[0,2,1]).

Unfortunately Armadillo doesn't have a permute function on its own. I have tried different approaches but all of them rather slow when compared to Matlab. I would like to know what's the faster way to permute (rather large) Cubes in Armadillo. Profiling the code with gprof, most of the time is spent in the permute functions that I have tried below, whereas in Matlab, for the same ported project, most of the time is spent in SVD or QR matrix decompositions (reshape and permute are fast in matlab).

I would like to understand which is the fastest way to do this permutation in Armadillo and why some approaches work better than others.

Option 1: Raw permutation (Fastest option) (Is there a faster approach?)

Element-wise assignment of input cube to output cube.

template <typename T>
static Cube<T> permute (Cube<T>& cube){

uword D1=cube.n_rows;
uword D2=cube.n_cols;
uword D3=cube.n_slices;

Cube<T> output(D1,D3,D2);

for (uword s = 0; s < D3; ++s){
for (uword c = 0; c < D2; ++c){
for (uword r = 0; r < D1; ++r){
    output.at(r, s, c) = cube.at(r, c, s);
    // output[ D1*D3*c + D1*s+ r ] = cube[ D1*D2*s + D1*c + r ]; 

}
}
}

return output;
}

Option 2: Filling slices (Very slow)

Fill the slices of the output cube by non-contiguous subcube views.

template <typename T>
static Cube<T> permute (Cube<T>& cube_in){

    uword D1 = cube_in.n_rows;
    uword D2 = cube_in.n_cols;
    uword D3 = cube_in.n_slices;

    Cube<T> output; 
    output.zeros(D1, D3, D2);

    for (uword c=0; c<D2; ++c) {
        output.slice(c) = cube_in.subcube( span(0,D1-1),span(c),span(0,D3-1) );
    }

    return output;
}

Option 3: Transposing layers (slower than Raw permutation but comparable)

We can iterate over layers (fixed row) of the input cube and transpose them.

template <typename T>
static Cube<T> permute (Cube<T>& cube_in){
    // in a cube, permute {1,3,2} (permute slices by columns)

    uword D1 = cube_in.n_rows;
    uword D2 = cube_in.n_cols;
    uword D3 = cube_in.n_slices;

    if(D3 > D2){
        cube_in.resize(D1,D3,D3);
    } else if (D2 > D3) {
        cube_in.resize(D1,D2,D2);
    }

    for (uword r=0; r<D1; ++r) {        
        static cmat layer = cmat(cube_in.rows(r,r)); 
        inplace_strans(layer);
        cube_in.rows(r,r)=layer;               
    }

    cube_in.resize(D1,D3,D2);

    return cube_in;
}

Option 4: Look-up table Get non-contiguous access by reading indeces in a vector.

template <typename T>
arma::Cube<T> permuteCS (arma::Cube<T> cube_in){
    // in a cube, permute {1,3,2} (permute slices by columns)
    uword D1 = cube_in.n_rows;
    uword D2 = cube_in.n_cols;
    uword D3 = cube_in.n_slices;

    cx_vec onedcube = cube_in.elem(gen_trans_idx(cube_in));            

    return arma::Cube<T>(onedcube.memptr(), D1, D3, D2, true ) ;
}

where gen_trans_idx is a function that generates the indeces of the permuted cube:

template <typename T>
uvec gen_trans_idx(Cube<T>& cube){

    uword D1 = cube.n_rows;
    uword D2 = cube.n_cols;
    uword D3 = cube.n_slices;

    uvec perm132(D1*D2*D3);    
    uword ii = 0;                
    for (int c = 0; c < D2; ++c){
    for (int s = 0; s < D3; ++s){
    for (int r = 0; r < D1; ++r){
        perm132.at(ii) = sub2ind(size(cube), r, c, s);
        ii=ii+1;
    }}} 

    return perm132;
}

Ideally, one would pre-calculate these look-up tables if the Cube dimensions are determined beforehand.

Option 5 (in-place transposition) Very slow, memory efficient

// Option: In-place transpose
template <typename T>
arma::Cube<T> permuteCS (arma::Cube<T> cube_in, uvec permlist ){    

    T* Qpoint = cube_in.memptr(); // pointer to first element of cube_in

    uvec updateidx = find(permlist - arma::linspace<uvec>(0,cube_in.n_elem-1,cube_in.n_elem)); // index of elements that change position in memory           
    uvec skiplist(updateidx.n_elem,fill::zeros);

    uword rr = 0; // aux index for updatelix
    for(uword jj=0;jj<updateidx.n_elem;++jj){                

        if(any(updateidx[jj] == skiplist)){ // if element jj has already been updated
            // do nothing
        } else {

            uword scope = updateidx[jj];
            T target = *(Qpoint+permlist[scope]);        // store the value of the target element             

            while(any(scope==skiplist)-1){              // while wareyou has not been updated

                T local  = *(Qpoint+scope);                  // store local value                                                                
                *(Qpoint+scope) = target;       
                skiplist[rr]=scope;
                ++rr;            
                uvec wareyou = find(permlist==scope);        // find where the local value will appear                
                scope = wareyou[0];
                target = local;                            
            }

        }
    }
   cube_in.reshape(cube_in.n_rows,cube_in.n_slices,cube_in.n_cols);

    return cub

e_in;
}
Holland answered 14/8, 2018 at 13:9 Comment(9)
I think .t or .st transposes a amadrillo matix. arma.sourceforge.net/docs.htmlGriefstricken
.tor .st are methods of arma::Mat not arma::CubeHolland
Interesting question, but why the rcpp tag?Deucalion
I've been told that RcppArmadillo users deal with Cubes on a constant basis and maybe I can get some insight from their experience.Holland
related question: #35796748Cassidycassie
the answer to the question referred by @Cassidycassie is essentially a raw permutation (Option 1 above). I am wondering about the fastest strategy to do this permutation.Holland
At the first blush you can optimize it by doing Cube<T> output(D1, D3, D2); instead of Cube<T> output; output.zeros(D1, D3, D2);, because you actually don't need to initialize memory. Also you can try to do a hack using memcpy and the fact that all the data is stored contigously (arma.sourceforge.net/docs.html#Cube). Of course, that won't be safe and will be platform dependent, but should be a lot more faster. Another idea is to create your wrapper with extra boolean field for transposed / not transposed, and override methods, like indexingMahogany
Hi @CrafterKolyan, you are right about the initialisation of output (I edited my answer). Regarding the alternative using memcpy, I am not sure how would I implement this option. In option 4, I access the element non-contiguously in a vector and construct a new Cube using a pointer to this vector. Could you maybe develop your comment into an answer? Many thanksHolland
@cacosomoza I will try to think about this only on evening. Also I looked a little bit through your code, and want you to add const reference where possible to prevent copying objects. E.g.: arma::Cube<T> permuteCS (const arma::Cube<T> &cube_in, const uvec &permlist). Also i think you'll need to do your tests again, because this might affect your speed greatly if D1 * D2 * D3 is large. And I have a question: do you always need a new Cube<T> or maybe we can modify Cube<T> in parameter of function and not create a new instance each time?Mahogany
M
5

This code goes as addition to my commentary about memcpy hack. Also don't forget to try to add const reference to prevent copying objects.

template <typename T>
static Cube<T> permute(const Cube<T> &cube){
    const uword D1 = cube.n_rows;
    const uword D2 = cube.n_cols;
    const uword D3 = cube.n_slices;
    const uword D1_mul_D3 = D1 * D3;
    const Cube<T> output(D1, D3, D2);

    const T * from = cube.memptr();
    T *to = output.memptr();

    for (uword s = 0; s < D3; ++s){
        T *to_tmp = to + D1 * s;
        for (uword c = 0; c < D2; ++c){
            memcpy(to_tmp, from, D1 * sizeof(*from));
            from += D1;
            to_tmp += D1_mul_D3;
        }
    }

    return output;
}
Mahogany answered 24/8, 2018 at 21:3 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.