I have a project which involves a lot of permutations on 3D arrays ( arma::Cube<cx_double>
). In particular, the required permutation is the interchange of columns by slices. In Matlab this is efficiently calculated by permute(cube,[1,3,2])
and in Python by numpy.transpose(cube,axis=[0,2,1])
.
Unfortunately Armadillo doesn't have a permute
function on its own. I have tried different approaches but all of them rather slow when compared to Matlab. I would like to know what's the faster way to permute (rather large) Cubes in Armadillo. Profiling the code with gprof
, most of the time is spent in the permute functions that I have tried below, whereas in Matlab, for the same ported project, most of the time is spent in SVD or QR matrix decompositions (reshape and permute are fast in matlab).
I would like to understand which is the fastest way to do this permutation in Armadillo and why some approaches work better than others.
Option 1: Raw permutation (Fastest option) (Is there a faster approach?)
Element-wise assignment of input cube to output cube.
template <typename T>
static Cube<T> permute (Cube<T>& cube){
uword D1=cube.n_rows;
uword D2=cube.n_cols;
uword D3=cube.n_slices;
Cube<T> output(D1,D3,D2);
for (uword s = 0; s < D3; ++s){
for (uword c = 0; c < D2; ++c){
for (uword r = 0; r < D1; ++r){
output.at(r, s, c) = cube.at(r, c, s);
// output[ D1*D3*c + D1*s+ r ] = cube[ D1*D2*s + D1*c + r ];
}
}
}
return output;
}
Option 2: Filling slices (Very slow)
Fill the slices of the output cube by non-contiguous subcube
views.
template <typename T>
static Cube<T> permute (Cube<T>& cube_in){
uword D1 = cube_in.n_rows;
uword D2 = cube_in.n_cols;
uword D3 = cube_in.n_slices;
Cube<T> output;
output.zeros(D1, D3, D2);
for (uword c=0; c<D2; ++c) {
output.slice(c) = cube_in.subcube( span(0,D1-1),span(c),span(0,D3-1) );
}
return output;
}
Option 3: Transposing layers (slower than Raw permutation but comparable)
We can iterate over layers (fixed row) of the input cube and transpose them.
template <typename T>
static Cube<T> permute (Cube<T>& cube_in){
// in a cube, permute {1,3,2} (permute slices by columns)
uword D1 = cube_in.n_rows;
uword D2 = cube_in.n_cols;
uword D3 = cube_in.n_slices;
if(D3 > D2){
cube_in.resize(D1,D3,D3);
} else if (D2 > D3) {
cube_in.resize(D1,D2,D2);
}
for (uword r=0; r<D1; ++r) {
static cmat layer = cmat(cube_in.rows(r,r));
inplace_strans(layer);
cube_in.rows(r,r)=layer;
}
cube_in.resize(D1,D3,D2);
return cube_in;
}
Option 4: Look-up table Get non-contiguous access by reading indeces in a vector.
template <typename T>
arma::Cube<T> permuteCS (arma::Cube<T> cube_in){
// in a cube, permute {1,3,2} (permute slices by columns)
uword D1 = cube_in.n_rows;
uword D2 = cube_in.n_cols;
uword D3 = cube_in.n_slices;
cx_vec onedcube = cube_in.elem(gen_trans_idx(cube_in));
return arma::Cube<T>(onedcube.memptr(), D1, D3, D2, true ) ;
}
where gen_trans_idx
is a function that generates the indeces of the permuted cube:
template <typename T>
uvec gen_trans_idx(Cube<T>& cube){
uword D1 = cube.n_rows;
uword D2 = cube.n_cols;
uword D3 = cube.n_slices;
uvec perm132(D1*D2*D3);
uword ii = 0;
for (int c = 0; c < D2; ++c){
for (int s = 0; s < D3; ++s){
for (int r = 0; r < D1; ++r){
perm132.at(ii) = sub2ind(size(cube), r, c, s);
ii=ii+1;
}}}
return perm132;
}
Ideally, one would pre-calculate these look-up tables if the Cube dimensions are determined beforehand.
Option 5 (in-place transposition) Very slow, memory efficient
// Option: In-place transpose
template <typename T>
arma::Cube<T> permuteCS (arma::Cube<T> cube_in, uvec permlist ){
T* Qpoint = cube_in.memptr(); // pointer to first element of cube_in
uvec updateidx = find(permlist - arma::linspace<uvec>(0,cube_in.n_elem-1,cube_in.n_elem)); // index of elements that change position in memory
uvec skiplist(updateidx.n_elem,fill::zeros);
uword rr = 0; // aux index for updatelix
for(uword jj=0;jj<updateidx.n_elem;++jj){
if(any(updateidx[jj] == skiplist)){ // if element jj has already been updated
// do nothing
} else {
uword scope = updateidx[jj];
T target = *(Qpoint+permlist[scope]); // store the value of the target element
while(any(scope==skiplist)-1){ // while wareyou has not been updated
T local = *(Qpoint+scope); // store local value
*(Qpoint+scope) = target;
skiplist[rr]=scope;
++rr;
uvec wareyou = find(permlist==scope); // find where the local value will appear
scope = wareyou[0];
target = local;
}
}
}
cube_in.reshape(cube_in.n_rows,cube_in.n_slices,cube_in.n_cols);
return cub
e_in;
}
.t
or.st
transposes a amadrillo matix. arma.sourceforge.net/docs.html – Griefstricken.t
or.st
are methods of arma::Mat not arma::Cube – HollandCube<T> output(D1, D3, D2);
instead ofCube<T> output; output.zeros(D1, D3, D2);
, because you actually don't need to initialize memory. Also you can try to do a hack usingmemcpy
and the fact that all the data is stored contigously (arma.sourceforge.net/docs.html#Cube). Of course, that won't be safe and will be platform dependent, but should be a lot more faster. Another idea is to create your wrapper with extra boolean field for transposed / not transposed, and override methods, like indexing – Mahoganyoutput
(I edited my answer). Regarding the alternative usingmemcpy
, I am not sure how would I implement this option. In option 4, I access the element non-contiguously in a vector and construct a new Cube using a pointer to this vector. Could you maybe develop your comment into an answer? Many thanks – Hollandconst reference
where possible to prevent copying objects. E.g.:arma::Cube<T> permuteCS (const arma::Cube<T> &cube_in, const uvec &permlist)
. Also i think you'll need to do your tests again, because this might affect your speed greatly ifD1 * D2 * D3
is large. And I have a question: do you always need anew Cube<T>
or maybe we can modifyCube<T>
in parameter of function and not create a new instance each time? – Mahogany