I was recently asked about a piece of code that decimates/downsamples an array "in-place". This "decimation" function takes an array of ints and, for every even index i, stores the entry at index i into index i/2. It does this for all entries in the array.
This moves every even-indexed entry of the original array into the first half of the array. The rest of the array can then be set to 0. The overall result is an array whose first half holds all the even-indexed entries of the original (moved down) and whose second half is 0. This is apparently used to downsample signals in signal processing.
The code looks something like this:
#include <vector>
using std::vector;

void decimate(vector<int>& a) {
    int sz = a.size();
    // Copy every even-indexed entry down into the first half of the array.
    for (int i = 0; i < sz; i++) {
        if (i % 2 == 0) {
            a[i / 2] = a[i];
        }
    }
    // Zero everything after the last copied entry.
    for (int i = (sz + 1) / 2; i < sz; i++) a[i] = 0;
}
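As a sanity check on the intended semantics, here is a small driver (my own example, assuming the decimate above is visible); the expected output is my reading of the description:

#include <cstdio>

int main() {
    std::vector<int> a = {1, 2, 3, 4, 5, 6};
    decimate(a);                          // expected contents: 1 3 5 0 0 0
    for (int x : a) std::printf("%d ", x);
    std::printf("\n");
}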
Beyond basic improvements such as keeping frequently used variables in registers, I can't find any further way to optimize it, but I'm not sure nothing more can be done.
Are there ways to optimize the memory access pattern of the loop for better cache performance? Or any other ways to speed up the main copy operation that compresses/downsamples the array into the first half (e.g. vectorization on platforms that support it)? The copy loop in question is:
for (int i = 0; i < sz; i++) {
    if (i % 2 == 0) {
        a[i / 2] = a[i];
    }
}
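One restructuring I can think of is to drop the modulo test and walk the even indices directly, which at least makes the stride-2 read and unit-stride write explicit; whether a compiler can actually vectorize this strided copy is part of what I'm unsure about:

for (int i = 0, j = 0; i < sz; i += 2, ++j) {
    a[j] = a[i];   // read from even indices, write contiguously into the first half
}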
Are there any loop transformations (such as tiling/strip-mining) that can lead to highly efficient code for such a decimation loop?
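To make that concrete, here is roughly what I mean by strip-mining the copy loop; the strip length B is a made-up placeholder, and I don't know whether any choice of B would actually help given that the whole array is touched only once either way:

const int B = 1024;                      // hypothetical strip length, chosen arbitrarily
for (int lo = 0; lo < sz; lo += 2 * B) {
    int hi = (lo + 2 * B < sz) ? lo + 2 * B : sz;
    // Process one strip of even source indices in [lo, hi).
    for (int i = lo; i < hi; i += 2) {
        a[i / 2] = a[i];
    }
}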
EDIT: A few of the answers below suggest taking advantage of memset/fill or pointer arithmetic to gain speed. This question is mainly focused on whether there are well-defined loop transformations that can significantly improve locality or reduce cache misses (e.g. if this were a loop nest with two loops, one could look into loop tiling to reduce cache misses).