I am quite new to GPU programming, but since I have a computationally intensive task I have turned to the GPU for possible performance gains.
I tried rewriting my program with ArrayFire Free version. It is indeed faster than my CPU routine with multi-threading enabled, but not to the degree I expected (that is, < 100% speedup), and the returned results are not quite right (< 1% error compared to CPU routine, assuming the CPU routine's results are correct).
My task is mainly element-wise float-32 maths operations on large matrices (300MB-500MB size), with little if-thens/switch-cases etc. I guess the performance bottleneck is likely the bandwidth between CPU and GPU memory since there is a lot of data-reading, etc. The GPU I tested is a GeForce 580GTX with 3GB of video memory.
Is there still some significant room for optimization if I write raw CUDA code (with CUBLAS etc. and average optimization) instead of using ArrayFire for my task? I read some NVIDIA optimization guides; it seems that there is some memory-access tricks there for faster data-access and reducing bank-conflicts. Does ArrayFire use these general tricks automatically or not?