CPU/Intel OpenCL performance issues, implementation questions

Asked 15/11, 2012 at 19:33 Answered 26/8, 2015 at 22:38

Solved opencl cpu intel vectorization hyperthreading

I have had some questions hanging in the air without an answer for a few days now. They arose because I have an OpenMP and an OpenCL implementation of the same problem. The OpenCL version runs perfectly on a GPU but delivers about 50% less performance when run on a CPU (compared to the OpenMP implementation). Another post already deals with the difference between OpenMP and OpenCL performance, but it doesn't answer my questions. At the moment I face these questions:

1) Is it really that important to have a "vectorized kernel" (in terms of the Intel Offline Compiler)?

There is a similar post, but I think my question is more general.

As I understand it, a kernel that is not "vectorized" does not necessarily lack vector/SIMD instructions in the compiled binary. I checked the assembly code of my kernels, and there are plenty of SIMD instructions. A vectorized kernel means that, by using SIMD instructions, 4 (SSE) or 8 (AVX) OpenCL "logical" threads can be executed in one CPU thread. This can only be achieved if ALL your data is stored consecutively in memory. But who has such perfectly sorted data?

So my question would be: Is it really that important to have your kernel "vectorized" in this sense?

Of course it improves performance, but if most of the computation-intensive parts of the kernel are already done by vector instructions, then you might get close to the "optimal" performance anyway. I think the answer to my question lies in memory bandwidth: vector registers are probably a better fit for efficient memory access. In that case the kernel arguments (pointers) have to be vectorized.

2) If I allocate data in local memory on a CPU, where will it be allocated? OpenCL reports the L1 cache as local memory, but it is clearly not the same type of memory as a GPU's local memory. If it's stored in RAM/global memory, then there is no sense in copying data into it. If it were in cache, some other process might flush it out... so that doesn't make sense either.

3) How are "logical" OpenCL threads mapped to real CPU software/hardware (Intel HTT) threads? If I have short-running kernels and the kernels are forked as in TBB (Threading Building Blocks) or OpenMP, then the fork overhead will dominate.

4) What is the thread fork overhead? Are new CPU threads forked for every "logical" OpenCL thread, or are the CPU threads forked once and reused for multiple "logical" OpenCL threads?

I hope I'm not the only one who is interested in these small details, and that some of you might know bits of these problems. Thank you in advance!

UPDATE

3) At the moment the OpenCL overhead is more significant than that of OpenMP, so heavy kernels are required for efficient runtime execution. In Intel OpenCL a work-group is mapped to a TBB thread, so 1 virtual CPU core executes a whole work-group (or thread block). A work-group is implemented with 3 nested for loops, where the innermost loop is vectorized, if possible. So you could imagine it as something like:

#pragma omp parallel for
for(wg=0; wg < get_num_groups(2)*get_num_groups(1)*get_num_groups(0); wg++) {

  for(k=0; k<get_local_size(2); k++) {
    for(j=0; j<get_local_size(1); j++) {
      #pragma simd
      for(i=0; i<get_local_size(0); i++) {
        ... work-load...
      }
    }
  }
}

If the innermost loop can be vectorized, it advances in steps of the SIMD width:

for(i=0; i<get_local_size(0); i+=SIMD) {
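To make the loop structure above concrete, here is a rough, self-contained plain-C emulation of a 1-D NDRange. It is only a sketch under assumptions: the names `run_ndrange_1d`, `saxpy_item` and `SIMD_WIDTH` are made up for illustration, the inner "lane" loop stands in for a single SIMD instruction, and the scalar remainder loop is a guess at how a leftover of fewer than `SIMD_WIDTH` work-items might be handled. It does not claim to reproduce what the Intel runtime actually generates.

```c
#include <assert.h>
#include <stddef.h>

#define SIMD_WIDTH 4  /* assumed lane count, e.g. SSE with 32-bit floats */

/* Hypothetical kernel body for one work-item: y[gid] += a * x[gid]. */
static void saxpy_item(size_t gid, float a, const float *x, float *y)
{
    y[gid] += a * x[gid];
}

/* Emulates a 1-D NDRange in plain C. The outer loop walks the work-groups
 * (in the real runtime each one would go to a worker thread); the inner
 * loop walks the local ids in steps of SIMD_WIDTH.
 * Returns the number of full SIMD steps that were executed. */
size_t run_ndrange_1d(size_t num_groups, size_t local_size,
                      float a, const float *x, float *y)
{
    size_t simd_steps = 0;
    assert(local_size > 0);

    for (size_t wg = 0; wg < num_groups; wg++) {
        size_t base = wg * local_size;  /* global id of the group's first item */
        size_t i = 0;

        /* Full steps: SIMD_WIDTH work-items are processed together. */
        for (; i + SIMD_WIDTH <= local_size; i += SIMD_WIDTH) {
            for (size_t lane = 0; lane < SIMD_WIDTH; lane++)
                saxpy_item(base + i + lane, a, x, y);
            simd_steps++;
        }

        /* Assumed scalar remainder when local_size is not a multiple
         * of SIMD_WIDTH. */
        for (; i < local_size; i++)
            saxpy_item(base + i, a, x, y);
    }
    return simd_steps;
}
```

With 2 work-groups of 10 work-items each, every group takes 2 full SIMD steps (8 items) plus 2 scalar iterations, which shows why a local size that is a multiple of the SIMD width keeps all lanes busy.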

4) Every TBB thread is forked once during the OpenCL execution and then reused. Every TBB thread is tied to a virtual core, i.e. there is no thread migration during the computation.

I also accept @natchouf's answer.

Falda answered 15/11, 2012 at 19:33 Comment(1)

Unlike with most GPUs, CPU caches are not directly addressable. Data is always stored in main memory and if it fits in the cache, it stays in the cache as long as possible. – Celebrated 16/11, 2012 at 10:56

I may have a few hints for your questions. In my limited experience, a good OpenCL implementation tuned for the CPU can't beat a good OpenMP implementation. If it does, you could probably improve the OpenMP code to beat the OpenCL one.

1) It is very important to have vectorized kernels. This is linked to your questions 3 and 4. If you have a kernel that handles 4 or 8 input values, you'll have far fewer work-items (threads), and hence much less overhead. I recommend using the vector types and instructions provided by OpenCL (like float4, float8, float16) instead of relying on auto-vectorization. Do not hesitate to use float16 (or double16): a float16 will be mapped to 4 SSE or 2 AVX vectors (a double16 to twice as many) and will divide the number of work-items required by 16 (which is good for the CPU, but not always for the GPU: I use 2 different kernels for CPU and GPU).
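As a rough illustration of the work-item arithmetic, here is a small plain-C sketch. The names `work_items_needed` and `scale16_item` are hypothetical and not part of OpenCL; the second function is only a stand-in for what a float16 kernel body does (in OpenCL C the inner loop would be a single float16 operation), and the bounds check on the last work-item is an assumption about how a partial tail might be handled.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-items needed when each one handles `width` consecutive
 * elements (width 1 = scalar kernel, width 16 = float16 kernel).
 * Rounds up so that a partial tail still gets a work-item. */
size_t work_items_needed(size_t n, size_t width)
{
    assert(width > 0);
    return (n + width - 1) / width;
}

/* Plain-C stand-in for the body of a float16 kernel: the work-item with
 * id `gid` scales 16 consecutive floats by `a`. The bounds check covers
 * the case where n is not a multiple of 16. */
void scale16_item(size_t gid, float a, float *data, size_t n)
{
    size_t start = gid * 16;
    for (size_t k = 0; k < 16 && start + k < n; k++)
        data[start + k] *= a;
}
```

For 1024 elements, `work_items_needed(1024, 1)` gives 1024 work-items while `work_items_needed(1024, 16)` gives 64, which is the 16-fold reduction in scheduling overhead mentioned above.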

2) Local memory on a CPU is just RAM. Don't use it in a CPU kernel.

3 and 4) I don't really know, it will depend on the implementation, but the fork overhead seems important to me.

Crain answered 16/11, 2012 at 11:16 Comment(3)

First of all, thank you for the answers, they are a light in the total darkness. Based on your first two answers, I need to totally rewrite the kernels and see if they work well on both CPU and GPU. I'll let you know when I'm done with that. – Falda 16/11, 2012 at 11:45

About #2: I read a paper a while ago where they were forcing local memory into the cache by making all global requests fully commit or read without caching. Could be an interesting concept, but as far as Intel and AMD currently go, you are correct that they just slam all the data into RAM. – Manolete 17/11, 2012 at 0:58

@Falda You will need two separate kernels, one for GPU and one for CPU, and then decide at run-time which device you are running on. GPU optimizations are often bad for the CPU and vice versa. – Manolete 17/11, 2012 at 0:59

For question 3:

Intel groups logical OpenCL threads into one hardware thread, and the group size can be 4, 8, or 16. A logical OpenCL thread maps to one SIMD lane of an execution unit. One execution unit has two SIMD engines with a width of 4. Please refer to the following document for further details: https://software.intel.com/sites/default/files/Faster-Better-Pixels-on-the-Go-and-in-the-Cloud-with-OpenCL-on-Intel-Architecture.pdf

Nolan answered 26/8, 2015 at 22:38 Comment(0)
