Should I look into PTX to optimize my kernel? If so, how?
Do you recommend reading your kernel's PTX code to find out how to optimize your kernels further?

One example: I read that one can tell from the PTX code whether automatic loop unrolling worked. If it did not, one would have to unroll the loops manually in the kernel code.

  • Are there other use-cases for the PTX code?
  • Do you look into your PTX code?
  • Where can I find out how to read the PTX code CUDA generates for my kernels?
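
To make the loop-unrolling example above concrete, here is a minimal sketch (the kernel and file names are made up for illustration). Compiling with `nvcc -ptx` produces the intermediate code you would inspect:

```cuda
// scale4.cu -- hypothetical kernel; each thread processes 4 elements.
__global__ void scale4(float *out, const float *in, float a)
{
    int base = blockIdx.x * blockDim.x * 4 + threadIdx.x;

    #pragma unroll   // ask the compiler to unroll the 4-iteration loop
    for (int i = 0; i < 4; ++i)
        out[base + i * blockDim.x] = a * in[base + i * blockDim.x];
}

// Generate the PTX with:   nvcc -ptx scale4.cu -o scale4.ptx
// If unrolling worked, the PTX body contains four straight-line
// ld.global.f32 / st.global.f32 pairs and no backward branch (bra)
// to a loop header.
```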
Reeder answered 10/11, 2011 at 14:22 Comment(1)
You can disassemble the binary code. IMO you should look, to avoid falling into "I thought it would optimize that" type of traps, and to see what you're actually doing when you're optimizing a kernel. Serenata

The first point to make about PTX is that it is only an intermediate representation of the code run on the GPU: a virtual machine assembly language. PTX is assembled into target machine code either by ptxas at compile time or by the driver at runtime. So when you look at PTX, you are seeing what the compiler emitted, but not what the GPU will actually run. It is also possible to write your own PTX code, either from scratch (this is the only JIT compilation model supported in CUDA) or in inline-assembler sections within CUDA C code (the latter officially supported since CUDA 4.0, but "unofficially" supported for much longer than that). CUDA has always shipped with a complete, fully documented guide to the PTX language as part of the toolkit. The Ocelot project has used this documentation to implement its own PTX cross compiler, which allows CUDA code to run natively on other hardware, initially x86 processors, but more recently AMD GPUs.
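
As a small sketch of the inline-assembler route mentioned above (the function name is illustrative): an `asm` statement can read the `%laneid` special register, which has no direct CUDA C equivalent.

```cuda
// Returns this thread's lane index within its warp (0-31) by reading
// the %laneid special register through inline PTX.
__device__ unsigned int lane_id(void)
{
    unsigned int id;
    asm("mov.u32 %0, %%laneid;" : "=r"(id));
    return id;
}
```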

If you want to see what the GPU is actually running (as opposed to what the compiler is emitting), NVIDIA now supplies a binary disassembler tool called cuobjdump which can show the actual machine code segments in code compiled for Fermi GPUs. There was an older, unofficial tool called decuda which worked for G80 and G90 GPUs.

Having said that, there is a lot to be learned from PTX output, particularly about how the compiler applies optimizations and which instructions it emits to implement certain C constructs. Every version of the NVIDIA CUDA toolkit comes with a guide to nvcc and documentation for the PTX language. There is plenty of information in both documents both to learn how to compile CUDA C/C++ kernel code to PTX and to understand what the PTX instructions will do.

Fed answered 10/11, 2011 at 15:12 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.