I am trying to establish two overall measurements, memory bandwidth utilization and compute throughput utilization, for my GPU-accelerated application using the CUDA Nsight profiler on Ubuntu. The application runs on a Tesla K20c GPU.
The two measurements I want are to some extent comparable to the ones given in this graph:
The problems are that no exact numbers are given there and, more importantly, that I do not know how these percentages are calculated.
Memory Bandwidth Utilization
The Profiler tells me that my GPU has a Max Global Memory Bandwidth of 208 GB/s.
Does this refer to the Device Memory BW or the Global Memory BW? It says Global, but the first one makes more sense to me.
For my kernel the profiler tells me that the Device Memory Bandwidth is 98.069 GB/s.
Assuming that the max of 208 GB/s refers to the Device Memory, could I then simply calculate the Memory BW Utilization as 98.069/208 ≈ 47%? Note that this kernel is executed multiple times without additional CPU-GPU data transfers, so the system BW is not relevant here.
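For what it's worth, this is roughly how I would sanity-check the profiler's bandwidth number myself. The kernel, array size, and byte counts below are placeholders for my actual code; only the 208 GB/s peak comes from the profiler:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: my real kernel streams through two float arrays.
__global__ void myKernel(const float* in, float* out, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    const size_t n = 1 << 26;         // placeholder element count
    const double peakBW = 208.0e9;    // peak device memory BW reported by the profiler (B/s)

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Bytes actually moved: one read + one write per element for this kernel.
    double bytes      = 2.0 * n * sizeof(float);
    double achievedBW = bytes / (ms / 1000.0);   // B/s
    printf("Achieved BW: %.3f GB/s, utilization: %.1f %%\n",
           achievedBW / 1e9, 100.0 * achievedBW / peakBW);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```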
Compute Throughput Utilization
I am not exactly sure what the best way is to put Compute Throughput Utilization into a number. My best guess is to use the ratio of instructions per cycle (IPC) to the maximum instructions per cycle. The profiler tells me that the max IPC is 7 (see picture above).
First of all, what does that actually mean? Each multiprocessor has 192 cores, which is enough for 6 warps (192/32) to execute at once. Wouldn't that mean that the max IPC should be 6?
The profiler tells me that my kernel has an issued IPC of 1.144 and an executed IPC of 0.907. Should I calculate the compute utilization as 1.144/7 ≈ 16%, as 0.907/7 ≈ 13%, or neither of these?
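To make the two alternatives I am considering explicit, this is the arithmetic; the 7 is the max IPC reported by the profiler, and the other numbers are the ones it reported for my kernel:

```cpp
#include <cstdio>

int main() {
    // Numbers reported by the profiler for my kernel.
    const double maxIPC      = 7.0;    // "max IPC" shown by the profiler
    const double issuedIPC   = 1.144;
    const double executedIPC = 0.907;

    // The two candidate definitions of compute throughput utilization.
    printf("issued   IPC / max IPC = %.1f %%\n", 100.0 * issuedIPC   / maxIPC);
    printf("executed IPC / max IPC = %.1f %%\n", 100.0 * executedIPC / maxIPC);
    return 0;
}
```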
Are these two measurements (memory and compute utilization) an adequate first impression of how efficiently my kernel is using the resources, or are there other important metrics that should be included?
Additional Graph