CUDA Profiler: Calculate memory and compute utilization
I am trying to establish two overall measurements, memory bandwidth utilization and compute throughput utilization, for my GPU-accelerated application using the CUDA Nsight profiler on Ubuntu. The application runs on a Tesla K20c GPU.

The two measurements I want are to some extent comparable to the ones given in this graph: Figure 1

The problems are that no exact numbers are given here and, more importantly, that I do not know how these percentages are calculated.

Memory Bandwidth Utilization

The Profiler tells me that my GPU has a Max Global Memory Bandwidth of 208 GB/s.

[Profiler screenshot: device properties showing Max Global Memory Bandwidth (208 GB/s) and max IPC (7)]

Does this refer to the Device Memory BW or the Global Memory BW? It says Global, but the first one makes more sense to me.

For my kernel the profiler tells me that the Device Memory Bandwidth is 98.069 GB/s.

[Profiler screenshot: kernel memory bandwidth analysis showing Device Memory Bandwidth of 98.069 GB/s]

Assuming that the max of 208 GB/s refers to the device memory, could I then simply calculate the Memory BW Utilization as 98.069/208 ≈ 47%? Note that this kernel is executed multiple times without additional CPU-GPU data transfers. The system BW is therefore not important.
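As a sanity check on the 208 GB/s figure, the theoretical peak device-memory bandwidth can be recomputed from the device properties the CUDA runtime reports (effective DDR data rate times bus width). A minimal sketch, assuming device 0 and hard-coding the achieved figure from the profiler output above:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // error checking omitted for brevity

    // Peak bandwidth in GB/s: memory clock (kHz) * 2 (DDR) * bus width (bytes).
    // For a Tesla K20c (2600 MHz memory clock, 320-bit bus) this gives 208 GB/s.
    double peakGBs = 2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8.0) / 1.0e6;

    // Achieved device-memory bandwidth as reported by the profiler for the kernel.
    double achievedGBs = 98.069;

    printf("peak: %.1f GB/s, utilization: %.1f%%\n",
           peakGBs, 100.0 * achievedGBs / peakGBs);
    return 0;
}
```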

Compute Throughput Utilization

I am not exactly sure of the best way to put Compute Throughput Utilization into a number. My best guess is to use the ratio of instructions per cycle (IPC) to the maximum instructions per cycle. The profiler tells me that the max IPC is 7 (see picture above).

First of all, what does that actually mean? Each multiprocessor has 192 cores, so at most 192/32 = 6 warps can execute simultaneously. Wouldn't that mean that the max IPC should be 6?

The profiler tells me that my kernel has issued IPC = 1.144 and executed IPC = 0.907. Should I calculate the compute utilization as 1.144/7 = 16%, or 0.907/7 = 13%, or neither of these?
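For completeness, the raw arithmetic for the two candidate ratios (a trivial host-side check using the numbers above):

```
#include <cstdio>

int main() {
    // Values taken from the profiler output above; max IPC = 7 for this device.
    const double maxIPC = 7.0;
    printf("issued IPC ratio:   %.1f%%\n", 100.0 * 1.144 / maxIPC);  // ~16.3%
    printf("executed IPC ratio: %.1f%%\n", 100.0 * 0.907 / maxIPC);  // ~13.0%
    return 0;
}
```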

Do these two measurements (memory and compute utilization) give an adequate first impression of how efficiently my kernel is using the resources? Or are there other important metrics that should be included?

Additional Graph

[Profiler screenshot: function-unit utilization chart showing the Load/Store Units with the highest utilization; referenced in the comments as "graph four"]

Delubrum asked 25/1, 2015 at 16:16

Comment: Global Memory Bandwidth is the theoretical bandwidth to device memory (memory on the GPU board). — Hippel
NOTE: I will try to update this answer with additional details in the future. I do not think all of the individual components of these calculations are easily visible in the Visual Profiler reports.

Compute Utilization

This is the pipeline utilization of the logical pipes: memory, control flow, and arithmetic. The SMs have numerous execution pipes that are not documented. If you look at the instruction throughput charts you can determine at a high level how utilization is calculated. You can read the Kepler or Maxwell architecture documents for more information on the pipelines. A CUDA core is a marketing term for an integer/single-precision floating-point math pipeline.

This calculation is not based upon IPC. It is based upon pipeline utilization and issue cycles. For example, you can be at 100% utilization if you issue 1 instruction/cycle (never dual-issuing). You can also be at 100% if you issue a double-precision instruction at the maximum rate (which depends on the GPU).
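To make the issue-cycle idea concrete, here is a hedged sketch of how a per-pipe utilization could be computed from raw counters. The counter names and values are hypothetical stand-ins, not actual profiler metric names:

```
// Illustrative sketch only: per-pipe utilization from issue counts, not IPC.
// All counter names and values below are made up for the example.
#include <cstdio>

int main() {
    // Warp instructions issued to each logical pipe over the kernel run.
    double issuedLSU = 4.0e6, issuedALU = 9.0e6;  // hypothetical counts
    double elapsedCycles = 1.0e7;                  // hypothetical SM cycles

    // Peak issue rates per SM cycle (Kepler: 1 LSU warp instruction, up to
    // 6 ALU warp instructions per cycle, per the comments below).
    double lsuUtil = issuedLSU / (elapsedCycles * 1.0);
    double aluUtil = issuedALU / (elapsedCycles * 6.0);

    // The reported compute figure is the utilization of the busiest pipe.
    printf("LSU: %.0f%%  ALU: %.0f%%  compute: %.0f%%\n",
           100 * lsuUtil, 100 * aluUtil,
           100 * (lsuUtil > aluUtil ? lsuUtil : aluUtil));
    return 0;
}
```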

Memory Bandwidth Utilization

The profiler calculates the utilization of L1, TEX, L2, and device memory, and shows the highest value. It is very possible to have very high data-path utilization but very low bandwidth utilization.
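A minimal sketch of that "highest value is shown" rule, with made-up utilization numbers for each level of the memory hierarchy:

```
// Illustrative only: the reported memory utilization is the max over the
// individual data paths (all numbers here are made up).
#include <algorithm>
#include <cstdio>

int main() {
    double l1 = 0.22, tex = 0.05, l2 = 0.31, dram = 0.47;  // hypothetical
    double memUtil = std::max({l1, tex, l2, dram});
    printf("memory utilization shown: %.0f%%\n", 100 * memUtil);  // 47%
    return 0;
}
```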

A measure of memory-latency boundness should also be calculated. It is very easy to have a program that is bound by memory latency but not bound by compute utilization or memory bandwidth.

Hippel answered 25/1, 2015 at 21:06

Comments:

By Kepler architecture documents, do you basically mean this one: Kepler Compute Architecture White Paper? — Delubrum

By instruction throughput charts, do you mean ones like the one I posted above at the end (edited)? — Delubrum

In graph four the Load/Store Units have the highest utilization, but in graph one memory operations have a smaller share than arithmetic operations. Why is that? — Delubrum

In the top graph I believe the value is normalized by issue slots or cycles, and in the bottom it is normalized by the throughput of the execution path. For example, a Kepler SM can only issue 1 LSU warp instruction per cycle but up to 6 ALU warp instructions per cycle (using all 4 warp schedulers). — Hippel

So compute utilization is given by the utilization of the compute subsystem (load/store, arithmetic, and logic) with the highest utilization? — Delubrum
