Amdahl's law and GPU
I have a couple of doubts regarding the application of Amdahl's law to GPUs. For instance, suppose I have a kernel that I launch with a number of threads, say N. So, in Amdahl's law, will the number of processors be N? Also, for any CUDA program using a large number of threads, is it safe for me to assume that Amdahl's law reduces to 1/(1-p), where p stands for the parallel fraction of the code? Thanks

Winthorpe answered 13/9, 2012 at 3:14 Comment(14)
By your rationale, the code could have nearly infinite speed-up on a GPU, which is certainly not true. If you run many more threads than there are actual ALUs on the GPU, then after some N the device becomes saturated, and doubling the number of threads will roughly double the time required to process them. In my opinion, even on "classical" CPU clusters Amdahl's law is not applicable due to complex synchronization schemes etc. It shows the general idea of parallelisation, but is not suitable for actual performance estimation.Cantilena
I agree with the saturation part. The curve in Amdahl's law shows saturation once the number of processors hits around 2000. But isn't there a modification of Amdahl's law for heterogeneous computing systems like GPUs with which programmers can calculate theoretical speed-ups?Winthorpe
Actual speed-up depends on too many variables to be estimated adequately. E.g. the type of memory used plays a very significant role on a GPU, and "doing it wrong" can cause performance to drop 30 times (on a CPU, being cache-aware can grant some speed-up too, but not that much), even though the "parallel" and "serial" parts of the code remain essentially the same. IMO, the best way to estimate is to look through results achieved by similar projects (e.g. on nvidia's website)Cantilena
The thrust of A's law is that the speedup due to parallelism is limited by the portion of the code that is not parallel. That is true no matter what extra synchronization and constants you are talking about, and applies to GPUs as much as any parallel system.Clam
so, my question is the following: with respect to GPUs, and especially for a program running many threads, does A's law reduce to 1/(1-p), or is there a modification of the law I might have overlooked, tailor-made for heterogeneous systems?Winthorpe
@AnirudhKaushik A's law reduces to 1/(1-p) only if your parallel processor has infinite performance, which is not true. Basically, your approximation means "the parallel part of my code runs infinitely fast and takes no time, so we consider only the serial part".Cantilena
@Cantilena hmm.. I understand that now. But I would like to know another aspect of Amdahl's law. The variable N, denoting the number of processors, will be equal to the number of threads I launch, right?Winthorpe
@AnirudhKaushik From the formulation of A's law, N should be the number of physical cores on the GPU. However, due to high memory latency, each core can easily handle several threads, switching from one to another as they wait for data to arrive. However, there is some value of the threads/core ratio (which depends on the exact algorithm) at which a core becomes saturated (i.e. memory latency is no longer enough to hide multiple threads). So you should determine when your GPU becomes saturated (i.e. when execution time significantly increases with an increasing number of threads) and use this "effective core number".Cantilena
@Clam Well, ideologically this is true, but in reality the separation of parallel and serial parts is trickier. E.g. is MPI_Bcast (which scales as O(log N)) considered serial or parallel?Cantilena
Fair point. You're right that in heterogeneous systems things get a lot more complicated. And if you are talking about MPI, you are probably thinking about clusters: BIG heterogeneous systems.Clam
@Cantilena do you want to post an answer to this question?Clam
@Cantilena In regard to core number: as I understand it, in the TESLA architecture the number of cores residing on an SM is 8, and each SM can process up to 1024 threads simultaneously. So if my program launches a kernel with 1024 threads, can I say that the number of cores used is 8?Winthorpe
@AnirudhKaushik Not exactly. Even if your kernel runs on one SM (which might not be possible due to register/shmem limitations or the scheduler's whim), one core can execute more than one thread. For example, say there are 2 threads assigned to the core, T1 and T2. T1 does some arithmetic and starts reading from global memory. While T1 waits for its data to load, the core can execute T2, which also does some computation and starts a memory operation. By this time, T1 has got its data, processes it, and starts writing it to global memory. And so on. Usually memory latency can hide quite a lot of parallel threads.Cantilena
@Clam I DID IT! We can continue the discussion of Amdahl's law in comments to my answer if you wish :)Cantilena

For instance, suppose I have a kernel that I launch with a number of threads, say N. So, in Amdahl's law, will the number of processors be N?

Not exactly. A GPU does not have as many physical cores (K) as the number of threads you can launch (N) (usually, K is around 10^3, while N is in the range 10^4 -- 10^6). However, a significant portion of kernel time is (usually) spent just waiting for data to be read/written from/to global memory, so one core can seamlessly handle several threads. This way the device can handle up to N0 threads without them interfering with each other, where N0 is usually several times bigger than K, but actually depends upon your kernel function.

In my opinion, the best way to determine this N0 is to experimentally measure the performance of your application and then use this data to fit the parameters of Amdahl's law :)
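As a sketch of that fitting approach (the timing data below is made up for illustration, and the plain grid search is just one simple way to fit the model):

```python
# Fit the parallel fraction p of Amdahl's law,
#   T(n) = T1 * ((1 - p) + p / n),
# to measured wall times with a simple grid search over p.

def amdahl_time(t1, p, n):
    """Predicted runtime on n 'effective cores'."""
    return t1 * ((1 - p) + p / n)

def fit_parallel_fraction(measurements):
    """measurements: list of (n_threads, seconds) pairs, including n=1.
    Returns the p in [0, 1] minimizing the squared error."""
    t1 = dict(measurements)[1]          # serial baseline T1
    best_p, best_err = 0.0, float("inf")
    for i in range(1001):               # try p = 0.000 .. 1.000
        p = i / 1000
        err = sum((amdahl_time(t1, p, n) - t) ** 2 for n, t in measurements)
        if err < best_err:
            best_p, best_err = p, err
    return best_p

# Hypothetical timings (seconds) at various thread counts:
data = [(1, 10.0), (2, 5.5), (4, 3.25), (8, 2.125)]
print(fit_parallel_fraction(data))  # these synthetic timings follow p = 0.9
```

Once p stops improving the fit as you add measurements at higher thread counts, you have likely crossed the saturation point N0 discussed above.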

Also, for any CUDA program using a large number of threads, is it safe for me to assume that Amdahl's law reduces to 1/(1-p), where p stands for the parallel fraction of the code?

This assumption basically means that you neglect the time for the parallel part of your code (i.e. it executes infinitely fast) and consider only the time for the serial part.

E.g. if you compute the sum of two 100-element vectors on a GPU, then device initialization, data copying, kernel launch overhead, etc. (the serial part) take much more time than the kernel execution (the parallel part). In general, though, this assumption does not hold.

Also, an individual GPU core does not have the same performance as a CPU core, so you should do some scaling, making Amdahl's law 1 / [(1-p) + k*p/N] (at its simplest, k = Frequency(CPU) / Frequency(GPU); sometimes k is increased even more to take into account architectural differences, like a CPU core having a SIMD block).
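Plugging some illustrative numbers into that scaled form (the values of p, k, and N below are assumptions, not measurements) shows both the slower climb caused by k and the hard ceiling set by the serial part:

```python
# Speedup predicted by the scaled form of Amdahl's law:
#   S(N) = 1 / ((1 - p) + k * p / N)
# where k penalizes the parallel part for the slower GPU cores.

def scaled_speedup(p, k, n):
    return 1.0 / ((1 - p) + k * p / n)

p = 0.95   # parallel fraction (assumed)
k = 3.0    # a CPU core is ~3x faster than one GPU core (assumed)
for n in (8, 64, 512, 4096):
    print(n, round(scaled_speedup(p, k, n), 2))

# Regardless of k and N, the speedup is capped by the serial part:
print(round(1 / (1 - p), 2))  # 20.0
```

Note that k changes how quickly the curve approaches the asymptote, but the limit 1/(1-p) itself depends only on the serial fraction.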


I could also argue against literally applying Amdahl's law to real systems. Sure, it shows the general trend, but it does not capture some non-trivial effects.

First, Amdahl's law assumes that, given an infinite number of cores, the parallel part executes instantly. This assumption is not true (though sometimes it might be pretty accurate). Even if you calculate the sum of two vectors, you can't compute it faster than it takes to add two bytes. One can neglect this "quantum", or include it in the serial portion of the algorithm, but that somewhat "breaks" the idea.

How to correctly account in Amdahl's law for the effect of barrier synchronization, critical sections, atomic operations, etc. is, to the best of my knowledge, an unresolved mystery. Such operations belong to the parallel part, but the wall time of their execution is at best independent of the number of threads and at worst grows with it.

Simple example: broadcast time between computational nodes in a CPU cluster scales as O(log N). Some initialization can take up to O(N) time.
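A toy model of that effect (the coefficients below are arbitrary assumptions, chosen only to make the shape visible): adding an O(log N) communication term to Amdahl's denominator makes the speedup curve peak and then decline, unlike the monotone textbook curve.

```python
import math

# Amdahl's law with an extra O(log N) broadcast term:
#   T(N)/T(1) = (1 - p) + p / N + c * log2(N)
# so the speedup is S(N) = 1 / ((1 - p) + p / N + c * log2(N)).

def speedup_with_bcast(p, c, n):
    return 1.0 / ((1 - p) + p / n + c * math.log2(n))

p, c = 0.99, 0.002   # parallel fraction and per-doubling broadcast cost (assumed)
for n in (2, 16, 128, 1024, 8192):
    print(n, round(speedup_with_bcast(p, c, n), 1))

# Past some node count, the broadcast cost outweighs the benefit of
# more processors and the speedup starts to fall.
assert speedup_with_bcast(p, c, 8192) < speedup_with_bcast(p, c, 1024)
```

This is exactly why classifying something like MPI_Bcast as "serial" or "parallel" is awkward: it belongs to neither bucket of the original law.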

In simple cases one can roughly estimate the benefit of parallelising an algorithm, but (as is often the case with CUDA) the static overhead of using parallel processing might take more time than the parallel processing itself saves.

So, in my opinion, it is usually simpler to write the application, measure its performance, and use the measurements to plot the Amdahl curve than to try to correctly estimate a priori all the nuances of the algorithm and hardware. In cases where such estimations can easily be made, they are usually obvious without any "laws".

Cantilena answered 14/9, 2012 at 2:11 Comment(0)

Amdahl's law actually states that the speedup is less than or equal to that expression. So it is the theoretical maximum, and the actual speedup will always be less.

Joceline answered 24/7, 2023 at 11:47 Comment(0)
