I have a couple of doubts regarding the application of Amdahl's law with respect to GPUs. For instance, I have a kernel code that I have launched with a number of threads, say N. So,in the amdahl's law the number of processors will be N right? Also, for any CUDA programming using a large number of threads, is it safe for me to assume that the Amdahl's law is reduced to 1/(1-p) wherein p stands for the parallel code? Thanks
For instance, I have a kernel code that I have launched with a number of threads, say N. So,in the amdahl's law the number of processors will be N right?
Not exactly. GPU does not have as many physical cores (K) as the number of threads you can launch (N) (usually, K is around 103, N is in range 104 -- 106). However, significant portion of kernel time is (usually) spend just waiting for the data to be read/written from/to global memory, so one core can seamlessly handle several threads. This way the device can handle up to N0 threads without them interfering with each other, where N0 is usually several times bigger than K, but actually depends upon you kernel function.
In my opinion, the best way to determine this N0 is to experimentally measure performance of your application and then use this data to fit parameters of Amdahl's law :)
Also, for any CUDA programming using a large number of threads, is it safe for me to assume that the Amdahl's law is reduced to 1/(1-p) wherein p stands for the parallel code?
This assumption basically means that you neglect the time for the parallel part of your code (it is executed infinitely fast) and only consider time for serial part.
E.g. if you compute the sum of two 100-element vectors on GPU, then initializing of device, data copying, kernel launch overhead etc (serial part) takes much more time than kernel execution (parallel part). However, usually this is not true.
Also, the individual GPU core does not have the same performance as CPU core, so you should do some scaling, making Amdah'l law 1 / [(1-p) + k*p/N]
(at it's simplest, k = Frequency(CPU) / Frequency(GPU)
, sometimes k is increased even more to take into account architectural differences, like CPU core having SIMD block).
I could also argue against literally applying Amdahl's law to real systems. Sure, it shows the general trend, but it does not grasp some non-trivial processes.
First, Amdahl's law assumes that given infinite number of cores the parallel part is executed instantly. This assumption is not true (though sometimes it might be pretty accurate). Even if you calculate the sum of two vectors, you can't compute it faster than it takes to add two bytes. One can neglect this "quanta", or include it in serial portion of algorithm, but it somewhat "breaks" the idea.
How to correctly estimate in Amdahl's law the effect of barrier synchronization, critical section, atomic operations etc. is, to the best of my knowledge, unresolved mystery. Such operations belong to parallel part, but walltime of their execution is at best independent of the number of threads and, at worst, is positively dependent.
Simple example: broadcasting time between computational nodes in CPU cluster scales as O(log N)
. Some initial initialization can take up to O(N)
time.
In simple cases one can somewhat estimate the benefit of parallelisation of the algorithm, but (as often the case with CUDA) the static overhead of using the parallel processing might take more time, than parallel processing itself saves.
So, in my opinion, it is usually simpler to write application, measure it's performance and use it to plot Amdahl's curve than trying to a priori correctly estimate all the nuances of algorithm and hardware. In case where such estimations could be easily made, they are usually obvious without any "laws".
Amdahl's law actually states that the speedup is less than or equal that fraction. So, that is the theoretical maximum, and the actual speedup will be less than always
© 2022 - 2024 — McMap. All rights reserved.
N
the device will be saturated, and doubling the number of threads will ~double the time required to process them. In my opinion, even on "classical" CPU clusters Amdahl's law is not applicable due to complex synchronization schemes etc. It rather shows the general idea of parallelisation, but is not suitable for actual performance estimations. – CantilenaN
should be the number of physical cores on GPU. However, due to high memory latency, each core can easily handle several threads, changing from one to another as they wait for data to arrive. However, there is some values of threads/core ratio (which depends on exact algorithm) when core is saturated (i.e. memory latency is not enough to hide multiple threads). So you should determine when you GPU becomes saturated (i.e. when execution time significantly increases with increasing number of threads) and use this "effective core number". – CantilenaMPI_Bcast
(which scales asO(log N)
) considered serial or parallel? – Cantilena