Why is the im2col-with-GEMM method more efficient than a direct implementation with SIMD in CNNs?
The convolutional layers are the most computationally intensive parts of convolutional neural networks (CNNs). Currently, the common approach to implementing convolutional layers is to expand the image into a column matrix (im2col) and perform Multiple Channel Multiple Kernel (MCMK) convolution using an existing parallel General Matrix Multiplication (GEMM) library. However, the im2col operation has to load and store the image data, and it also needs another memory block to hold the intermediate data.
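For concreteness, here is a minimal im2col sketch (my own illustration, assuming one row-major CHW image, stride 1, no padding; the names are not from any particular library). After this transform, the whole layer becomes a single GEMM: an [M x C*K*K] filter matrix times the [C*K*K x OH*OW] column matrix. Note that the column matrix is K*K times larger than the input, which is the extra memory block mentioned above.

```c++
// Hypothetical im2col for one CHW image, stride 1, no padding.
// 'col' has shape [C*K*K, OH*OW]; each column holds one receptive field.
void im2col(const float* im, int C, int H, int W, int K, float* col) {
    const int OH = H - K + 1, OW = W - K + 1;
    for (int c = 0; c < C; ++c)
        for (int kh = 0; kh < K; ++kh)
            for (int kw = 0; kw < K; ++kw) {
                int row = (c * K + kh) * K + kw;   // row in the column matrix
                for (int oh = 0; oh < OH; ++oh)
                    for (int ow = 0; ow < OW; ++ow)
                        col[row * (OH * OW) + oh * OW + ow] =
                            im[(c * H + oh + kh) * W + ow + kw];
            }
}
```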

If I needed to optimize a convolutional implementation, I might instead choose a direct implementation with SIMD instructions. Such a method would not incur the overhead of those extra memory operations.
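For example, a direct implementation might look like the following sketch (assuming AVX2/FMA, a single channel, stride 1, no padding; the function name and layout are mine). It reads the input in place, with no im2col copy:

```c++
// Hypothetical direct 2D convolution, vectorized over the output width.
#include <immintrin.h>

void conv2d_direct(const float* in, int H, int W,
                   const float* k, int K, float* out) {
    const int OH = H - K + 1, OW = W - K + 1;
    for (int oh = 0; oh < OH; ++oh) {
        int ow = 0;
        for (; ow + 8 <= OW; ow += 8) {           // 8 outputs per iteration
            __m256 acc = _mm256_setzero_ps();
            for (int kh = 0; kh < K; ++kh)
                for (int kw = 0; kw < K; ++kw) {
                    __m256 x = _mm256_loadu_ps(in + (oh + kh) * W + ow + kw);
                    __m256 w = _mm256_set1_ps(k[kh * K + kw]);
                    acc = _mm256_fmadd_ps(x, w, acc);
                }
            _mm256_storeu_ps(out + oh * OW + ow, acc);
        }
        for (; ow < OW; ++ow) {                   // scalar tail
            float s = 0.f;
            for (int kh = 0; kh < K; ++kh)
                for (int kw = 0; kw < K; ++kw)
                    s += in[(oh + kh) * W + ow + kw] * k[kh * K + kw];
            out[oh * OW + ow] = s;
        }
    }
}
```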

Yet, near the end of the following post, the author claims that "the benefits from the very regular patterns of memory access outweigh the wasteful storage costs":

https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/

So I hope to understand the reason. Do the floating-point operations require more instruction cycles? Or, since the input image is not that large, might it stay resident in the cache, so that the memory operations never have to access DDR and cost fewer cycles?

Blotto answered 12/11, 2018 at 10:29

Cache-blocking a GEMM is possible, so you get mostly L1 cache hits (see also What Every Programmer Should Know About Memory?).
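As a sketch of what cache blocking means (tile size and names are illustrative, not taken from any particular BLAS), the three outer loops walk over tiles small enough that the working set stays hot in a fast cache level:

```c++
// Minimal cache-blocked (tiled) matmul sketch: C += A * B, all row-major N x N.
#include <algorithm>

constexpr int T = 64;  // tile edge; real libraries tune this per cache level

void gemm_blocked(const float* A, const float* B, float* C, int N) {
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                // work on one tile of C; the A and B tiles stay hot in cache
                for (int i = ii; i < std::min(ii + T, N); ++i)
                    for (int k = kk; k < std::min(kk + T, N); ++k) {
                        float a = A[i * N + k];
                        for (int j = jj; j < std::min(jj + T, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

With T chosen so that a few T x T float tiles fit in L1d, each element loaded from memory is reused on the order of T times before eviction, instead of once.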

Fitting in the large shared L3 cache on typical x86 CPUs is not sufficient to make things efficient. The per-core L2 caches are typically 256kiB, and even that's slower than the 32kiB L1d cache.

Memory latency is very high compared to a CPU core clock, but memory/cache bandwidth is not terrible these days with fast DDR4 or L3 cache hits. (But like I said, for a matmul with good cache blocking / loop tiling, you can reuse data while it's still hot in L1d if you only transpose parts of the input matrix on the fly. Reducing off-core bandwidth requirements is also important for efficient matmul, not just transposing one matrix so its columns are sequential in memory.)

Beyond that, sequential access to memory is essential for efficient SIMD (loading a vector of multiple contiguous elements lets you multiply / add / whatever 4 or 8 packed float elements with one CPU instruction). Striding down columns in a row-major matrix would hurt throughput even if the matrix were small enough to fit in L1d cache (32kiB).
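To illustrate the contrast (a sketch assuming AVX2; the helper names are mine): reading along a row is one contiguous vector load, while reading down a column of a row-major matrix needs a gather (or 8 scalar loads) touching 8 different cache lines:

```c++
#include <immintrin.h>

// Row access: 8 consecutive floats, spanning at most two cache lines.
__m256 load_row(const float* m, int ncols, int r, int c0) {
    return _mm256_loadu_ps(&m[r * ncols + c0]);
}

// Column access: 8 floats each 'ncols' elements apart; with ncols >= 16,
// every element lands in a different cache line.
__m256 load_col(const float* m, int ncols, int r0, int c) {
    __m256i idx = _mm256_setr_epi32(0, ncols, 2 * ncols, 3 * ncols,
                                    4 * ncols, 5 * ncols, 6 * ncols, 7 * ncols);
    return _mm256_i32gather_ps(&m[r0 * ncols + c], idx, 4);
}
```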

Fortunate answered 12/11, 2018 at 16:12
