Do bank conflicts occur on non-GPU hardware?
This blog post explains how memory bank conflicts kill the transpose function's performance.

Now I can't help but wonder: does the same happen on a "normal" CPU (in a multithreaded context)? Or is this specific to CUDA/OpenCL? Or does it not even show up on modern CPUs because of their relatively large caches?
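For concreteness, here is a minimal CPU-side sketch (not from the blog post) of the access pattern in question: a naive transpose whose writes stride through memory by the matrix width. The size, element type, and function name are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// Naive transpose of a row-major N x N matrix.
// The reads of A are sequential, but the writes to B jump by N floats on
// every inner-loop iteration. When N is a power of two, those writes keep
// landing in the same cache set / DRAM bank, which is the CPU-side analogue
// of the shared-memory bank conflicts described for the GPU transpose.
void transpose_naive(const std::vector<float>& A, std::vector<float>& B,
                     std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            B[j * N + i] = A[i * N + j];
}
```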

Responser answered 19/6, 2014 at 14:09 Comment(7)
GPUs and CPUs access memory in the same way, and cache isn't magical. Transposing on a CPU will be slow too. – Whithersoever
Yes, CPUs have cache bank conflicts as well. I've personally observed slowdowns of more than 10x on an AMD Piledriver when writing 5 streams spaced apart by the critical stride, even though the data fits in L1 cache (a sketch of this access pattern appears after these comments). – Mucin
I admit that cache bank conflicts and false aliasing are different things, but they are difficult to distinguish. So it's possible that I was hitting false aliasing instead of bank conflicts. – Mucin
They definitely do suffer from bank conflicts, although the details are an artifact of the exact microarchitecture in question. See here, for example, for the banking changes between Sandy Bridge and Haswell. – Elegist
@rubenvb: I removed the CUDA tag from this question for a reason - it has nothing to do with CUDA programming. Why did you re-add it? – Erle
@Erle I honestly didn't notice you had removed it. It was there to give this question visibility with the people who might actually know the answer. This question has nothing to do with C or OpenCL either, per se, yet it does, because the people following those tags might know the answer. In fact, it has a lot to do with CUDA programming, because I found the issue on a CUDA blog. But anyway, remove it if you feel strongly it shouldn't be there. – Responser
The transpose is a memory-bound O(n^2) operation. CPU cores are too fast relative to memory; the "cores" of GPUs are much slower, so the transpose should be relatively more efficient on the GPU. Fast-core CPUs need O(n^3) operations to compete. Intel will brag about O(n^3) operations like matrix multiplication but avoid O(n^2) operations. The L4 cache in some Haswell processors will help CPUs compete. – Ceramist
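Below is a minimal sketch of the five-stream, critical-stride write pattern mentioned in the comment about Piledriver. The 4096-byte spacing (1024 floats) is an assumption based on a 16 KiB, 4-way, 64-byte-line L1 data cache, roughly Piledriver's; with that geometry, five streams exceed the four ways of a set. The buffer sizes and names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Write five streams whose base addresses are spaced by `stride_floats`.
// If the spacing equals the cache's critical stride, every stream maps to
// the same L1 set and the lines evict one another, even though the total
// amount of data would easily fit in L1.
void write_five_streams(std::vector<float>& buf, std::size_t stride_floats,
                        std::size_t len) {
    float* s0 = buf.data() + 0 * stride_floats;
    float* s1 = buf.data() + 1 * stride_floats;
    float* s2 = buf.data() + 2 * stride_floats;
    float* s3 = buf.data() + 3 * stride_floats;
    float* s4 = buf.data() + 4 * stride_floats;
    for (std::size_t i = 0; i < len; ++i) {
        s0[i] = s1[i] = s2[i] = s3[i] = s4[i] = static_cast<float>(i);
    }
}

// Usage: a stride of 1024 floats (4096 bytes) provokes the conflict;
// padding it to 1024 + 16 floats spreads the streams across sets.
// std::vector<float> buf(5 * 1040 + 4096);
// write_five_streams(buf, 1024, 4096);        // conflict-prone
// write_five_streams(buf, 1024 + 16, 4096);   // padded
```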

There have been bank conflicts since the earliest vector-processing CPUs of the 1960s. They are caused by interleaved memory or multi-channel memory access.

Interleaved memory access, or multi-channel memory access (MCMA), solves the problem of slow RAM access by phasing accesses to successive words of memory through different banks or different channels. But there is a side effect: accessing the same bank again before it is ready takes longer than accessing an adjacent bank.
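A small illustration (not from the answer) of word interleaving: consecutive words map to consecutive banks, so a stride that is a multiple of the bank count keeps returning to the same bank before it has recovered. The bank count of 8 is an arbitrary assumption for the example.

```cpp
#include <cstdio>

int main() {
    const unsigned num_banks = 8;  // assumed number of interleaved banks

    // Unit-stride access cycles through all banks in turn...
    for (unsigned word = 0; word < 8; ++word)
        std::printf("stride 1: word %2u -> bank %u\n", word, word % num_banks);

    // ...while a stride equal to the bank count hits bank 0 every time.
    for (unsigned word = 0; word < 64; word += num_banks)
        std::printf("stride 8: word %2u -> bank %u\n", word, word % num_banks);

    return 0;
}
```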

From the Wikipedia article on the 1980s Cray-2, http://en.wikipedia.org/wiki/Cray-2:

"Main memory banks were arranged in quadrants to be accessed at the same time, allowing programmers to scatter their data across memory to gain higher parallelism. The downside to this approach is that the cost of setting up the scatter/gather unit in the foreground processor was fairly high. Stride conflicts corresponding to the number of memory banks suffered a performance penalty (latency) as occasionally happened in power-of-2 FFT-based algorithms. As the Cray 2 had a much larger memory than Cray 1's or X-MPs, this problem was easily rectified by adding an extra unused element to an array to spread the work out"

Ondine answered 22/7, 2014 at 15:37 Comment(0)
