Do bank conflicts occur on non-GPU hardware?
This blog post explains how memory bank conflicts kill the transpose function's performance.

Now I can't help but wonder: does the same happen on a "normal" CPU (in a multithreaded context)? Or is this specific to CUDA/OpenCL? Or does it not even show up on modern CPUs because of their relatively large caches?
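For concreteness, here is a minimal CPU-side sketch (not from the blog post) of the access pattern in question: a naive transpose whose writes stride through memory by the matrix width. The size, element type, and function name are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// Naive transpose of a row-major N x N matrix.
// The reads of A are sequential, but the writes to B jump by N floats on
// every inner-loop iteration. When N is a power of two, those writes keep
// landing in the same cache set / DRAM bank, which is the CPU-side analogue
// of the shared-memory bank conflicts described for the GPU transpose.
void transpose_naive(const std::vector<float>& A, std::vector<float>& B,
                     std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            B[j * N + i] = A[i * N + j];
}
```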

Responser answered 19/6, 2014 at 14:09 Comment(7)
GPUs and CPUs access memory in the same way, and cache isn't magical. Transposing on a CPU will be slow too. – Whithersoever
Yes, CPUs have cache bank conflicts as well. I've personally observed slowdowns of more than 10x on an AMD Piledriver when writing 5 streams spaced apart by the critical stride, even though the data fits in L1 cache (a sketch of this access pattern appears after these comments). – Mucin
I admit that cache bank conflicts and false aliasing are different things, but they are difficult to distinguish. So it's possible that I was hitting false aliasing instead of bank conflicts. – Mucin
They definitely do suffer from bank conflicts, although the details are an artifact of the exact microarchitecture in question. See here, for example, for the banking changes between Sandy Bridge and Haswell. – Elegist
@rubenvb: I removed the CUDA tag from this question for a reason - it has nothing to do with CUDA programming. Why did you re-add it? – Erle
@Erle I honestly didn't notice you had removed it. It was there to give this question visibility with the people who might actually know the answer. This question has nothing to do with C or OpenCL either, per se, yet it does, because the people following those tags might know the answer. In fact, it has a lot to do with CUDA programming, because I found the issue on a CUDA blog. But anyway, remove it if you feel strongly it shouldn't be there. – Responser
The transpose is a memory-bound O(n^2) operation. CPU cores are too fast relative to memory; the "cores" of GPUs are much slower, so the transpose should be relatively more efficient on the GPU. Fast-core CPUs need O(n^3) operations to compete. Intel will brag about O(n^3) operations like matrix multiplication but avoid O(n^2) operations. The L4 cache in some Haswell processors will help CPUs compete. – Ceramist
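Below is a minimal sketch of the five-stream, critical-stride write pattern mentioned in the comment about Piledriver. The 4096-byte spacing (1024 floats) is an assumption based on a 16 KiB, 4-way, 64-byte-line L1 data cache, roughly Piledriver's; with that geometry, five streams exceed the four ways of a set. The buffer sizes and names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Write five streams whose base addresses are spaced by `stride_floats`.
// If the spacing equals the cache's critical stride, every stream maps to
// the same L1 set and the lines evict one another, even though the total
// amount of data would easily fit in L1.
void write_five_streams(std::vector<float>& buf, std::size_t stride_floats,
                        std::size_t len) {
    float* s0 = buf.data() + 0 * stride_floats;
    float* s1 = buf.data() + 1 * stride_floats;
    float* s2 = buf.data() + 2 * stride_floats;
    float* s3 = buf.data() + 3 * stride_floats;
    float* s4 = buf.data() + 4 * stride_floats;
    for (std::size_t i = 0; i < len; ++i) {
        s0[i] = s1[i] = s2[i] = s3[i] = s4[i] = static_cast<float>(i);
    }
}

// Usage: a stride of 1024 floats (4096 bytes) provokes the conflict;
// padding it to 1024 + 16 floats spreads the streams across sets.
// std::vector<float> buf(5 * 1040 + 4096);
// write_five_streams(buf, 1024, 4096);        // conflict-prone
// write_five_streams(buf, 1024 + 16, 4096);   // padded
```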

There have been bank conflicts since the earliest vector-processing CPUs of the 1960s. They are caused by interleaved memory or multi-channel memory access.

Interleaved memory access, or multi-channel memory access (MCMA), solves the problem of slow RAM access by phasing accesses to successive words of memory through different banks or different channels. But there is a side effect: accessing the same bank again before it is ready takes longer than accessing an adjacent bank.
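A small illustration (not from the answer) of word interleaving: consecutive words map to consecutive banks, so a stride that is a multiple of the bank count keeps returning to the same bank before it has recovered. The bank count of 8 is an arbitrary assumption for the example.

```cpp
#include <cstdio>

int main() {
    const unsigned num_banks = 8;  // assumed number of interleaved banks

    // Unit-stride access cycles through all banks in turn...
    for (unsigned word = 0; word < 8; ++word)
        std::printf("stride 1: word %2u -> bank %u\n", word, word % num_banks);

    // ...while a stride equal to the bank count hits bank 0 every time.
    for (unsigned word = 0; word < 64; word += num_banks)
        std::printf("stride 8: word %2u -> bank %u\n", word, word % num_banks);

    return 0;
}
```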

From the Wikipedia article on the 1980s Cray-2, http://en.wikipedia.org/wiki/Cray-2:

"Main memory banks were arranged in quadrants to be accessed at the same time, allowing programmers to scatter their data across memory to gain higher parallelism. The downside to this approach is that the cost of setting up the scatter/gather unit in the foreground processor was fairly high. Stride conflicts corresponding to the number of memory banks suffered a performance penalty (latency) as occasionally happened in power-of-2 FFT-based algorithms. As the Cray 2 had a much larger memory than Cray 1's or X-MPs, this problem was easily rectified by adding an extra unused element to an array to spread the work out"

Ondine answered 22/7, 2014 at 15:37 Comment(0)
