I just learned (from "Why only one of the warps is executed by a SM in cuda?") that Kepler GPUs can actually execute instructions from several (apparently four) warps at once.
Can a shared memory bank also serve four requests at once? If not, bank conflicts could occur between threads of different warps that happen to execute concurrently, even when none of the individual warps has a bank conflict on its own, right? Is there any information on this?
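
To make clear what I mean by "no bank conflicts within a warp", here is a minimal host-side sketch of the usual single-warp model. It assumes 32 banks that are each 4 bytes wide and 32 threads per warp; those parameters and the helper names `bank_of` and `conflict_degree` are my own illustrative assumptions, not taken from the linked question. It only models accesses within one warp and says nothing about how requests from different warps interact, which is what I am asking about.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <set>

// Assumed model parameters (illustrative, not from the linked question):
// 32 banks, 4-byte bank width, 32 threads per warp.
constexpr int kNumBanks = 32;
constexpr int kBankWidthBytes = 4;
constexpr int kWarpSize = 32;

// Bank holding the 32-bit word at the given shared-memory byte offset.
int bank_of(std::uint32_t byte_offset) {
    return static_cast<int>((byte_offset / kBankWidthBytes) % kNumBanks);
}

// Conflict degree of one warp-wide access in which thread t reads the
// 4-byte word at index t * stride_words. Threads that read the same word
// are counted once (broadcast), so the result is the largest number of
// distinct words that map to any single bank. A result of 1 means the
// access is conflict-free under this model.
int conflict_degree(int stride_words) {
    std::array<std::set<std::uint32_t>, kNumBanks> words_per_bank;
    for (int t = 0; t < kWarpSize; ++t) {
        std::uint32_t word =
            static_cast<std::uint32_t>(t) * static_cast<std::uint32_t>(stride_words);
        words_per_bank[bank_of(word * kBankWidthBytes)].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& words : words_per_bank) {
        worst = std::max(worst, words.size());
    }
    return static_cast<int>(worst);
}
```

Under these assumptions a stride of 1 or 33 words gives a degree of 1 (conflict-free), a stride of 2 gives a 2-way conflict, and a stride of 32 puts every thread on the same bank.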