Background
I'm trying to understand whether a GPU's Last-Level Cache is invalidated or preserved across multiple kernel launches, since preserved cache contents could increase the effective memory bandwidth. I'm aware that this possibly depends on the specific GPU architecture. If the cache is indeed preserved, at least on some architectures, perhaps kernels can be carefully written to exploit it as a communication buffer when kernel fusion is not feasible.
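Concretely, the pattern I have in mind is a producer kernel that writes an intermediate buffer which a consumer kernel reads in the very next launch; if that buffer fits in the Last-Level Cache and survives the launch boundary, the DRAM round trip between the two kernels could be avoided. A minimal sketch of the pattern (kernel names and sizes are purely illustrative):

```cuda
// Producer: writes an intermediate buffer that the next kernel consumes.
__global__ void stage1(const float *in, float *intermediate, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) intermediate[i] = in[i] * 2.0f;
}

// Consumer: launched right after stage1, reads the intermediate buffer.
// The open question is whether `intermediate` can still be served from the
// Last-Level Cache here instead of DRAM.
__global__ void stage2(const float *intermediate, float *out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = intermediate[i] + 1.0f;
}

// Host side:
//   stage1<<<grid, block>>>(d_in, d_tmp, n);
//   stage2<<<grid, block>>>(d_tmp, d_out, n);   // is d_tmp still in the LLC here?
```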
However, the answers currently available on the Web are unclear, contradictory, and often outdated. I found a couple of posts on the Nvidia and AMD developer forums without a clear answer; the best was a suggestion to measure the behavior with micro-benchmarks (I sketch such a benchmark after the quoted answers below). On Stack Exchange, there are also several related questions:
In the question NVidia CUDA: cache L2 and multiple kernel invocations from 2011, Zk1001 answered:
Assuming you are talking about L2 data cache in Fermi. I think the caches are flushed after each kernel invocation. In my experience, running two consecutive launches of the same kernel with a lots of memory accesses (and #L2 cache misses) doesn't make any substantial changes to the L1/L2 cache statistics.
This answer only shows that when the working set is large, the L2 cache is too small to capture any temporal locality. That is consistent with my own observations (the same holds on CPUs), but it says nothing about whether the cache contents actually persist across kernel launches.
In the question How does cache affect while a same kernel is being launched repeatedly from 2016, Melissa P answered:
For AMD Radeon GCNs, L1 and L2 cache is persistent between all kernels and all different kernels. A kernel can use cached data from any other kernel. Additionally, Local Memory inside a Compute Unit is not cleared/zeroed between kernel runs (more precisely, between work-group runs). This means you have to initialize local variables. The same should apply for nVidia/CUDA devices and generic SIMD CPUs.
That being said, OpenCL does not know or define different level of caches, caches are vendor specific. Any functionality that handles or manages caching is a vendor specific extension.
But this answer does not cite any source.
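For reference, the micro-benchmark suggested on the forums could look roughly like the following: time a kernel that re-reads a buffer immediately after a previous launch touched it, then compare against a run where a large scrub buffer is read in between to evict the L2 contents. This is only a sketch under my own assumptions (the buffer sizes and the kernel and helper names are mine, and launch overhead and timer resolution would need more care in a real measurement):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride reduction that simply streams through `in` once.
__global__ void read_sum(const float *in, float *out, size_t n) {
    float acc = 0.0f;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        acc += in[i];
    atomicAdd(out, acc);
}

// Time a single launch with CUDA events (coarse resolution; a real benchmark
// would repeat each case many times and average).
static float timed_launch(const float *buf, float *out, size_t n) {
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    read_sum<<<1024, 256>>>(buf, out, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    return ms;
}

int main() {
    const size_t n = 4u << 20;     // 16 MiB of floats: fits in a 40 MB L2
    const size_t big = 256u << 20; // 1 GiB scrub buffer: should evict L2
    float *buf, *scrub, *out;
    cudaMalloc(&buf, n * sizeof(float));
    cudaMalloc(&scrub, big * sizeof(float));
    cudaMalloc(&out, sizeof(float));
    cudaMemset(buf, 0, n * sizeof(float));
    cudaMemset(scrub, 0, big * sizeof(float));

    timed_launch(buf, out, n);              // first launch: brings `buf` into L2
    float warm = timed_launch(buf, out, n); // back-to-back launch on same data
    timed_launch(scrub, out, big);          // large read to evict `buf` from L2
    float cold = timed_launch(buf, out, n); // same launch after eviction

    printf("back-to-back: %.3f ms, after eviction: %.3f ms\n", warm, cold);
    // If the back-to-back launch is consistently faster, data touched by the
    // previous kernel was still resident in L2 across the launch boundary.

    cudaFree(buf);
    cudaFree(scrub);
    cudaFree(out);
    return 0;
}
```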
In the NVIDIA A100 Tensor Core GPU Architecture whitepaper, Nvidia states:
Alongside the raw data bandwidth improvements, A100 improves data fetch efficiency and reduces DRAM bandwidth demand with a 40 MB L2 cache that is almost 7x larger than that of Tesla V100. To fully exploit the L2 capacity A100 includes improved cache management controls. Optimized for neural network training and inferencing as well as general compute workloads, the new controls ensure that data in the cache is used more efficiently by minimizing writebacks to memory and keeping reused data in L2 to reduce redundant DRAM traffic.
For example, for DL inferencing workloads, ping-pong buffers can be persistently cached in the L2 for faster data access, while also avoiding writebacks to DRAM. For producer-consumer chains, such as those found in DL training, L2 cache controls can optimize caching across the write-to-read data dependencies. In LSTM networks, recurrent weights that are shared across multiple GEMM operations can be preferentially cached and reused in L2.
It appears that a persistent Last-Level Cache is supported at least on the Nvidia A100, but it's unclear whether other GPU architectures behave the same way.
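On devices that expose these controls (compute capability 8.0 and later, with CUDA 11+), the cache management mentioned in the whitepaper surfaces in the runtime API as an "access policy window" that marks accesses to a buffer as persisting in L2. A minimal sketch, assuming such a device; the function name, stream, and buffer are placeholders of mine:

```cuda
#include <algorithm>
#include <cuda_runtime.h>

// Mark `buffer` as a candidate for persistent L2 caching for kernels
// launched into `stream` (compute capability 8.0+, CUDA 11+).
void configure_persistent_l2(cudaStream_t stream, void *buffer, size_t num_bytes) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Carve out a portion of L2 for persisting accesses (capped by the device limit).
    size_t carveout = std::min(num_bytes, (size_t)prop.persistingL2CacheMaxSize);
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, carveout);

    // Attach an access policy window to the stream: hits inside the window are
    // treated as persisting, misses as streaming.
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = buffer;
    attr.accessPolicyWindow.num_bytes =
        std::min(num_bytes, (size_t)prop.accessPolicyMaxWindowSize);
    attr.accessPolicyWindow.hitRatio  = 1.0f; // fraction of the window to treat as persisting
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```

This is an explicit opt-in mechanism, though; my question is about the default behavior when no such hints are given.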
Question
As GPUs have started to include larger Last-Level Caches, such as the 40 MB L2 cache in the Nvidia A100, the 128 MiB Infinity Cache in AMD RDNA2, and the 96 MiB Infinity Cache in AMD RDNA3, using the Last-Level Cache as a communication buffer across kernels is becoming at least theoretically feasible. So, what is the Last-Level Cache invalidation behavior across kernel launches on current GPU architectures: is the cache flushed between launches, or are its contents preserved?