physical memory on AMD devices: local vs private
Asked Answered
K

3

8

I'm writing an algorithm in OpenCL in which I'd need every work unit to remember a fair portion of data, say something between a long[70] and a long[200] or so per kernel.

Recent AMD devices have 32 KiB __local memory, which is (for the given amount of data per kernel) enough to store the info for 20-58 work units. However, from what I understand from the architecture (and especially from this drawing), each shader core also has a dedicated amount of private memory. I however fail to find its size.

Can anyone tell me how to find out how much private memory each kernel has?

I'm particularly curious about the HD7970, since I plan to buy some of these soon.

Edit: Problem solved, the answer is here in appendix D.

Krems answered 17/2, 2012 at 16:13 Comment(8)
I don't believe private memory is dedicated per core - it maps to the register file, which is per compute unit resource. Each work item gets registers allocated from the compute unit register file, how many are required determines the number of wavefronts in flight at any given instant.Blatman
From the famous everywhere-seen drawing codeproject.com/KB/showcase/Memory-Spaces/image001.jpg I concluded that the private memory is physically different from the __local memory, no?Krems
Yes, they are physically different. Private memory maps to the compute unit register file, local memory to compute unit level shared memory in most modern AMD devices. A few early OpenCL compatible GPUs didn't have on die shared memory, and local memory was just SDRAM. Neither is per core, and how much you use per work item for private and per work group for local effects the number of concurrent wavefronts running per compute unit.Blatman
Ok. Then I should re-word my question: how large is this register file? How to find out its size, either in general or for the HD7970 specifically.Krems
You have not understood, I think - private memory is (like the name says) private to each work item. But it is allocated to each work item from the compute unit register file(s), which acts as a common resource pool for all the work items running on a given compute unit. And I am pretty sure AMD's compiler puts a hard limit of 256 registers per work unit, irrespective of the size of the register file(s) on the GPU.Blatman
And what is the size of 1 register? 64 bits? If so, that's a hard limit of 2 KB per work item, which is quite huge, no? I assume it should be much smaller (otherwise my problem is trivially solved as it can contain the long[200] purely in the register).Krems
I think each register is a 32 bit word. But remember that all of the other variables in your code also consume registers. I think I remember typical AMD GPUs have a 64kb register file per compute unit which needs to be shared by a minimum either 4 or 8 wavefronts of 64 work items each. But I don't use their hardware much , so that might not be correct. Check the current release notes in their OpenCL SDK.Blatman
Indeed, there it is, thanks! It's in Appendix D of the AMD APP OpenCL Programming Guide developer.amd.com/sdks/amdappsdk/assets/…. Apparently a register is 128 bits (4x32) and there are 16384 for all modern high-end devices, so that's a remarkable 256KB per compute unit. Nice! If you can put this in a new answer, I can accept it and close the topic.Krems
K
4

The answer was given by user talonmies in the comments, so I'll write it in a new answer here to close the question.

These values can be found in Appendix D of the AMD APP OpenCL Programming Guide http://developer.amd.com/sdks/amdappsdk/assets/amd_accelerated_parallel_processing_opencl_programming_guide.pdf (a similar document exists for nVidia). Apparently a register is 128 bits (4x32) for AMD devices and there are 16384 registers for all modern high-end devices, so that's a remarkable 256KB per compute unit.

Krems answered 1/3, 2012 at 12:16 Comment(0)
C
0

I think you are looking for __local memory. That is what 32KB of local data storage is referring to. I don't think you can poll the device to get the private memory amount.

You can pass in a NULL long* cl_mem reference to allocate the memory. I think it is best to use a static amount of memory per WI. Assuming that long[200] will be required for each work item, you would use the code below. It would also be a good idea to divide the work into groups that have the same (or similar) memory requirements, in order to get the most out of the LDS memory.

void __kernel(__local long* localMem, const int localMemPerItem
       //more args...
       )
{
  //host has 'passed' localMemPerItem*get_local_size() long values in as locamMem
  //this work item has access to all of it, but can choose to restrict
  //itself to only the portion it needs.
  //work group size will be limited to CL_DEVICE_LOCAL_MEM_SIZE/(8*localMemPerItem)
  int startIndex=localMemPerItem*get_local_id(0);
  //use localMem[startIndex+ ... ]
}
Commercial answered 17/2, 2012 at 18:41 Comment(2)
You cannot poll it, but does it exist? From the famous everywhere-seen drawing codeproject.com/KB/showcase/Memory-Spaces/image001.jpg I assumed that there is a physically separate set of private registers on each work unit. No? I hoped to somehow do better than a CL_DEVICE_LOCAL_MEM_SIZE/(8*localMemPerItem) limitation, as it roughly leaves half of the cores unused. Accessing global memory would probably be way too slow, even though it is only incrementing a counter.Krems
I found some more info about cypress, cayman, and fermi register sizes here: realworldtech.com/page.cfm?ArticleID=RWT121410213827&p=11 You should be able to tweak some decent sized private vars into that size. I think that the LDS will still be your best bet though.Commercial
M
0

To answer how large is register file in a 79xx series card, since its based on GCN architecture it is 64KB as per the image in anandtech : http://www.anandtech.com/print/5261

To answer your question how to find out how much memory each kernel uses.. you can look run AMD APP Profiler on your kernel, it tell you in the kernel occupancy section how much space is utilized by the kernel.

Margret answered 20/2, 2012 at 14:50 Comment(2)
Oh really? That's weird. I thought to have found the answer, but it's a different one. In the AMD OpenCL programming guide developer.amd.com/sdks/amdappsdk/assets/… in Appendix D, there is the total register file size, and it's listed as 256 KB for all modern devices. Which is correct now? :SKrems
I believe both are correct. As I understand it, In the GCN architecture, one SIMD unit has a 64kb of register file, and there are 4 SIMD units per compute unit, ie. 4 * 64kb = 256kb of total register file per compute unit.Blatman

© 2022 - 2024 — McMap. All rights reserved.