I'm not real familiar with CUDA directly but I do have some experience in OpenGL and DirectX and I am also familiar with 3D Graphics Rendering APIs, Libraries and Pipelines and having the ability to setup and use those APIs.
When I look at your question(s):
How to deal with OpenGL cube map textures in CUDA?
And you proceed to explain it by this:
When one want to use OpenGL textures in CUDA kernel one of the things to do is to retrieve a CUDA array from registered image and mapped resource, in this case a texture. In driver API it is done by cuGraphicsSubResourceGetMappedArray call, which in a case of 2D texture is not a problem. But when talking about aforementioned cube map, third parameter of this function requires a face enum (like CU_CUBEMAP_FACE_POSITIVE_X). Thus some questions arise - when one passes such an enum, then the returned texture array will contain only data of that particular face, right? Then how to use cube texture as a whole, to perform cube mapping, likewise:
color = texCube(cubeMap, x, y, z);
Or is it impossible to do so in CUDA kernal and one need to use 2D textures with proper calculations and sampling in user code?
I went to CUDA's website for their API SDK & Programming Documentations. And found the function in question cuGraphicsSubResourceGetMappedArray()
CUresult cuGraphicsSubResourceGetMappedArray ( CUarray* pArray,
CUgraphicsResource resource,
unsigned int arrayIndex,
unsigned int mipLevel )
Get an array through which to access a subresource of a mapped graphics resource.
Parameters
- pArray - Returned array through which a subresource of resource may be accessed
- resource - Mapped resource to access
- arrayIndex - Array index for array textures or cubemap face index as defined by CUarray_cubemap_face for cubemap textures for the subresource to access
- mipLevel - Mipmap level for the subresource to access
Returns
CUDA_SUCCESS
, CUDA_ERROR_DEINITIALIZED
, CUDA_ERROR_NOT_INITIALIZED
,
CUDA_ERROR_INVALID_CONTEXT
, CUDA_ERROR_INVALID_VALUE
,
CUDA_ERROR_INVALID_HANDLE
, CUDA_ERROR_NOT_MAPPED
,
CUDA_ERROR_NOT_MAPPED_AS_ARRAY
Description
Returns in *pArray an array through which the subresource of the mapped graphics resource resource which corresponds to array index arrayIndex and mipmap level mipLevel may be accessed. The value set in *pArray may change every time that resource is mapped.
If resource
is not a texture
then it cannot be accessed via an array
and CUDA_ERROR_NOT_MAPPED_AS_ARRAY
is returned. If arrayIndex
is not a valid array index
for resource
then CUDA_ERROR_INVALID_VALUE
is returned. If mipLevel
is not a valid mipmap level
for resource
then CUDA_ERROR_INVALID_VALUE
is returned. If resource is not mapped
then CUDA_ERROR_NOT_MAPPED
is returned.
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cuGraphicsResourceGetMappedPointer
Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4ic22V4Dz
Follow us: @GPUComputing on Twitter | NVIDIA on Facebook
This function method was found in NVidia CUDA's DriverAPI
and not in their RuntimeAPI
. When understanding hardware with CUDA capability is that there is a difference between the Host
and Device
programmable pipelines which can be found here: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#axzz4ic6tFjXR
2. Heterogeneous Computing
CUDA programming involves running code on two different platforms concurrently: a host system with one or more CPUs and one or more CUDA-enabled NVIDIA GPU devices.
While NVIDIA GPUs are frequently associated with graphics, they are also powerful arithmetic engines capable of running thousands of lightweight threads in parallel. This capability makes them well suited to computations that can leverage parallel execution.
However, the device is based on a distinctly different design from the host system, and it's important to understand those differences and how they determine the performance of CUDA applications in order to use CUDA effectively.
- 2.1. Differences between Host and Device
The primary differences are in threading model and in separate physical memories:
- Threading resources - Execution pipelines on host systems can support a limited number of concurrent threads. Servers that have four hex-core processors today can run only 24 threads concurrently (or 48 if the CPUs support Hyper-Threading.) By comparison, the smallest executable unit of parallelism on a CUDA device comprises 32 threads (termed a warp of threads). Modern NVIDIA GPUs can support up to 1536 active threads concurrently per multiprocessor (see Features and Specifications of the CUDA C Programming Guide) On GPUs with 16 multiprocessors, this leads to more than 24,000 concurrently active threads.
- Threads - Threads on a CPU are generally heavyweight entities. The operating system must swap threads on and off CPU execution channels to provide multithreading capability. Context switches (when two threads are swapped) are therefore slow and expensive. By comparison, threads on GPUs are extremely lightweight. In a typical system, thousands of threads are queued up for work (in warps of 32 threads each). If the GPU must wait on one warp of threads, it simply begins executing work on another. Because separate registers are allocated to all active threads, no swapping of registers or other state need occur when switching among GPU threads. Resources stay allocated to each thread until it completes its execution. In short, CPU cores are designed to minimize latency for one or two threads at a time each, whereas GPUs are designed to handle a large number of concurrent, lightweight threads in order to maximize throughput.
- RAM - The host system and the device each have their own distinct attached physical memories. As the host and device memories are separated by the PCI Express (PCIe) bus, items in the host memory must occasionally be communicated across the bus to the device memory or vice versa as described in What Runs on a CUDA-Enabled Device?
These are the primary hardware differences between CPU hosts and GPU devices with respect to parallel programming. Other differences are discussed as they arise elsewhere in this document. Applications composed with these differences in mind can treat the host and device together as a cohesive heterogeneous system wherein each processing unit is leveraged to do the kind of work it does best: sequential work on the host and parallel work on the device.
Read more at: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#ixzz4ic8ch2fq
Follow us: @GPUComputing on Twitter | NVIDIA on Facebook
Now knowing that there are two different APIs for CUDAs API Libraries we have to understand the difference between the two found here: Difference Between the driver and runtime APIs
1. Difference between the driver and runtime APIs
The driver and runtime APIs are very similar and can for the most part be used interchangeably. However, there are some key differences worth noting between the two.
Complexity vs. control
The runtime API eases device code management by providing implicit initialization, context management, and module management. This leads to simpler code, but it also lacks the level of control that the driver API has.
In comparison, the driver API offers more fine-grained control, especially over contexts and module loading. Kernel launches are much more complex to implement, as the execution configuration and kernel parameters must be specified with explicit function calls. However, unlike the runtime, where all the kernels are automatically loaded during initialization and stay loaded for as long as the program runs, with the driver API it is possible to only keep the modules that are currently needed loaded, or even dynamically reload modules. The driver API is also language-independent as it only deals with cubin objects.
Context management
Context management can be done through the driver API, but is not exposed in the runtime API. Instead, the runtime API decides itself which context to use for a thread: if a context has been made current to the calling thread through the driver API, the runtime will use that, but if there is no such context, it uses a "primary context." Primary contexts are created as needed, one per device per process, are reference-counted, and are then destroyed when there are no more references to them. Within one process, all users of the runtime API will share the primary context, unless a context has been made current to each thread. The context that the runtime uses, i.e, either the current context or primary context, can be synchronized with cudaDeviceSynchronize(), and destroyed with cudaDeviceReset().
Using the runtime API with primary contexts has its tradeoffs, however. It can cause trouble for users writing plug-ins for larger software packages, for example, because if all plug-ins run in the same process, they will all share a context but will likely have no way to communicate with each other. So, if one of them calls cudaDeviceReset() after finishing all its CUDA work, the other plug-ins will fail because the context they were using was destroyed without their knowledge. To avoid this issue, CUDA clients can use the driver API to create and set the current context, and then use the runtime API to work with it. However, contexts may consume significant resources, such as device memory, extra host threads, and performance costs of context switching on the device. This runtime-driver context sharing is important when using the driver API in conjunction with libraries built on the runtime API, such as cuBLAS or cuFFT.
Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4icCoAXb7
Follow us: @GPUComputing on Twitter | NVIDIA on Facebook
Since this happens to be found in the DriverAPI
it has more flexibility of control towards the programmer but also requires more responsibility to manage where the RuntimeAPI
library does things more automatic but gives you less control.
This is apparent since you mentioned that you are working with their Kernels
but from the description of their implementation of the function
CUresult cuGraphicsSubResourceGetMappedArray ( CUarray* pArray,
CUgraphicsResource resource,
unsigned int arrayIndex,
unsigned int mipLevel )
The documentation is telling me that the first parameter that this function takes is a returned array through which a subresource of resource may be accessed. The second parameter of this function is the mapped graphics resource itself. The third parameter in which I believe is the parameter that you had in question where it is an enumerated type to a face and you then asked: When one passes such an enum, then the returned texture array will contain only data of that particular face, right? From what I gather and understand from the documentations is that this is an index value to an array
of your cube map resource.
Which can be seen from their documentation:
arrayIndex - Array index for array textures or cubemap face index as defined by CUarray_cubemap_face for cubemap textures for the subresource to access
Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4icHnwe9v
Follow us: @GPUComputing on Twitter | NVIDIA on Facebook
which happens to be an unsigned int
or an index location into the textures that make up that cube map
a typical cube map will have 6 faces
or at most 12
if both inside and outside of the cube are mapped. So if we look at a cube map as well as textures and their relationship with pseudo code we can see that:
// Texture
struct Texture {
unsigned pixelsWidth;
unsigned pixelsHeight;
// Other Texture member variables or fields here.
};
// Only interested in the actual size of the texture `width by height`
// where these would be used to map this texture to one of the 6 faces
// of a cube:
struct CubeMap {
Texture face[6];
// face[0] = frontFace
// face[1] = backFace
// face[2] = leftFace
// face[3] = rightFace
// face[4] = topFace
// face[5] = bottomFace
};
The cubemap object has an array of textures that makes up its face and according to the documents the function that you have in question with its third parameter is asking you for an index into this texture array and the overall function will return this:
Returns in *pArray an array through which the subresource of the mapped graphics resource resource which corresponds to array index arrayIndex and mipmap level mipLevel may be accessed. The value set in *pArray may change every time that resource is mapped.
Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4icKF1c00
Follow us: @GPUComputing on Twitter | NVIDIA on Facebook
I hope this helps to answer your question in regards to the use of the third parameter into the function you are trying to use from their API.
Edit
The OP had asked when passing this enum CU_CUBEMAP_FACE_POSITIVE_X
to the third parameter of the above function call will it return only that face of the cube map which happens to be a texture. When looking at their documentation about this enumerated value or type found here: enum CUarray_cubemap_face
enum CUarray_cubemap_face - Array indices for cube faces
Values
- CU_CUBEMAP_FACE_POSITIVE_X = 0x00
- Positive X face of cubemap
- CU_CUBEMAP_FACE_NEGATIVE_X = 0x01
- Negative X face of cubemap
- CU_CUBEMAP_FACE_POSITIVE_Y = 0x02
- Positive Y face of cubemap
- CU_CUBEMAP_FACE_NEGATIVE_Y = 0x03
- Negative Y face of cubemap
- CU_CUBEMAP_FACE_POSITIVE_Z = 0x04
- Positive Z face of cubemap
- CU_CUBEMAP_FACE_NEGATIVE_Z = 0x05
- Negative Z face of cubemap
Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4idOT67US
Follow us: @GPUComputing on Twitter | NVIDIA on Facebook
It appears to me that when using this method to query or get texture information that is stored into an array of a cube map, that the requirement of the third parameter being this enumerated value; is nothing more than the 0-index
into that array. So by passing in CU_CUBEMAP_FACE_POSITIVE_X
as the third parameter to me doesn't necessarily mean that you will only get back that particular face's texture. It appears to me that since this is the 0th index
that it will return the entire array of textures. The old C
style of passing around arrays as if they were pointers.