Reading from an unaligned uint8_t recast as a uint32_t array - not getting all values
Asked Answered
M

3

2

I am trying to cast a uint8_t array to uint32_t array. However, when i try to do this, I cant seem to be able to access every consecutive 4 bytes.

Let us say I have a uint8_t array with 8 bytes. I would like to access byte 2 -> 6 as one uint32_t.

These all get the same value *((uint32_t*)&uint8Array[0]), *((uint32_t*)&uint8Array[1]), *((uint32_t*)&uint8Array[2]), *((uint32_t*)&uint8Array[3])

While *((uint32_t*)&uint8Array[4]) gets the bytes 4 -> 8 as expected.

So it seem like I can not access 4 consecutive bytes from any address?

Is there any way that I can do this?

Mccormac answered 22/10, 2016 at 15:46 Comment(2)
You can't do this. You can only access properly-aligned words. (Some processors might let you get away with it, but it's extra work on the processor's work, and some processors don't, and the C language does not require them to.) If you want to do an unaligned access portably, you have to do it "by hand", as in @DietrichEpp's answer.Henri
CUDA has very well documented alignment requirements, and all memory transactions have to be transaction sized alignedRosemonde
N
1

If you want bytes 2..6, you're going to have to combine multiple aligned loads to get what you want.

uint32_t *ptr = ...;
uint32_t value = (ptr[0] >> 16) | (ptr[1] << 16);

Technically, this is also the portable way to do things in C in general, but we're all spoiled because you don't have to do the extra work on x86, ARM, Power, or other common architectures.

Neese answered 22/10, 2016 at 15:49 Comment(11)
Reinterpreting type uint8_t as a non-compatible type is undefined behavior.Dose
@2501: If this were C, that would be the case if you use a very narrow reading of the C standard, but this is not C. It looks very much like C so I can understand why you would think so.Neese
If cuda doesn't use the c language, which language does it use and where can I find the specification?Dose
@2501: I don't know where to find a CUDA specification.Neese
Then how do you know it doesn't follow strict aliasing; where do you draw this information from?Dose
Maybe I should present evidence since I'm making the claim. Cuda uses the C++ language without some features. Unless one of those missing features is lack of aliasing rules, which are the same in C and C++ for the purposes of this case, the behavior is undefined.Dose
@2501: Please either open a new question or address your comments to the OP.Neese
@2501: It's particularly annoying from my perspective to have these comments attached to my answer, because I was careful to remove type punning from the code I included in the answer, and here you are complaining to me about somebody else's code.Neese
According to the official documentation that lists missing features, aliasing rules are not missing: docs.nvidia.com/cuda/cuda-c-programming-guide/…Dose
@2501: Again, please open a new question or address your comments to the OP. This is not the appropriate place to have a discussion about aliasing in CUDA.Neese
Let us continue this discussion in chat.Neese
T
8

While unaligned accesses are not allowed in CUDA, the prmt PTX instruction has a handy mode to emulate the effect of unaligned reads within registers. This can be exposed with a bit of inline PTX assembly. If you can tolerate a read past the end of the array, the code becomes quite simple:

// WARNING! Reads past ptr!
__device__ uint32_t read_unaligned(void* ptr)
{
    uint32_t result;
    asm("{\n\t"
        "   .reg .b64    aligned_ptr;\n\t"
        "   .reg .b32    low, high, alignment;\n\t"
        "   and.b64      aligned_ptr, %1, 0xfffffffffffffffc;\n\t"
        "   ld.u32       low, [aligned_ptr];\n\t"
        "   ld.u32       high, [aligned_ptr+4];\n\t"
        "   cvt.u32.u64  alignment, %1;\n\t"
        "   prmt.b32.f4e %0, low, high, alignment;\n\t"
        "}"
        : "=r"(result) : "l"(ptr));
    return result;
}

To ensure the access past the end of the array remains harmless, round up the number of allocated byte to a multiple of 4, and add another 4 bytes.

Above device code has the same effect as the following code on a little-endian host that tolerates unaligned accesses:

__host__ uint32_t read_unaligned_host(void* ptr)
{
    return *(uint32_t*)ptr;
}
Turgor answered 23/10, 2016 at 1:1 Comment(4)
Aren't you assuming the address is less than 2^32 - 1 ?Kilbride
Also, please have a look at my answer.Kilbride
Why do you think the code would fail for addresses >= 2^32? Note that only the lowest two bits of alignment are relevant, so a 32 bit register is more than sufficient.Turgor
Also this code is very conservative explicitly masking out the two lowest bits of the address. I don't think any Nvidia hardware currently in existence requires that.Turgor
N
1

If you want bytes 2..6, you're going to have to combine multiple aligned loads to get what you want.

uint32_t *ptr = ...;
uint32_t value = (ptr[0] >> 16) | (ptr[1] << 16);

Technically, this is also the portable way to do things in C in general, but we're all spoiled because you don't have to do the extra work on x86, ARM, Power, or other common architectures.

Neese answered 22/10, 2016 at 15:49 Comment(11)
Reinterpreting type uint8_t as a non-compatible type is undefined behavior.Dose
@2501: If this were C, that would be the case if you use a very narrow reading of the C standard, but this is not C. It looks very much like C so I can understand why you would think so.Neese
If cuda doesn't use the c language, which language does it use and where can I find the specification?Dose
@2501: I don't know where to find a CUDA specification.Neese
Then how do you know it doesn't follow strict aliasing; where do you draw this information from?Dose
Maybe I should present evidence since I'm making the claim. Cuda uses the C++ language without some features. Unless one of those missing features is lack of aliasing rules, which are the same in C and C++ for the purposes of this case, the behavior is undefined.Dose
@2501: Please either open a new question or address your comments to the OP.Neese
@2501: It's particularly annoying from my perspective to have these comments attached to my answer, because I was careful to remove type punning from the code I included in the answer, and here you are complaining to me about somebody else's code.Neese
According to the official documentation that lists missing features, aliasing rules are not missing: docs.nvidia.com/cuda/cuda-c-programming-guide/…Dose
@2501: Again, please open a new question or address your comments to the OP. This is not the appropriate place to have a discussion about aliasing in CUDA.Neese
Let us continue this discussion in chat.Neese
K
0

As @DietrichEpp suggests, you'll have to make two loads; and as @tera suggests, you can combine these two loads generically for cheap even when the misalignment is not known in advance (i.e. when the initial address of uint8Array is arbitrary) using the prmt PTX instruction.

I'll offer a solution based on @tera's which will let you do:

value = read_unaligned(&uint8Array[offset]);

safely and (relatively) efficiently. Also, it will only have one inline PTX assembly instruction, and an "unsafe" variant if you need it:

#include <cstdint>
#include <cuda_runtime_api.h>

__device__ __forceinline__ uint32_t prmt_forward_4_extract(
    uint32_t first_word,
    uint32_t second_word, 
    uint32_t control_bits)
{
    uint32_t result;
    asm("prmt.b32.f4e %0, %1, %2, %3;"
        : "=r"(result)
        : "r"(first_word), "r"(second_word), "r"(control_bits) );
    return result;
}

/*
 * This unsafe, faster variant may read past the 32-bit naturally-aligned
 * word containing the last relevant byte
 */
__device__ inline uint32_t read_unaligned_unsafe(const uint32_t* __restrict__ ptr)
{
    /*
     *  Clear the bottom 2 bits of the address, making the result aligned 
     *  for the purposes of reading a 32-bit (= 4-byte) value
     */
    auto aligned_ptr  = (uint32_t*) ((uint64_t) ptr & ~((uint64_t) 0x3));
    auto first_value  = *aligned_ptr;
    auto second_value = *(aligned_ptr + 1);

    auto lower_word_of_ptr = (uint32_t)((uint64_t)(ptr));

    return prmt_forward_4_extract(first_value, second_value, lower_word_of_ptr);
}

__device__ inline uint32_t read_unaligned(const uint32_t* __restrict__ ptr)
{
    auto ptr_is_already_aligned = ((uint64_t)(ptr) & 0x3 == 0);
    if (ptr_is_already_aligned) { return *ptr; }
    return read_unaligned_unsafe(ptr);
}
Kilbride answered 8/12, 2017 at 15:43 Comment(2)
This looks wrong to me - you need to to pass (the lowest two bits of) the unaligned pointer for the control bits. Why don't you test your code when you've already compiled it?Turgor
@tera: Indeed, I needed to pass the lower bits of the unaligned pointer. Fixed and tested.Kilbride

© 2022 - 2024 — McMap. All rights reserved.