Intel's Intrinsics Guide lists the intrinsic _mm256_loadu_epi32:
__m256i _mm256_loadu_epi32 (void const* mem_addr);
/*
Instruction: vmovdqu32 ymm, m256
CPUID Flags: AVX512VL + AVX512F
Description
Load 256-bits (composed of 8 packed 32-bit integers) from memory into dst.
mem_addr does not need to be aligned on any particular boundary.
Operation
dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
*/
But clang and gcc do not provide this intrinsic. Instead they provide (in file avx512vlintrin.h) only the masked versions
__m256i _mm256_mask_loadu_epi32 (__m256i, __mmask8, void const *);
__m256i _mm256_maskz_loadu_epi32 (__mmask8, void const *);
which boil down to the same instruction, vmovdqu32. My question: how can I emulate _mm256_loadu_epi32:
inline __m256i _mm256_loadu_epi32(void const* mem_addr)
{
/* code that uses vmovdqu32 and compiles with gcc */
}
without writing assembly, i.e. using only the available intrinsics?
_mm256_loadu_si256. – Represent
_mm256_maskz_loadu_epi32(0xffu,ptr)? Would you promote this comment to an answer? – Rusel
vmovdqu. Related: What is the difference between _mm512_load_epi32 and _mm512_load_si512? – Lempres
With _mm256_loadu_si256 you need to cast the input pointer to const __m256i* (so it is not a bad idea to encapsulate that in an inlined function). – Sideburns
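The suggestions in the comments can be combined into a small wrapper. The following is only a sketch under a few assumptions: the names my_mm256_loadu_epi32, my_mm256_loadu_epi32_maskz and copy8_epi32 are hypothetical (picked so they cannot clash with a compiler that does ship _mm256_loadu_epi32), and the target attribute is a GCC/clang extension used here so the file builds without -mavx. Since the load is unmasked, the loaded bits are the same whichever unaligned-move encoding the compiler picks; the wrapper does not guarantee that vmovdqu32 in particular is emitted.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical wrapper: unaligned 256-bit load via _mm256_loadu_si256.
   The pointer cast is what the intrinsic's signature requires. */
static inline __attribute__((target("avx")))
__m256i my_mm256_loadu_epi32(void const *mem_addr)
{
    return _mm256_loadu_si256((const __m256i *)mem_addr);
}

#if defined(__AVX512F__) && defined(__AVX512VL__)
/* Alternative from the comments: the zero-masking load with all eight
   lanes enabled. Only compiled when AVX512VL is enabled at build time. */
static inline __m256i my_mm256_loadu_epi32_maskz(void const *mem_addr)
{
    return _mm256_maskz_loadu_epi32((__mmask8)0xff, mem_addr);
}
#endif

/* Hypothetical demo helper: copy eight 32-bit integers through a vector
   register. Neither pointer needs 32-byte alignment. */
static inline __attribute__((target("avx")))
void copy8_epi32(int32_t *dst, const int32_t *src)
{
    __m256i v = my_mm256_loadu_epi32(src);
    _mm256_storeu_si256((__m256i *)dst, v);
}
```

Calling copy8_epi32(out, buf + 1) on an int32_t buffer exercises the unaligned path, because buf + 1 is only guaranteed to be 4-byte aligned. Running it requires a CPU with AVX support.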