This doesn't appear to work or compile:

void vec(size_t n) {
    typedef char v4si __attribute__((vector_size(n)));
    v4si t = {1};
}
Is there a proper way to declare this or is it unsupported?
No, that would make no sense. It's like trying to select uint32_t vs. uint64_t at runtime based on the value of some variable.
Manual vectorization does not work by treating the whole array as one giant SIMD vector, it works by telling the compiler exactly how to use fixed-size short vectors. If auto-vectorization doesn't work with normal arrays, this is not going to help.
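For reference, the fixed-width form of the question's typedef does compile. A minimal sketch (type and function names are mine), assuming GNU C on gcc or clang:

```c
/* vector_size must be a compile-time constant; a fixed 16-byte
 * vector type compiles, unlike the runtime-n version above. */
typedef char v16qi __attribute__((vector_size(16)));  /* 16 x char */

/* Broadcast x into all 16 lanes; GNU C vectors can be
 * subscripted like arrays. */
v16qi splat16(char x) {
    v16qi v;
    for (int i = 0; i < 16; i++)
        v[i] = x;
    return v;
}
```

The compiler picks registers and instructions for the target; on x86 with SSE2 this typedef maps naturally onto one 16-byte XMM register.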
To get GCC to "try harder" to auto-vectorize a loop when you don't want to do it manually, there's #pragma omp simd (compile with gcc -fopenmp), which can enable auto-vectorization even at -O2. Or compiling with -O3 will consider every loop as a candidate for auto-vectorization. (Also stuff on single structs; clang is generally better at finding SIMD use-cases in non-looping code than gcc, though. clang may sometimes be too aggressive and spend more time shuffling data together than it would cost to just do separate scalar work.)
But note that GCC and clang's auto-vectorization can only work if the loop trip-count can be calculated before the first iteration. It can be a runtime-variable count, but an if()break; exit condition that could trigger at any time depending on data will defeat them. So e.g. they can't auto-vectorize a naive looping strlen or strchr implementation that uses while(*p++ != 0){...}. ICC can do that.
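To illustrate the difference (helper names are mine), a countable loop versus a data-dependent exit:

```c
#include <stddef.h>

/* Trip count n is known before the first iteration, so gcc/clang
 * can auto-vectorize this at -O3 (or -O2 with #pragma omp simd). */
void fill_bytes(char *arr, size_t n, char x) {
    for (size_t i = 0; i < n; i++)
        arr[i] = x;
}

/* The exit condition depends on the data itself: gcc/clang's
 * vectorizers give up on this naive strlen, though ICC could
 * vectorize it. */
size_t naive_strlen(const char *p) {
    size_t len = 0;
    while (p[len] != 0)
        len++;
    return len;
}
```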
Also if you need any kind of shuffling, you'll often need to do that yourself with GNU C native vectors, or target-specific intrinsics like SSE/AVX for x86, NEON/AdvSIMD for ARM, AltiVec for Power, etc.
Some early machines apparently had SIMD that worked by giving the hardware a pointer + length and letting it "loop" in whatever chunks it wanted (maybe like how modern x86 rep movsd can actually use larger chunks in its microcode). But modern CPUs have fixed-width short-vector SIMD instructions that can, for example, do exactly 16 or exactly 32 bytes.
(ARM SVE is sort of part-way between, allowing forward compatibility for code to take advantage of wider vectors on future HW instead of fully baking in a vector width. It's still a fixed size you can't control, though. You still have to loop using it, and increment your pointer by the hardware's vector-width. It has masking stuff to ignore elements past the end of what you want to process so you can use it for arbitrarily short arrays, I think, and for the leftover end of an array. But for arbitrarily long arrays you still need to loop. Also, very few CPUs support SVE yet. BTW, SVE is a similar concept to SIMD in Agner Fog's ForwardCom blue-sky paper architecture, which also aims to let code take advantage of future wider hardware without recompiling or redoing manual vectorization.)
What kind of asm code-gen are you hoping to get from a runtime-variable sized "vector" when targeting a machine that has fixed-width SIMD vectors, like a choice of 16 or 32 bytes, with the choice being made as part of the instruction encoding?
Related: Are there any problems for which SIMD outperforms Cray-style vectors? contrasts typical short-vector SIMD like x86 SSE/AVX and ARM NEON / ASIMD vs. Cray with 64 or 128-element vectors (which the hardware might internally loop over, maybe doing a few in parallel), vs. earlier machines like CDC which just took pointers and lengths. Modern short-vector SIMD usually only has vector registers as wide as the ALUs, unlike Cray.
Apparently on Cray and similar machines, main memory was fast SRAM (so they didn't have caches), and the large vector registers could be loaded with strided access. Modern-style machines with caches can't do that efficiently, but modern hardware is wide enough to load or store 16, 32, or 64 contiguous bytes from a cache line.
(An earlier version of this answer claimed that Cray used pointer+length memory-to-memory SIMD, but that wasn't Cray, that was earlier ISAs.)
memset or wmemset, if either of those match your array's element size? GCC can inline them, or call an efficient libc implementation. Also, SVE is interesting and worth mentioning, but SVE is not like Cray vectorization or x86 rep stosd. Thanks for reminding me of it. – Leatherjacket

memset() for my custom allocator. – Bevin

memset? GCC and clang are generally pretty good at compiling calls to memset. Manually vectorizing could at best be good for some microarchitectures similar to the one you're tuning on, but GCC already knows memset strategies that are good on different microarchitectures for different ISAs. (At least in theory; sometimes there can be missed optimizations, which you can report on GCC's bugzilla.) – Leatherjacket

for(...) arr[i]=x; as a memset pattern (if array elements have size == 1 or x is a repeating byte pattern like 0 – Why is std::fill(0) slower than std::fill(1)?) and turn it into a memset. But if the pattern repeats at longer than 1 byte, GCC unfortunately doesn't look for wmemset; it just auto-vectorizes if it can. – Leatherjacket

memset() so the behavior is nearer than memset_s() but still different. – Bevin

setvl instruction. – Bevin

__attribute__((vector_size(n))) defines a fixed-width type, and isn't related to that RISC-V extension. – Leatherjacket

__attribute__((vector_size(n))) that you use exactly once? (The example code in this question appears to be exactly identical to what you could do with memset.) Anyway, the answer to this question is a clear "no", but you could ask a different question about your real memset-like thing that can't calculate ahead of time when it should stop. – Leatherjacket

setvl t0, a0 every iteration, and elements_left -= t0. It does make handling of possibly-short arrays efficient, along with cleanup for trailing bytes, if you choose to leave setvl inside the loop like that. So you get branchless handling of short cases, but other than that it's not fundamentally different from what you can do with SSE or other fixed-width SIMD. – Leatherjacket
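The strip-mining pattern that setvl comment describes can be sketched in plain C (VL and the function name are stand-ins; real RISC-V V or SVE hardware picks the chunk size via setvl or predication rather than a scalar min):

```c
#include <stddef.h>

#define VL 16  /* stand-in for the hardware's vector length */

/* Each pass handles min(VL, remaining) elements, so short arrays
 * and the final partial chunk go through the same loop, analogous
 * to "setvl t0, a0" followed by "elements_left -= t0". */
void memset_stripmine(char *p, char x, size_t n) {
    while (n > 0) {
        size_t chunk = n < VL ? n : VL;  /* setvl analogue */
        for (size_t i = 0; i < chunk; i++)
            p[i] = x;
        p += chunk;
        n -= chunk;
    }
}
```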
Is -O3 an option? Is switching to C++ an option? (You could pass n as a template parameter, if it is known at compile time.) – Merozoite

#pragma omp simd is enough; if you don't know how they work, see primeurmagazine.com/repository/… . Also worth noting that SVE doesn't have to be a power-of-two length; any multiple of (IIRC) 64 is allowed… you can have vectors of 384 bits, for example. – Fretwork