Do gcc vector extensions support variable length vectors?
This doesn't appear to work/compile:

void vec(size_t n) {
    typedef char v4si __attribute__((vector_size(n)));  /* error: vector_size must be a compile-time constant */
    v4si t = {1};
}

Is there a proper way to declare this or is it unsupported?

Bevin answered 7/1, 2021 at 1:41 Comment(11)
Since the size must be a positive power-of-two multiple of the base type's size, I don't think it is supported. The documentation says nothing about a non-constant size. -- You might consider pre-defining all the types that are needed in your application, and using a switch to select the requested one.Rufescent
What do you want to achieve here? Is writing a simple for loop and compiling with -O3 an option? Is switching to C++ an option? (You could pass n as a template parameter, if it is known at compile time)Merozoite
@Merozoite the problem is that gcc tends not to vectorize simple loops, hence the need to state it explicitly.Bevin
@thebusybee the size can be anywhere between 1 and 10 GB.Bevin
Alright, that makes for 1=2^0 to 10G=2^34 (approximately) bytes ... well, 35 different types. Not too much, if you want this. Remember, only power-of-two are allowed. -- However, the reason for such types is to use special machine code instructions to handle multiple data in parallel. But only as many as fit in a single register, AFAIK. -- I'm afraid, you are digging at the wrong place.Rufescent
@thebusybee sorry, I meant 1 byte. Also, it doesn't have to be a power of 2.Bevin
But you read GCC's documentation, didn't you? Its extension limits the size to power-of-two multiples of the base type.Rufescent
If the loop wasn't vectorized, then you need to check the log to see why it wasn't, and fix that.Insolvent
An OpenMP SIMD annotation may get the compiler to vectorize. Often #pragma omp simd is enough; if you don't know how they work, see primeurmagazine.com/repository/… . Also worth noting that SVE doesn't have to be a power-of-two length; any multiple of (IIRC) 64 is allowed… you can have vectors of 384 bits, for example.Fretwork
@Fretwork given the arrays tend to be small, I'm not sure a library call to OpenMP is the right way to go. Better to get it vectorized inline in the code.Bevin
OpenMP SIMD. There is no library call; you don't even have to link to the OpenMP runtime library (gcc/clang: -fopenmp-simd). It's basically just a cross-compiler standard for annotating loops to help the compiler vectorize. Works in most modern compilers, including GCC, clang, MSVC, ICC, Cray, Arm, xlc, etc. I suggest you read the article I linked to; it should explain things in more detail.Fretwork

No, that would make no sense. It's like trying to select uint32_t vs. uint64_t at runtime based on the value of some variable.

Manual vectorization does not work by treating the whole array as one giant SIMD vector, it works by telling the compiler exactly how to use fixed-size short vectors. If auto-vectorization doesn't work with normal arrays, this is not going to help.

To get GCC to "try harder" to auto-vectorize a loop if you don't want to do it manually, there's #pragma omp simd with gcc -fopenmp (or -fopenmp-simd), which can auto-vectorize even at -O2. Or compiling with -O3 will consider every loop as a candidate for auto-vectorization. (Also stuff on single structs; clang is generally better at finding SIMD use-cases in non-looping code than gcc, though. clang may sometimes be too aggressive and spend more time shuffling data together than it would cost to just do separate scalar work.)

But note that GCC and clang's auto-vectorization can only work if the loop trip-count can be calculated before the first iteration. It can be a runtime variable count, but an if()break; exit condition that could trigger at any time depending on data will defeat them. So e.g. they can't auto-vectorize a naive looping strlen or strchr implementation that uses while(*p++ != 0){...}. ICC can do that.

Also if you need any kind of shuffling, you'll often need to do that yourself with GNU C native vectors, or target-specific intrinsics like SSE/AVX for x86, NEON/AdvSIMD for ARM, AltiVec for Power, etc.


Some early machines apparently had SIMD that worked by giving the hardware a pointer + length and letting it "loop" in whatever chunks it wanted (maybe like how modern x86 rep movsd can actually use larger chunks in its microcode). But modern CPUs have fixed-width short-vector SIMD instructions that can for example do exactly 16 or exactly 32 bytes.

(ARM SVE is sort of part-way between, allowing forward compatibility for code to take advantage of wider vectors on future HW instead of fully baking in a vector width. It's still a fixed size you can't control, though. You still have to loop using it, and increment your pointer by the hardware's vector-width. It has masking stuff to ignore elements past the end of what you want to process so you can use it for arbitrarily short arrays, I think, and for the leftover end of an array. But for arbitrarily long arrays you still need to loop. Also, very few CPUs support SVE yet. BTW, SVE is a similar concept to SIMD in Agner Fog's ForwardCom blue-sky paper architecture, which also aims to let code take advantage of future wider hardware without recompiling or redoing manual vectorization.)

What kind of asm code-gen are you hoping to get from a runtime-variable sized "vector" when targeting a machine that has fixed-width SIMD vectors, like a choice of 16 or 32 bytes, with the choice being made as part of the instruction encoding?


Related: Are there any problems for which SIMD outperforms Cray-style vectors? contrasts typical short-vector SIMD like x86 SSE/AVX and ARM NEON / ASIMD vs. Cray with 64 or 128-element vectors (which the hardware might internally loop over, maybe doing a few in parallel), vs. earlier machines like CDC which just took pointers and lengths. Modern short-vector SIMD usually only has vector registers as wide as the ALUs, unlike Cray.

Apparently on Cray and similar machines, main memory was fast SRAM (so they didn't have caches), and the large vector registers could be loaded with strided access. Modern-style machines with caches can't do that efficiently, but modern hardware is wide enough to load or store 16, 32, or 64 contiguous bytes from a cache line.

(An earlier version of this answer claimed that Cray used pointer+length memory-to-memory SIMD, but that wasn't Cray, that was earlier ISAs.)

Leatherjacket answered 7/1, 2021 at 20:6 Comment(13)
Though there's a special case, which is initializing a variable-length array. The problem here is that I need to do it at a fixed address (which as far as I know is only covered by Cilk). Also, one of your statements is a bit wrong, as ARM now features the Scalable Vector Extension, which works the way Cray machines allowed. Unlike NEON, in that case the vector length can be anything.Bevin
Actually Peter is right; SVE doesn't quite work like that. There is a function for determining the length, and you loop through your data while incrementing a counter. In the loop you generate a mask based on the number of elements remaining (from the length and the counter), so for the last iteration the machine will mask out any leftover elements and not touch them. See developer.arm.com/documentation/101726/0210/… for an example.Fretwork
@user2284570: Are you looking for memset or wmemset, if either of those match your array's element size? GCC can inline them, or call an efficient libc implementation. Also, SVE is interesting and worth mentioning, but SVE is not like Cray vectorization or x86 rep stosd. Thanks for reminding me of it.Leatherjacket
@PeterCordes I am reimplementing memset() for my custom allocator.Bevin
@user2284570: what are you hoping to gain over calling memset? GCC and clang are generally pretty good at compiling calls to memset. Manually vectorizing could at best be good for some microarchitectures similar to the one you're tuning on, but GCC already knows memset strategies that are good on different microarchitectures for different ISAs. (At least in theory; sometimes there can be missed optimizations which you can report on GCC's bugzilla.)Leatherjacket
@user2284570: Also, GCC will generally recognize for(...) arr[i]=x; as a memset pattern (if array elements have size == 1 or x is a repeating byte pattern like 0 - Why is std::fill(0) slower than std::fill(1)?) and turn it into a memset. But if the pattern repeats at longer than 1 byte, GCC unfortunately doesn't look for wmemset, it just auto-vectorizes if it can.Leatherjacket
If there are too many basic blocks, gcc will not attempt to optimize some of them, in order to avoid taking too much compile time or memory. I am using different semantics than the C memset(), so the behavior is nearer to memset_s() but still different.Bevin
About ARM: yes, sorry, my mistake to think of it. I meant the vector extension of RISC-V with the setvl instruction.Bevin
@user2284570: RISC-V still has a MAXVL dependent on the machine. But yes, the RISC-V manual itself even mentions similarity to Cray. riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf. Interesting. I knew Patterson really didn't like the short-vector SIMD that mainstream ISAs have used in recent years, but hadn't known RISC-V had SIMD at all. But anyway, GCC's __attribute__((vector_size(n))) defines a fixed-width type, and isn't related to that RISC-V extension.Leatherjacket
@user2284570: Can't you just do some size checks outside a loop to figure out a pointer+length you can pass to memset?Leatherjacket
@PeterCordes no I can t.Bevin
@user2284570: Ok, so what data-dependent checks do need to happen inside the loop? Are you suggesting RISC-V's variable-length SIMD could help with this when memset can't? (Or some hypothetical GCC __attribute__((vector_size(n))) that you use exactly once? The example code in this question appears to be exactly identical to what you could do with memset.) Anyway, the answer to this question is a clear "no", but you could ask a different question about your real memset-like thing that can't calculate ahead of time when it should stop.Leatherjacket
@user2284570: I looked again at RISC-V. It's similar to SVE. The MAXVL is likely to be some small power of 2 since it does still use vector registers, and you do have to loop. e.g. Figure 17.1 in the v2.2 spec shows a loop that uses setvl t0, a0 every iteration, and elements_left -= t0. It does make handling of possibly-short arrays efficient, along with cleanup for trailing bytes, if you choose to leave setvl inside the loop like that. So you get branchless handling of short cases, but other than that not fundamentally different to what you can do with SSE or other fixed-vector SIMDLeatherjacket
