Are there any problems for which SIMD outperforms Cray-style vectors?

Asked 29/5, 2022 at 9:35 Answered 3/6, 2022 at 17:23

Solved vectorization cpu-architecture simd instruction-set

CPUs intended to provide high-performance number crunching end up with some kind of vector instruction set. There are basically two kinds:

1. SIMD. This is conceptually straightforward, e.g. instead of just having a set of 64-bit registers and operations thereon, you have a second set of 128-bit registers and you can operate on a short vector of two 64-bit values at the same time. It becomes complicated in the implementation because you also want to have the option of operating on four 32-bit values, and then a new CPU generation provides 256-bit vectors, which requires a whole new set of instructions, and so on.
2. The older Cray-style vector instructions, where the vectors start off large, e.g. 4096 bits, but the number of elements operated on simultaneously is transparent, and the number of elements you want to use in a given operation is an instruction parameter. The idea is that you bite off a little more complexity upfront in order to avoid creeping complexity later.
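To make the contrast between the two options concrete, here is a minimal sketch in portable C that only models the two loop shapes; it does not use real vector instructions. `MAX_VL`, `set_vl` and `vadd` are hypothetical stand-ins (not from any real ISA) for a hardware maximum vector length, a "set vector length" instruction, and a vector add that takes the element count as a parameter.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VL 64  /* hypothetical hardware maximum vector length, in elements */

/* Hypothetical stand-in for a "set vector length" instruction:
   the hardware grants min(requested, MAX_VL) elements. */
static size_t set_vl(size_t requested) {
    return requested < MAX_VL ? requested : MAX_VL;
}

/* Hypothetical stand-in for one vector add operating on vl elements. */
static void vadd(double *dst, const double *a, const double *b, size_t vl) {
    assert(vl <= MAX_VL);
    for (size_t i = 0; i < vl; i++) dst[i] = a[i] + b[i];
}

/* Option 2: strip-mined loop. The element count is a parameter of each
   operation, so there is no separate tail and no width baked into the code. */
void add_vector_style(double *dst, const double *a, const double *b, size_t n) {
    while (n > 0) {
        size_t vl = set_vl(n);
        vadd(dst, a, b, vl);
        dst += vl; a += vl; b += vl; n -= vl;
    }
}

/* Option 1: fixed 2-wide SIMD. The width is hard-coded in the loop body,
   and a scalar tail handles the leftover element. */
void add_simd_style(double *dst, const double *a, const double *b, size_t n) {
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {          /* models one 128-bit op on two doubles */
        dst[i]     = a[i]     + b[i];
        dst[i + 1] = a[i + 1] + b[i + 1];
    }
    for (; i < n; i++) dst[i] = a[i] + b[i];
}
```

Changing `MAX_VL` leaves `add_vector_style` untouched, whereas widening the SIMD version means rewriting its loop body; that is the forward-compatibility argument in miniature.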

It has been argued that option 2 is better, and the arguments seem to make sense, e.g. https://www.sigarch.org/simd-instructions-considered-harmful/

At least at first glance, it looks like option 2 can do everything option 1 can, more easily and generally better.

Are there any workloads where the reverse is true? Where SIMD instructions can do things Cray-style vectors cannot, or can do something faster or with less code?

Cawnpore answered 29/5, 2022 at 9:35 Comment(3)

"a new CPU generation provides 256-bit vectors which requires a whole new set of instructions" Instruction sets are very stable, a new CPU does not require new software, old software just works. AVX512 is dead except very limited use in HPC servers, we're stuck with 32-bytes vectors on PCs and 16-bytes vectors on mobile for a decade now. – Tomkins 31/5, 2022 at 8:14

"the number of elements you want to use in a given operation is an instruction parameter" Who specifically is going to set that parameter for every instruction? If it's the programmer, then I think you're back to square 1 with fixed-length SIMD, because the code across the project needs to agree on the width to pass data in vector registers. Being able to pass vectors by value is one of the sources of the performance win: memcpy doesn't compute anything yet all modern implementations are using at least SSE2 SIMD. – Tomkins 31/5, 2022 at 8:18

If the vector width is set automatically by a [JIT] compiler, that's only going to work if someone writes a sufficiently smart compiler. This is not going to happen; look at Itanium or current automatic vectorizers. – Tomkins 31/5, 2022 at 8:18

The "traditional" vector approaches (Cray, CDC/ETA, NEC, etc) arose in an era (~1976 to ~1992) with limited transistor budgets and commercially available low-latency SRAM main memories. In this technology regime, processors did not have the transistor budget to implement the full scoreboarding and interlocking for out-of-order operations that is currently available to allow pipelining of multi-cycle floating-point operations. Instead, a vector instruction set was created. Vector arithmetic instructions guaranteed that successive operations within the vector were independent and could be pipelined. It was relatively easy to extend the hardware to allow multiple vector operations in parallel, since the dependency checking only needed to be done "per vector" instead of "per element".

The Cray ISA was RISC-like in that data was loaded from memory into vector registers, arithmetic was performed register-to-register, then results were stored from vector registers back to memory. The maximum vector length was initially 64 elements, later 128 elements.

The CDC/ETA systems used a "memory-to-memory" architecture, with arithmetic instructions specifying memory locations for all inputs and outputs, along with a vector length of 1 to 65535 elements.

None of the "traditional" vector machines used data caches for vector operations, so performance was limited by the rate at which data could be loaded from memory. The SRAM main memories were a major fraction of the cost of the systems. In the early 1990's SRAM cost/bit was only about 2x that of DRAM, but DRAM prices dropped so rapidly that by 2002 SRAM price/MiB was 75x that of DRAM -- no longer even remotely acceptable.

The SRAM memories of the traditional machines were word-addressable (64-bit words) and were very heavily banked to allow nearly full speed for linear, strided (as long as powers of two were avoided), and random accesses. This led to a programming style that made extensive use of non-unit-stride memory access patterns. These access patterns cause performance problems on cached machines, and over time developers using cached systems quit using them -- so codes were less able to exploit this capability of the vector systems. As codes were being re-written to use cached systems, it slowly became clear that caches work quite well for the majority of the applications that had been running on the vector machines. Re-use of cached data decreased the amount of memory bandwidth required, so applications ran much better on the microprocessor-based systems than expected from the main memory bandwidth ratios.
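The remark about avoiding power-of-two strides can be illustrated with a little arithmetic. The sketch below assumes the simplest possible interleaving, where a word's bank is its word address modulo the number of banks; that mapping and the 64-bank figure used afterwards are assumptions for illustration, not details of any particular machine. Under that assumption, a walk with a given stride touches `num_banks / gcd(stride, num_banks)` distinct banks.

```c
#include <assert.h>

/* Greatest common divisor (Euclid's algorithm). */
static unsigned gcd_u(unsigned a, unsigned b) {
    while (b != 0) { unsigned t = a % b; a = b; b = t; }
    return a;
}

/* Number of distinct banks touched by a walk with the given stride (in words),
   ASSUMING bank = word_address % num_banks (simple low-order interleaving). */
unsigned banks_touched(unsigned num_banks, unsigned stride) {
    assert(num_banks > 0);
    return num_banks / gcd_u(stride % num_banks, num_banks);
}
```

With a hypothetical 64 banks, strides of 1 or 7 spread accesses over all 64 banks, a stride of 8 uses only 8 of them, and a stride of 64 hits the same bank every time.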

By the late 1990's, the market for traditional vector machines was nearly gone, with workloads transitioned primarily to shared-memory machines using RISC processors and multi-level cache hierarchies. A few government-subsidized vector systems were developed (especially in Japan), but these had little impact on high performance computing, and none on computing in general.

The story is not over -- after many not-very-successful tries (by several vendors) at getting vectors and caches to work well together, NEC has developed a very interesting system (NEC SX-Aurora Tsubasa) that combines a multicore vector register processor design with DRAM (HBM) main memory, and an effective shared cache. I especially like the ability to generate over 300 GB/s of memory bandwidth using a single thread of execution -- this is 10x-25x the bandwidth available with a single thread with AMD or Intel processors.

So the answer is that the low cost of microprocessors with cached memory drove vector machines out of the marketplace even before SIMD was included. SIMD had clear advantages for certain specialized operations, and has become more general over time -- albeit with diminishing benefits as the SIMD width is increased. The vector approach is not dead in an architectural sense (e.g., the NEC Vector Engine), but its advantages are generally considered to be overwhelmed by the disadvantages of software incompatibility with the dominant architectural model.

Miscellany answered 3/6, 2022 at 17:23 Comment(0)

Cray-style vectors are great for pure-vertical problems, the kind of problem that some people think SIMD is limited to. They make your code forward compatible with future CPUs with wider vectors.

I've never worked with Cray-style vectors, so I don't know how much scope there might be for getting them to do horizontal shuffles.

If you don't limit things to Cray specifically, modern instruction-sets like ARM SVE and RISC-V extension V also give you forward-compatible code with variable vector width, and are clearly designed to avoid that problem of short-fixed-vector SIMD ISAs like AVX2 and AVX-512, and ARM NEON.

I think they have some shuffling capability. Definitely masking, but I'm not familiar enough with them to know if they can do stuff like left-pack (see "AVX2 what is the most efficient way to pack left based on a mask?") or prefix-sum (see "parallel prefix (cumulative) sum with SSE").
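For readers who haven't seen a horizontal operation on fixed-width SIMD, here is a minimal sketch of an in-register inclusive prefix sum over four 32-bit lanes using SSE2 intrinsics. It is not taken from the linked Q&A; the function name `prefix_sum4` is made up for illustration, and a scalar fallback is included for targets without SSE2.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* In-place inclusive prefix sum of exactly four 32-bit integers. */
void prefix_sum4(int32_t v[4]) {
    assert(v != 0);
#if defined(__SSE2__)
    __m128i x = _mm_loadu_si128((const __m128i *)v);
    /* Byte-shift toward higher lanes by one element and add:
       [a, b, c, d] -> [a, a+b, b+c, c+d] */
    x = _mm_add_epi32(x, _mm_slli_si128(x, 4));
    /* Shift by two elements and add: every lane now holds its full prefix. */
    x = _mm_add_epi32(x, _mm_slli_si128(x, 8));
    _mm_storeu_si128((__m128i *)v, x);
#else
    for (int i = 1; i < 4; i++) v[i] += v[i - 1];   /* scalar fallback */
#endif
}
```

Note that the two shift-and-add steps are log2(4) steps hard-coded for a 4-lane vector. That dependence on a known width is exactly what is awkward to express when the vector length is not fixed at compile time.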

And then there are problems where you're working with a small fixed amount of data at a time, but more than fits in an integer register. For example, "How to convert a binary integer number to a hex string?", although that's still basically doing the same stuff to every element after some initial broadcasting.

But other stuff like "Most insanely fastest way to convert 9 char digits into an int or unsigned int", where a one-off custom shuffle and horizontal pairwise multiply can get just the right work done with a few single-uop instructions, is something that requires tight integration between SIMD and integer parts of the core (as on x86 CPUs) for maximum performance: you use the SIMD part for what it's good at, then get the low two 32-bit elements of a vector into an integer register for the rest of the work. Part of the Cray model is (I think) a looser coupling to the CPU pipeline; that would defeat use-cases like that. Although some 32-bit ARM CPUs with NEON have the same loose coupling, where mov from vector to integer is slow.

Parsing text in general, and atoi, is one use-case where short vectors with shuffle capabilities are effective, e.g. https://www.phoronix.com/scan.php?page=article&item=simdjson-avx-512&num=1 - 25% to 40% speedup from AVX-512 with simdjson 2.0 for parsing JSON, over the already-fast performance of AVX2 SIMD. (See "How to implement atoi using SIMD?" for a Q&A about using SIMD for JSON back in 2016).

Many of those tricks depend on x86-specific pmovmskb eax, xmm0 for getting an integer bitmap of a vector compare result. You can test if it's all zero or all-1 (cmp eax, 0xffff) to stay in the main loop of a memcmp or memchr loop for example. And if not then bsf eax,eax to find the position of the first difference, possibly after a not.
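The compare, movemask, bit-scan pattern described above can be sketched with SSE2 intrinsics roughly as follows. This is an illustrative memchr-like loop, not the code of any real libc; `find_byte` is a made-up name, `__builtin_ctz` is a GCC/Clang builtin (it typically compiles to bsf or tzcnt), and the SSE2 path is skipped on targets that don't define `__SSE2__`.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Returns the index of the first byte equal to c in buf[0..n), or n if absent. */
size_t find_byte(const unsigned char *buf, size_t n, unsigned char c) {
    assert(buf != NULL || n == 0);
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i needle = _mm_set1_epi8((char)c);       /* broadcast c to 16 lanes */
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i eq = _mm_cmpeq_epi8(chunk, needle);       /* 0xFF where bytes match */
        unsigned mask = (unsigned)_mm_movemask_epi8(eq);  /* pmovmskb: 16-bit bitmap */
        if (mask != 0)                                    /* stay in the loop while zero */
            return i + (size_t)__builtin_ctz(mask);       /* lowest set bit = first match */
    }
#endif
    for (; i < n; i++)                                    /* scalar tail (or fallback) */
        if (buf[i] == c) return i;
    return n;
}
```

The whole compare result fits in the low 16 bits of one integer register, which is what lets a single test-and-branch keep the loop going and a single bit-scan locate the match.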

Having vector width limited to a number of elements that can fit in an integer register is key to this, although you could imagine an instruction-set with compare-into-mask with scalable width mask registers. (Perhaps ARM SVE is already like that? I'm not sure.)

Endosperm answered 29/5, 2022 at 9:57 Comment(0)
