superscalar and VLIW
I want to ask some questions related to ILP.

  • A superscalar processor is sort of a mixture of the scalar and vector processor. So can I say that the architecture of a vector processor follows the superscalar approach?

  • Processing multiple instructions concurrently does not make an architecture superscalar, since pipelined, multiprocessor or multi-core architectures also achieve that. What does this mean?

  • I have read 'A superscalar CPU architecture implements a form of parallelism called instruction level parallelism within a single processor'. Does this mean a superscalar design can't use more than one processor? Can anyone give me an example of where superscalar processors are used?

  • VLIW: I have gone through this article; figure 4 on page 9 shows a generic VLIW implementation, without the complex reorder buffer and decoding and dispatching logic. The phrase "without decoding" is confusing me.

Regards, anas anjaria

Barbarian answered 20/4, 2011 at 14:51 Comment(0)

Check this article.

Basic difference can be seen in these pictures:

Simple processor:

[image: simple processor diagram]

Superscalar processor:

[image: superscalar processor diagram]

Gruesome answered 12/5, 2011 at 12:5 Comment(1)
Thanks @Gruesome for the reference. I was looking for this image and could not find it anywhere. Finally got it here. It's from the Kai Hwang book.Shoreless

A superscalar processor is sort of a mixture of the scalar and vector processor.

LOL, no. A superscalar core is a core that can execute more than one instruction per clock cycle.

Blanchblancha answered 20/4, 2011 at 14:59 Comment(2)
I didn't get your point. I'm a newbie preparing for my exam, so please be as descriptive as you can.Barbarian
A VLIW processor can also execute more than one instruction per clock cycle.Midkiff

A superscalar processor is sort of a mixture of the scalar and vector processor.

No, this is definitely not true.

  • A scalar processor performs computations on one piece of data at a time.
  • A superscalar can execute multiple scalar instructions at a time.
  • A VLIW can execute multiple operations at a time.
  • A vector processor can operate on a vector of data at a time.

The superscalar Haswell CPU that I'm typing this on has 8 execution ports: 4 integer operations, 2 memory reads and 2 stores. Potentially 8 x86 instructions could execute simultaneously. That's superscalar. The 8080 could only execute 1 instruction at a time. That's scalar.

Haswell is both pipelined and superscalar. It's also speculative and out-of-order. It's hyperthreaded (2 threads per core) and multi-core (2-18 cores). It's just a beast.

Instruction level parallelism (ILP) is a characteristic or measure of a program, not a CPU. A compiler's scheduler will search for ILP statically, or a CPU's scheduler will search for ILP dynamically. If they find it, they can reorder and execute instructions accordingly.

Mayberry answered 23/10, 2016 at 4:40 Comment(15)
Minor error in an otherwise solid answer: Haswell's store ports are store-address and store-data. Each store needs both. Its peak memory throughput is 64B loaded per cycle / 32B stored per cycle. It can begin execution of 8 unfused-domain uops in a single clock cycle (4 ALU, 2 load, 1 store-address, 1 store-data). Note that macro-fusion of cmp/jcc and so on means 2 x86 instructions can decode to only one unfused-domain uop (and that predicted-not-taken branches can execute on ports 0 or 6), peak throughput is I guess 7 + 2 = 9 x86 instructions.Broderick
Or plus another 4 since some instructions don't need an execution port e.g. NOP, reg-reg moves, and xor-zeroing are handled in the issue/rename stage. A group of 4 such instructions issuing in the same cycle that 7 uops execute could mean 13 x86 instructions that begin execution in a single cycle.Broderick
Of course, it makes more sense to talk about how many instructions are in the pipeline at once. 8086 did fetch (2 bytes at a time!) in parallel with execution, but code-fetch was still its major bottleneck. Haswell's ROB size is 192 entries, so ~that many x86 instructions can be in flight at once. One number I like to point out is that even though Haswell's FMA latency is 5 cycles, its throughput is two per clock, so to saturate the FMA units while summing an array, you need to use 10 vector accumulators to keep 10 FMAs in flight (in the execution units) at once.Broderick
See the x86 tag wiki for many links, including David Kanter's Haswell writeup (the slides you linked included some of his diagrams). Agner Fog's optimization guide and microarchitecture guide.Broderick
Instruction level parallelism (ILP) is a characteristic or measure of a program not a CPU: Interesting point. I guess when you want to talk about how much ILP a specific CPU was able to exploit, you should just call it IPC (instructions per cycle).Broderick
Thanks. I didn't get the address/data stuff straight. The Haswell article says that Up to eight simple μops per cycle can be executed by heterogeneous execution ports. That much I kinda understand. But does the stack engine and the renamer allow even more instructions to be executed simultaneously?Mayberry
The stack engine means that PUSH/POP don't decode to multiple uops in the first place. The rename stage is the only place xor eax,eax "executes": Just rename the architectural EAX to refer to a physical zero register (and I guess rename flags to another physical register). Same for mov eax, ecx, since I think Intel SnB-family supports having multiple architectural regs map to the same physical reg. Those instructions generate zero unfused-domain uops, so I guess they enter the ROB in an already-executed state (and unlike other instructions, never enter the RS at all).Broderick
It's kinda silly to talk about burst throughput, though. The sustained throughput is limited by the issue width: 4 fused-domain uops per clock, which can represent up to 6 x86 instructions (with two macro-fused ALU+branch instructions), and can have micro-fused memory operands. Even NOPs count against this 4 per clock limit, because they still decode to a uop. (I'm not 100% sure why they can't decode to 0 uops, but they could be a branch target... And probably adding that special case isn't worth the transistors, because running lots of NOPs isn't an important task!)Broderick
Burst throughput helps when a bunch of uops were all stuck waiting for the same thing (e.g. data from a cache miss). More burst frees up more ROB and RS space faster, allowing the CPU to start finding parallelism in upcoming instructions. But the main use for all those execution units is high sustained throughput regardless of whether the CPU is running code with lots of loads/stores to L1, or whether it's doing a lot of work on data that stays in registers. Modern CPUs are limited by power budget, so they can't actually use all their transistors at the same time.Broderick
Burst is interesting to know what sort of structural limits there are. I too don't understand why NOP isn't free after translation to a uop. It takes up space since a Way is limited to 6 uops. It takes energy. There must be an Intel reason but I don't see it. Less is less.Mayberry
BTW, don't NOPs get handled by the renamer? That is, they don't get issued? Figure 2-1 of the Intel Optimization manual puts rename right before the scheduler.Mayberry
NOPs issue and retire on SnB, but don't dispatch. See the perf counters in this answer, especially the second test where micro-fusion doesn't happen. Less is less, but spending transistors to check for and handle the special case is presumably not worth it, because real code almost never suffers from executing too many NOPs. Toolchains use single long-NOP instructions instead of multiple NOPs, so busting the uop cache with a lot of 1-byte NOPs is also only an issue in bad machine code.Broderick
If NOP handling was important, I'm pretty sure Intel could build CPUs that didn't issue / retire them. The issue stage might have to shuffle other uops to fill gaps. Not even storing them in the uop cache might be slightly tricky when a NOP is a branch target, but presumably you could have some kind of "there's a NOP here" flag that would let it work. It would mean NOPs might not generate perf counter events for instructions-executed; IDK if Intel cares about perf counter behaviour. It would also mean that a RIP=a uop would never happen when an interrupt fired. Again, that's probably fine.Broderick
NOPs issue and retire on SnB, but don't dispatch. I think not dispatching means not getting scheduled and that makes sense with the renamer being before the scheduler. FWIW, there's a talk at the LLVM Dev Meeting on NOPs from someone at Intel: Causes of Performance Instability due to Code Placement in X86.Mayberry
Right, dispatch is when the scheduler sends unfused-domain uops to execution ports. NOP is 1 fused-domain uop but 0 unfused-domain uops, so there's nothing to dispatch. It is still issued into the ROB (fused-domain), and retired (fused-domain). The main issue with NOPs for performance is the alignment of other code affecting fetch/decode and/or branch prediction, not running the NOPs themselves. The "instability" in the title tells me that's what it's prob. about.Broderick
  • Check out this first (http://en.wikipedia.org/wiki/Superscalar):

    A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU such as an arithmetic logic unit, a bit shifter, or a multiplier.

This means that, for example, a CPU with 2 (two) ALUs (arithmetic logic units) can physically issue 2 arithmetic instructions and execute them simultaneously, each in a different ALU.

Papule answered 12/5, 2011 at 11:38 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.