Why does increased pipeline depth not always mean increased throughput?

This is perhaps more of a discussion question, but I thought stackoverflow could be the right place to ask it. I am studying the concept of instruction pipelining. I have been taught that a pipeline's instruction throughput increases as the number of pipeline stages increases, but in some cases the throughput might not change. Under what conditions does this happen? I am thinking stalling and branching could be the answer, but I wonder if I am missing something crucial.

Focus answered 8/4, 2010 at 2:26 Comment(1)
Thanks for the answers. Just for your information, another thing that comes to mind: even if we increase the number of pipeline stages, hoping to break the original stage logic into smaller sub-networks, an instruction might not be able to move through those smaller networks any faster, because its work naturally decomposes only into the original stage logic; in that case the extra stages would not affect throughput. – Focus

Throughput can be stalled by other instructions when waiting for a result, or on cache misses. Pipelining doesn't by itself guarantee that the operations are totally independent. Here is a great presentation about the intricacies of the x86 Intel/AMD architecture: http://www.infoq.com/presentations/click-crash-course-modern-hardware

It explains stuff like this in great detail, and covers some ways to further improve throughput and hide latency. JustJeff mentioned out-of-order execution, for one; you also have shadow registers not exposed in the programmer-visible model (more than 8 registers on x86), and you have branch prediction.
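To make the "waiting for a result" point concrete, here is a minimal C sketch (my own illustration with hypothetical function names, not something from the answer or the linked talk): the first loop is one long dependency chain, so each add has to wait for the previous one regardless of pipeline depth, while the second keeps four independent partial sums and gives the pipeline independent work to overlap.

/* Dependent chain vs. independent chains when summing an array.
   Compile at a low optimization level (e.g. -O1) if you want to observe
   the effect yourself; at -O3 the compiler may do this transformation
   for you. */
#include <stdio.h>
#include <stddef.h>

double sum_serial(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];                /* each add depends on the previous sum */
    return s;
}

double sum_unrolled(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {  /* four independent dependency chains */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)            /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}

int main(void) {
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("%f %f\n", sum_serial(a, 8), sum_unrolled(a, 8));  /* 36 36 */
    return 0;
}

(The unrolled version reassociates floating-point adds, which is fine for an illustration but not bit-identical in general.)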

Cariole answered 14/4, 2010 at 21:58 Comment(0)

Agreed. The biggest problems are stalls (waiting for results from previous instructions), and incorrect branch prediction. If your pipeline is 20 stages deep, and you stall waiting for the results of a condition or operation, you're going to wait longer than if your pipeline was only 5 stages. If you predict the wrong branch, you have to flush 20 instructions out of the pipeline, as opposed to 5.
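A rough back-of-the-envelope calculation (my own illustrative numbers, not from this answer) shows why the flush hurts more as the pipeline gets deeper. If a fraction $p$ of instructions is a mispredicted branch and each misprediction throws away roughly $N$ cycles of pipeline work, the effective cycles per instruction are about

\[
\mathrm{CPI}_{\text{eff}} \approx 1 + p \cdot N .
\]

With $p = 0.05$, a 5-stage flush gives $\mathrm{CPI}_{\text{eff}} \approx 1.25$, while a 20-stage flush gives $\mathrm{CPI}_{\text{eff}} \approx 2.0$, so the same misprediction rate costs the deeper pipeline far more of its theoretical throughput.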

Presumably you could also have a deep pipeline where multiple stages are attempting to access the same hardware (ALU, etc.), which would cause a performance hit, though hopefully you throw in enough additional units to support each stage.

Frank answered 8/4, 2010 at 3:30 Comment(1)
That's not 20 instructions, but 20 cycles' worth of instructions. On a heavily superscalar CPU, that may be MUCH more. – Kuomintang

Instruction level parallelism has diminishing returns. In particular, data dependencies between instructions determine the possible parallelism.

Consider the case of Read after Write (known as RAW in textbooks).

Using the syntax where the first operand receives the result, consider this example.

10: add r1, r2, r3
20: add r1, r1, r1

The result of line 10 must be known by the time the computation of line 20 begins. Data forwarding mitigates this problem, but only up to the point where the result actually becomes available.
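A hedged C sketch of the same idea (my own example and function name, not the answer's): a loop-carried RAW dependence, where each iteration needs the value produced by the previous one, so the multiply-adds cannot overlap no matter how deeply the execution units are pipelined.

/* Horner evaluation of a polynomial: the next iteration reads the acc
   written by this one (a RAW hazard carried around the loop), so the
   loop's speed is set by the latency of one multiply-add, not by the
   pipeline's depth or issue width. */
#include <stdio.h>

double horner(const double *coef, int n, double x) {
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc = acc * x + coef[i];   /* RAW: depends on the previous acc */
    return acc;
}

int main(void) {
    double c[4] = {1.0, -2.0, 3.0, 4.0};   /* x^3 - 2x^2 + 3x + 4 */
    printf("%f\n", horner(c, 4, 2.0));     /* prints 10.000000 */
    return 0;
}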

Sessions answered 14/4, 2010 at 22:49 Comment(0)

I would also think that subdividing the pipeline more finely than the execution time of the longest instruction in a series allows would not cause an increase in performance. I do think that stalling and branching are the fundamental issues, though.
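One standard way to make the diminishing return concrete (textbook pipelining arithmetic, not something from this answer): if the total logic delay $T_{\text{logic}}$ is split across $N$ stages, each stage still pays a fixed pipeline-register (latch) overhead $T_{\text{reg}}$, so

\[
T_{\text{cycle}} = \frac{T_{\text{logic}}}{N} + T_{\text{reg}},
\qquad
\text{ideal throughput} = \frac{1}{T_{\text{cycle}}} \;\longrightarrow\; \frac{1}{T_{\text{reg}}} \text{ as } N \to \infty .
\]

Beyond some depth the per-stage register overhead dominates the cycle time, so each additional stage yields a smaller improvement and even the ideal, stall-free throughput flattens out.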

Barrios answered 8/4, 2010 at 2:30 Comment(0)

Stalls/bubbles in long pipelines definitely cause a huge loss in throughput. And of course, the longer the pipeline, the more clock cycles are wasted.

I tried for a long time to think of other scenarios where longer pipelines could cause a loss in performance, but it all comes back to stalls. (And number of execution units and issue schemes, but those don't have much to do with pipeline length.)

Mcquade answered 8/4, 2010 at 2:54 Comment(0)
