What does programming for PS3's Cell Processor entail?

Asked 31/8, 2009 at 4:46 Answered 4/9, 2009 at 22:59

How is programming for the Cell Processor on the PS3 different than programming for any other processor found on a normal desktop?

What kind of programming paradigms, techniques, and practices are used to fully utilize the Cell Processors potential?

All the articles I hear concerning PS3 development discuss, "Learning how to program on the Cell Processor." What does this really mean beyond some hand waving?

Carlinecarling answered 31/8, 2009 at 4:46 Comment(1)

Glib answer: "pain and suffering!" – Terra 31/8, 2009 at 5:6

In addition to everything George mentions, the SPUs are really better thought of as streaming vector processors. They work best when you have an algorithm that works on long sequences of numerical data, which can be fed through the SPU's limited memory via DMA, rather than having the SPU load a chunk of memory, try to operate on it, find that it needs to follow a pointer to somewhere outside its memory, load that, keep going, find another one, and so on.

So, programming for them isn't a simple model of concurrency and threads; it's more like high performance numerical or scientific computation. It is also non-uniform memory access taken to an extreme.

Furthermore, every processor is in-order with deep pipelines, so the programmer has to be much more aware of data hazards and instruction bubbles and all the numerous micro-optimizations that we are told the compiler "should" take care of for us (but it really doesn't). Things like mispredicted branches, load-hit-stores, cache misses, etc. hurt a lot more than they would on an out-of-order processor that could juggle the order of operations around to hide such latencies.

For concrete examples, check out Mike Acton's CellPerformance blog. Mike is my favorite old-school assembly-happy perf curmudgeon in the business, and he's really earned his chops on this issue.

Carbonate answered 31/8, 2009 at 5:12 Comment(1)

I love Mike Acton's '3 big lies' post cellperformance.beyond3d.com/articles/2008/03/…. I once sent that to my team, which includes games and web programmer. The web guys nearly fainted – Palpitation 31/8, 2009 at 13:13

The Cell part of the PS3 consists of 6 SPU processors. They each have 256 KB of non-shared memory and are connected via a high-speed ring that allows for DMA between each other and the PowerPC host processor. They are not pipelined or cached. This makes it rather different than an multi-core x86 with shared memory, pipelining and caching. Also, the SPU processors do not use the same instruction set as the PowerPC so you've got some asymmetry there.

In short, your typical shared-memory, multithreaded program won't just drop onto the Cell without some work (with the caveat that computer science works hard at making different machines appear to be the same so some implementors try hard to automate the process).

At a high level the program will need to be broken up into tasks that fit within the Cell's hard memory limit. Those can run in parallel and each sub-task can be sequenced to an available Cell processor. At a low level, the compiler (or assembly programmer) will need to work harder to generate code that runs quickly on a processor -- no run-time trickery to make things go faster is available. The theory being that those programmer/compiler friendly features cost silicon and speed that can be better spent giving you more and faster SPUs. Of course, you're not getting any more SPU's on the PS3 but in the general case you'll get more SPUs per number of transistor available on chip.

Guiscard answered 31/8, 2009 at 5:4 Comment(0)

Completely agree with George Philips and Crashworks. Only thing I'd add is that SPU programming is fundamentally about job management. To get the best out of the SPUs you need to keep them ticking over and feeding back results. There's no point in having one SPU chewing through some complex post-processing if your having to sit and wait for the results for a frame and the rest of your SPUs are sat idle. So how you distribute your jobs requires a lot of thought and this has a big impact on how you chunk up your data.

Palpitation answered 31/8, 2009 at 14:1 Comment(0)

"All the articles I hear concerning PS3 development discuss, 'Learning how to program on the Cell Processor.' What does this really mean beyond some hand waving?"

Well, stuff you have to deal with on SPUs...

Atomic operations (lock-free try-discard style).
Strong distinction between memory areas. You have to know which pointer is pointing to which memory area or you'll screw everything up.
No enforced hardware distinction between data and code. This is actually a fun thing, you can setup dynamic code loading and essentially stream subroutines in and out. Self-modifying code is possible but not necessarily practical on SPU.
Lack of hardware debugging aids.
Limited memory size.
Fast memory access.
Instruction set balanced toward SIMD operations.
Floating point "gotchas".

You ideally want to keep the SPUs doing useful work all of the time, but it's really challenging. Not only are they not well suited for handling some types of problems, but often moving a system to be efficient on SPU can involve a complete redesign. Debugging problems that would be easy to catch on the PPU can sometimes take days on SPU.

I think when people use the phrase "learning how to program the cell" they are mostly hand waving. You can learn the basics in a week, the challenge comes in trying to apply that knowledge to real code... which often already exists and isn't in a form well-suited for use on SPU.

Ennead answered 4/9, 2009 at 22:59 Comment(1)

"You can setup dynamic code loading and essentially stream subroutines in and out," which you could always do on the PS2, too, but usually no one dared! – Carbonate 4/9, 2009 at 23:59

Recommended topics

Hot tags