Friday, February 20, 2015

Long live the Cell Processor

Talking with a former coworker yesterday, I realized something that fellow students had already alluded to years ago: the GPU is becoming the Cell.  The exposed API may be different, but the seeds are there.  Let's look at the details.

First, we must go over the concept of the 'wavefront', otherwise known as a 'warp'.  A wavefront is a fancy way of saying that you have multiple threads sharing the same instruction pointer.  In other words, it is akin to multiple people following the same to-do list, with one person telling them all which line of the to-do list to work on.
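
To make that concrete, here is a small CUDA sketch of my own (the kernel name and sizes are made up for illustration): every thread runs the exact same code, and only its index decides which element of the arrays it touches.

    #include <cstdio>

    // Every thread follows this same "to-do list"; the thread and block
    // indices only decide which element each thread works on.
    __global__ void scale_add(const float *a, const float *b, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * a[i] + b[i];   // same instruction stream, different data
    }

    int main()
    {
        const int n = 1 << 20;
        float *a, *b, *out;
        cudaMallocManaged(&a, n * sizeof(float));
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = float(i); }

        // 256 threads per block = 8 warps of 32 threads each.
        scale_add<<<(n + 255) / 256, 256>>>(a, b, out, n);
        cudaDeviceSynchronize();
        printf("out[10] = %f\n", out[10]);

        cudaFree(a); cudaFree(b); cudaFree(out);
        return 0;
    }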

Consequently, if there is a branch and threads within a warp diverge, then both sides of the branch must be executed, with the inactive lanes masked off each time.  Some GPUs will even fall back to executing each thread in the warp in sequence, leaving you with a sixteenth of the potential performance.
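
Here is what divergence looks like, again as an illustrative CUDA fragment of my own (launch it like the sketch above): when lanes of one warp disagree on the branch, the hardware runs each side in turn with the non-participating lanes masked off.

    // Roughly twice the work for this region when a warp's lanes disagree:
    // the 'if' side runs with the 'else' lanes masked off, then vice versa.
    __global__ void divergent(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        if (in[i] > 0.0f)            // lanes disagree -> both paths execute
            out[i] = sqrtf(in[i]);   // pass 1: positive lanes active
        else
            out[i] = 0.0f;           // pass 2: remaining lanes active
    }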

Each wavefront may have its own personal store, known as shared memory in CUDA or local memory in OpenCL (strictly speaking it is shared by the whole work-group).  This fast memory is used for accumulating data before it is written back to the much slower global memory.
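
A typical use, sketched below with a made-up kernel of my own (assume a launch with 256 threads per block, as in the first example), is to accumulate a per-block partial sum in this scratchpad before a single write to global memory.

    // Each block accumulates a partial sum in fast on-chip shared memory,
    // then one thread writes the block's result out to global memory.
    __global__ void block_sum(const float *in, float *block_out, int n)
    {
        __shared__ float scratch[256];          // on-chip, shared by the block

        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        scratch[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Tree reduction entirely inside the scratchpad.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                scratch[tid] += scratch[tid + stride];
            __syncthreads();
        }

        if (tid == 0)
            block_out[blockIdx.x] = scratch[0];
    }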

Second, a few preliminaries on the Cell.  It had a PowerPC core (the PPE) alongside 8 SPUs, synergistic processing units.  Each SPU could be allocated by the application, could have programs compiled for it uploaded to it, and could communicate with the others over a very fast on-chip ring bus, the Element Interconnect Bus.

Each of the Cell's 8 SPUs had 256 kilobytes of local store, fast memory sitting where you would expect an L1 cache, and the SPUs accessed it directly.  This is crucial to understanding the Cell: the SPUs exposed their fast memory as plain addressable memory, and pointers pointed straight into it.  That local store is akin to the shared-memory scratchpad on the GPUs.  As for warps, think of them as operating logically on vectors; the SPUs were vector processors too, with 4 floats per vector as opposed to the 16 or more lanes on GPUs.  In this sense, they both lend themselves to similar algorithms.

Where the two differ is in memory access.  The Cell went with the idea that the developer should be responsible for all the low-level details: if everything on this hardware is going to be hand-optimized anyway, you might as well expose everything.  It follows that the programmer is responsible for issuing asynchronous DMA transfers to and from main memory, filling whatever part of the local store is not occupied by the program currently loaded onto the SPU.
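
There is no direct GPU equivalent of that explicit DMA, but the closest everyday analogue I can think of is hand-staging a tile of global memory through the scratchpad, as in this illustrative kernel of my own.

    // GPU analogue of the Cell's "DMA a chunk into local store, work on it"
    // pattern: each block copies a tile into shared memory by hand,
    // synchronizes, then computes against the on-chip copy only.
    __global__ void tile_process(const float *in, float *out, int n)
    {
        __shared__ float tile[256];            // the block's "local store"

        int tid = threadIdx.x;
        int idx = blockIdx.x * blockDim.x + tid;

        // "DMA in": cooperative copy from global memory to the scratchpad.
        if (idx < n)
            tile[tid] = in[idx];
        __syncthreads();

        // Work only on the fast on-chip copy.
        float v = (idx < n) ? tile[tid] : 0.0f;
        v = v * v + 1.0f;

        // "DMA out": write the finished tile back to global memory.
        if (idx < n)
            out[idx] = v;
    }

The big difference is that on the Cell the transfer was asynchronous and programmer-driven, whereas here the loads still go through the normal memory system and the stall is hidden by other warps, which brings us to the next point.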

The GPU, on the other hand, simply runs a different set of threads when it stalls waiting for data.  It works on the assumption that there are a ton of threads around, and that it is cheaper, and easier for the developer, to let one that is ready to run take over.  The catch is that the on-chip memory must be shared by all the warps that are resident at once: the less memory each one uses, the more backup work there is for when a thread stalls.  Hence why you don't want to use too many uniforms in OpenGL.  (There are more details, but this is the essential part.)
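
That trade-off is concrete enough that you can query it: CUDA will report how many blocks of a given kernel can be resident on one multiprocessor, given its register and scratchpad usage.  Below is a small sketch of my own using the occupancy API (cudaOccupancyMaxActiveBlocksPerMultiprocessor, introduced around CUDA 6.5); the kernel itself is a made-up example.

    #include <cstdio>

    // A kernel that consumes a fixed amount of shared memory per block.
    __global__ void uses_scratch(float *data)
    {
        __shared__ float scratch[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        scratch[threadIdx.x] = data[i];
        __syncthreads();
        data[i] = scratch[threadIdx.x] * 2.0f;
    }

    int main()
    {
        int blocks_per_sm = 0;
        // How many 256-thread blocks of this kernel fit on one multiprocessor
        // at once?  More resident blocks means more warps to hide stalls with.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, uses_scratch, /*blockSize=*/256, /*dynSMem=*/0);
        printf("resident blocks per SM: %d\n", blocks_per_sm);
        return 0;
    }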

The question is: given that memory bandwidth is the bottleneck on modern GPUs, will we see control over how data is transferred to and from the GPU handed over to the developer?  And by then, will we not have simply gone back to Cell-like hardware?
