Sunday, August 8, 2010

Parallel Processing Predictions: Part II

Where we last left off, I stuck to my theory that applications would remain sequential - and that parallelization would be done implicitly through interaction with the OS. I gave a hint to my thinking by noting that there is a delay between the creation of an object and its actual use.

Previous Work: First, Amdahl's Law [Wikipedia has a good write-up on it]. No matter how many processors you add, your speedup is bounded by the serial portion of your code. Dr. Dobb's has an article on beating Amdahl's Law [Dr. Dobb's article]. I disagree - speeding up execution on a single processor still means you are bound by the serial portions of your code.
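To put a number on that bound, here's a minimal sketch - nothing but Amdahl's formula; the function name is mine:

    #include <cstdio>

    // Amdahl's Law: if a fraction p of the work is parallelizable and
    // you have n processors, the best possible speedup is
    // 1 / ((1 - p) + p / n).
    double amdahl_speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main() {
        // Even with 95% of the code parallel, 1024 cores top out near 20x.
        std::printf("%.1f\n", amdahl_speedup(0.95, 1024));  // ~19.6
        return 0;
    }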

For your reference - I consider SIMDizing [Intel has a good intro to get the idea] to be parallelizing code, as well as reordering instructions to maximize the use of the underlying execution units [Apple's write-up is a bit dated, but I like it].
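To make "SIMDizing" concrete, here's a minimal SSE sketch - one instruction adds four floats at a time. The function is mine; a good compiler can often auto-vectorize the plain loop into exactly this:

    #include <xmmintrin.h>  // SSE intrinsics

    // Adds two float arrays, four lanes per instruction.
    // Assumes n is a multiple of 4 and the pointers are 16-byte aligned.
    void add_simd(const float* a, const float* b, float* out, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);  // load 4 floats
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(out + i, _mm_add_ps(va, vb));  // 4 adds at once
        }
    }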

So - you must be laughing. In this post, I'm supposed to explain a miraculous way to make all of today's code parallel. Well, most of today's code -- as long as the OS incorporates these changes.

I will give credit to Intel's Threading Building Blocks and its distant relative Cilk. I shouldn't forget Grand Central Dispatch - if only for its ability to do task scheduling with a view of the entire system load.

Aren't these great solutions? Oh yes. Brilliant. They tackle issues of data locality and of efficiently doing work across multiple cores - essential, given that the future is parallel.

Then what's the problem? If they optimally express the intentions of the programmer, none whatsoever. For me, though, manually keeping track of what is running, and what can be run, is not a "good thing".

Inspiration: When I think of implicit parallel programming, I think OpenGL - even though drawing requests are issued in sequence. For example, you can submit plenty of geometry - but that geometry is just cached in a buffer and drawn when the GPU has time. So while your code is doing whatever logic it needs to do, OpenGL is working in the background getting stuff done. And it's not only the video card that's running in parallel, but part of the API itself: state changes need not take effect until they are actually needed.
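A small illustration of that behavior - a sketch, with buffer setup and error handling omitted:

    #include <GL/gl.h>

    void draw_frame(GLuint vbo, GLsizei vertexCount) {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        // This call returns almost immediately: the driver just queues
        // the command, and the CPU is free to run game logic meanwhile.
        glDrawArrays(GL_TRIANGLES, 0, vertexCount);
        // Only an explicit synchronization like glFinish() would stall
        // the CPU until the GPU has drained its command buffer.
    }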

The other thing I'd like to point out is the intelligence of compilers - more exactly, how much optimization they can do. I'm not suggesting the Octopiler or something similar.

I'm suggesting devising APIs that are designed from the ground up to do work in parallel, while also increasing the compiler's integration into the process. (I'm sure someone else has thought of this - there are brilliant people in CS departments around the world. I'm thinking aloud because I enjoy it; it also helps me coherently determine how my code should evolve.)

To avoid the simplicity that is a graphics API, let's consider another object - say, one that does set manipulations. We have intersections, unions, enumerations, and all the wonderful dangers related to parallelizing these operations. A quick Google search will turn up people writing code with atomic operations - and making mistakes with them. So, in this part I'll document how this API would be presented to the end-user and how it could be implemented.

First, I can feel the temptation to say that all objects are immutable - that all mutations happen over copies. However, that is - how to say this - cheating. Our goal is to change existing code as little as possible.

The API: The end-user of the API can do whatever they're used to doing with scalar code. It's the API itself that must do the miracles.
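To show the flavor I'm after, here's a sketch that uses std::future explicitly as a stand-in for what the API would hide behind ordinary-looking objects (load_a and load_b are hypothetical placeholders for real work):

    #include <algorithm>
    #include <cstdio>
    #include <future>
    #include <iterator>
    #include <set>
    #include <vector>

    // Hypothetical stand-ins for expensive set-producing work.
    std::set<int> load_a() { return {1, 2, 3, 4}; }
    std::set<int> load_b() { return {3, 4, 5, 6}; }

    int main() {
        // The calling code still reads top to bottom, like scalar code;
        // the library kicks the work off in the background.
        auto a = std::async(std::launch::async, load_a);
        auto b = std::async(std::launch::async, load_b);

        // Nothing stalls until the values are actually demanded here.
        std::set<int> sa = a.get(), sb = b.get();
        std::vector<int> common;
        std::set_intersection(sa.begin(), sa.end(), sb.begin(), sb.end(),
                              std::back_inserter(common));
        std::printf("%zu common elements\n", common.size());  // 2
        return 0;
    }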

The Inner Workings: Finally, the moment of truth: how would it work? Let's assume that we are building on Grand Central Dispatch, with cache-coherency methods similar to those of Threading Building Blocks (hybridized - the two would make a great team for the next few years).

The general strategy is that the API maintains a series of sequential tasks. These tasks can run in parallel (or later) right up until the user actually makes a request for data. This leaves us with a few cases - the major ones being Immutable and Mutable objects.

OK, Immutable objects are the easiest. We will add a state to the objects: "Pending". An Immutable object is in the "Pending" state if it hasn't completed initialization. We will also add a sequential list of tasks that can run asynchronously.

The entire API's logic is run on this asynchronous task list. Each task is sequentially numbered by the sequential portion of the API.
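Here's a sketch of that bookkeeping - all names are mine, purely illustrative:

    #include <atomic>
    #include <cstdint>
    #include <deque>
    #include <functional>
    #include <mutex>

    // The state added above: objects stay Pending until their task runs.
    enum class State { Pending, Ready };

    struct Task {
        uint64_t id;                 // sequence number from the serial side
        std::function<void()> work;  // the actual API logic
    };

    class TaskList {
        std::mutex mutex_;
        std::deque<Task> tasks_;
        uint64_t next_id_ = 0;
    public:
        // Called on the sequential side: number the task, queue it, and
        // return the ID so result objects can be stamped "pending until
        // task N has run".
        uint64_t enqueue(std::function<void()> work) {
            std::lock_guard<std::mutex> lock(mutex_);
            uint64_t id = next_id_++;
            tasks_.push_back({id, std::move(work)});
            return id;
        }

        // A background worker (a Grand Central Dispatch queue, say)
        // drains the list in order; independent tasks could instead be
        // handed to parallel workers.
        bool run_one() {
            Task task;
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (tasks_.empty()) return false;
                task = std::move(tasks_.front());
                tasks_.pop_front();
            }
            task.work();
            return true;
        }
    };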

The easiest methods are those that return a new object. That might be a list, or something else - as long as it's all part of the same API, it doesn't matter. This new object, Immutable or not, is not really an object yet, but more like an IOU: it holds the ID of the task that will produce it and is put into the "pending" state. The same is true for existing objects stored within the data structure.
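Continuing the sketch above, the IOU itself is tiny - it exists immediately, but only records which task will fill it in (State and TaskList come from the previous sketch; again, the names are mine):

    #include <atomic>
    #include <cstdint>
    #include <set>

    struct SetIOU {
        uint64_t producing_task;                    // ID from TaskList::enqueue
        std::atomic<State> state{State::Pending};   // flips to Ready when done
        std::set<int> data;                         // filled in when the task runs
    };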

The next easiest case is functions that return basic data types. These could be defined to return a special wrapping object, so they'd behave like their object counterparts. I'd suggest making the compiler hide these gory details, and use decent heuristics to determine whether the overhead of creating an object is worth it.

But back to those returning basic data types. We'd get the ID of a task, but would stall the execution of the sequential code until the data was received.
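That stall maps directly onto a promise/future pair - a minimal sketch of a wrapped int (hypothetical; the real API would bury this inside the wrapper object):

    #include <cstdio>
    #include <future>
    #include <thread>

    int main() {
        std::promise<int> result;             // the numbered task fills this in...
        std::future<int> iou = result.get_future();

        std::thread worker([&result] { result.set_value(42); });

        // ...and the sequential side stalls only here, at the actual read.
        std::printf("count = %d\n", iou.get());
        worker.join();
        return 0;
    }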

Wouldn't we always be stalling? Here's the beauty of the solution - we know exactly what work must be done, and in what order. We can also do dependency checks: if one operation doesn't depend on another, the two can be scheduled to run in parallel. Even individual tasks could be made to run in parallel internally (like the intersection of two sets).
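A sketch of the payoff: the two inner intersections share no inputs, so a dependency check lets them run side by side, and only the final step waits (intersect and the sample data are mine):

    #include <algorithm>
    #include <cstdio>
    #include <future>
    #include <iterator>
    #include <set>

    std::set<int> intersect(const std::set<int>& a, const std::set<int>& b) {
        std::set<int> out;
        std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                              std::inserter(out, out.begin()));
        return out;
    }

    int main() {
        std::set<int> a{1, 2, 3, 4}, b{2, 3, 4, 5};
        std::set<int> c{3, 4, 5, 6}, d{4, 5, 6, 7};

        // a∩b and c∩d are independent: schedule both in parallel.
        auto ab = std::async(std::launch::async, intersect, std::cref(a), std::cref(b));
        auto cd = std::async(std::launch::async, intersect, std::cref(c), std::cref(d));

        // The final intersection depends on both, so only it waits.
        std::set<int> result = intersect(ab.get(), cd.get());
        std::printf("%zu elements survive\n", result.size());  // 1 (just 4)
        return 0;
    }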

Next are the Mutable objects. Surprisingly, these aren't that different. Really - a mutation is just one more numbered task on the object's list, and later reads simply depend on the latest one.
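A minimal sketch of why (MutableSet is my name): each mutation becomes a task chained behind its predecessor, so sequential semantics survive even though the caller's thread never blocks until it reads.

    #include <cstdio>
    #include <future>
    #include <set>

    class MutableSet {
        std::shared_future<std::set<int>> state_;  // result of the latest task
    public:
        MutableSet() {
            std::promise<std::set<int>> p;
            p.set_value({});                       // start with an empty set
            state_ = p.get_future().share();
        }

        // Returns immediately; the insert is just another queued task
        // that waits on its predecessor, preserving program order.
        void insert(int v) {
            auto prev = state_;
            state_ = std::async(std::launch::async, [prev, v] {
                std::set<int> s = prev.get();      // dependency on prior task
                s.insert(v);
                return s;
            }).share();
        }

        // The only point where the caller actually stalls.
        std::size_t size() const { return state_.get().size(); }
    };

    int main() {
        MutableSet s;
        s.insert(1);
        s.insert(2);
        s.insert(2);                               // duplicate: still a set
        std::printf("size = %zu\n", s.size());     // 2
        return 0;
    }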

What I've just done is outline how to parallelize a sequential application by properly engineering the backend libraries. This parallelization is completely transparent to the end-user of the API.

This basic solution might be enough to divorce an application's serial logic from the immensely parallel nature of the underlying computational hardware.
