Sunday, August 8, 2010

Parallel Processing Predictions: Part II

Where we last left off, I stuck to my theory that applications would remain sequential - and that parallelization would be done implicitly through interaction with the OS. I gave a hint to my thinking by specifying that there is a delay between the creation of an object and its actual use.

Previous Work: First, Amdahl's Law [Wikipedia has a good write-up on it]. You can't go faster than the slowest serial portion of your code. Dr. Dobb's has an article on beating Amdahl's Law [DrDobb's Article]. I disagree - speeding up the execution on a single processor still means you are bound by the serial portions of your code.
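
For concreteness, the usual statement of the law: if P is the fraction of the work that can be parallelized and N is the number of processors, the best overall speedup is

    speedup = 1 / ((1 - P) + P/N)

so no matter how large N gets, you are capped at 1/(1 - P) - the serial fraction always wins in the end.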

For your reference - I consider SIMDizing [Intel has a good intro to get the idea] to be parallelizing code, as is reordering instructions to maximize the use of the underlying execution units [Apple's write-up is a bit dated, but I like it].

So - you must be laughing. In this post I'm supposed to explain a miraculous way to make all of today's code parallel. Well, most of today's code - as long as the OS incorporates these changes.

I will give credit to Intel's Threading Building Blocks and its distant relative Cilk. I shouldn't forget Grand Central Dispatch - if only for its ability to do task scheduling with a view of the entire system load.

Aren't these great solutions? Oh yes. Brilliant. Tackling issues of data locality and efficiently doing work across multiple cores. Essential given the future is parallel.

Then what's the problem? If they optimally express the intentions of the programmer, none whatsoever. For me, though, just keeping track of what is running and what can be run is not a "good thing".

Inspiration: When I think of implicit parallel programming, I think of OpenGL - even though drawing requests are issued in sequence. For example, you can submit plenty of geometry, but that geometry is just cached in a buffer and drawn when the GPU has time. So while your code is doing whatever logic it needs to do, OpenGL is working in the background getting stuff done. And it's not only the video card that's running; part of the API works the same way - states need not take effect until they are actually needed.
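
To illustrate the pattern, here is a rough sketch of the idea - not real OpenGL or driver code; CommandBuffer, draw, and flushAsync are names I'm making up - showing how requests can be recorded cheaply and executed later on another thread:

    #include <functional>
    #include <future>
    #include <vector>

    // Rough sketch of deferred submission: draw calls are recorded into a
    // buffer, and the expensive work happens later, on another thread,
    // while the caller's logic keeps running.
    class CommandBuffer {          // hypothetical name
    public:
        void draw(int meshId) {    // record the request, do no work yet
            commands.push_back([meshId] {
                (void)meshId;      // placeholder for the real rasterization work
            });
        }
        void flushAsync() {        // hand the batch off to a worker thread
            auto batch = std::move(commands);
            commands.clear();
            pending = std::async(std::launch::async, [batch = std::move(batch)] {
                for (const auto& cmd : batch) cmd();
            });
        }
        void finish() {            // like glFinish(): block until the work is done
            if (pending.valid()) pending.get();
        }
    private:
        std::vector<std::function<void()>> commands;
        std::future<void> pending;
    };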

The other thing I'd like to point out is the intelligence of compilers - more exactly, how much optimization they can do. I'm not suggesting the Octopiler or something similar.

I'm suggesting devising APIs that are designed from the ground up to do work in parallel, and also increasing the integration of the compiler within the process. (I'm sure someone else has thought of this - there are brilliant people in CS departments around the world. I'm thinking aloud since I enjoy it; it also helps me coherently determine how my code should evolve.)

To avoid the simplicity that is a graphics API, let's consider another object - say, an object that does set manipulations. We have intersections, unions, enumerations, and all the wonderful dangers related to parallelizing these operations. A quick Google search will lead to people writing code using atomic operations, and making mistakes using atomic operations. So, in this part I'll document how this API would be presented to the end-user and how it could be implemented.

First, I can feel the temptation to say that all objects are immutable - that all mutations happen over copies. However, that is - how to say this - cheating. Our goal is to change existing code as little as possible.

The API: The end-user of the API can do whatever they're used to doing with scalar code. It's the API itself that must do the miracles.
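
To make that concrete, here's how I imagine the calling code would look - plain sequential code, with the parallelism hidden behind it (ParallelSet, loadSet, intersect, and size are hypothetical names):

    // Hypothetical user-facing code: it reads like ordinary scalar code.
    // The library is free to defer and parallelize the work behind the scenes.
    ParallelSet a = loadSet("a.txt");       // returns immediately; work is queued
    ParallelSet b = loadSet("b.txt");
    ParallelSet common = a.intersect(b);    // still no stall: 'common' is an IOU
    size_t n = common.size();               // only here does the caller wait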

The Inner Workings: Finally, the moment of truth: how would it work? Let's assume that we are building on Grand Central Dispatch, with cache coherency methods similar to those of Threading Building Blocks (hybridized, the two would make a great team for the next few years).

The general strategy is that the API maintains a series of sequential tasks. These tasks can run in parallel (or even later) right up until the user actually makes a request for data. This leaves us with a few cases - the major ones are Immutable and Mutable.

OK, Immutable objects are the easiest. We will add a state to the objects: "Pending". An Immutable object is in the "Pending" state if it hasn't completed initialization. We will also add a sequential list of tasks that can run asynchronously.

All of the API's logic runs on this asynchronous task list. Each task is sequentially numbered by the sequential portion of the API.

The easiest methods are those that return a new object. That might be a list, or something else; as long as it's all part of the same API, it doesn't matter. This new object, Immutable or not, is not really an object, but more like an IOU. It holds the ID of the task that will produce it and is put into a "pending" state. The same is true for existing objects stored within the data structure.
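
A bare-bones sketch of what that bookkeeping might look like, with std::shared_future standing in for a Grand Central-style task handle (PendingSet, taskId, isPending, and resolve are all invented names):

    #include <chrono>
    #include <cstdint>
    #include <future>
    #include <set>

    // Sketch of the "IOU" idea: a PendingSet is handed back immediately;
    // the real std::set<int> only exists once its task has finished.
    struct PendingSet {                        // hypothetical name
        uint64_t taskId = 0;                   // sequence number from the serial API
        std::shared_future<std::set<int>> work;

        bool isPending() const {               // "Pending" until the task completes
            return work.wait_for(std::chrono::seconds(0)) != std::future_status::ready;
        }
        const std::set<int>& resolve() const { // stall only when the data is needed
            return work.get();
        }
    };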

The next easiest case is methods that return basic data types. These could be defined to return a special wrapping object, so they'd behave like their object counterparts. I'd suggest making the compiler hide these gory details, and use decent heuristics to determine whether the overhead of creating an object is worth it.

But back to those returning basic data types. We'd get the ID of a task, but we would stall the execution of the sequential code until the data was received.
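
Continuing the sketch from above: a method that returns a plain value has no IOU to hide behind, so it waits on its own task (and, transitively, everything that task depends on) before returning:

    // Sketch: a query returning a basic type must block until its task is done.
    std::size_t setSize(const PendingSet& s) {
        return s.resolve().size();   // stalls here only if the set is still "pending"
    }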

Wouldn't we always be stalling? Here's the beauty of the solution - we know what work we must do sequentially, and we can also do dependency checks. If one operation doesn't depend on another, it can be scheduled to run in parallel. Even individual tasks could be made to run in parallel internally (like the intersection of two sets).
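
A sketch of that dependency check, again with invented names: each queued task records the IDs of the results it reads, and two tasks that don't read each other's results can be dispatched at the same time (reads of the same Immutable result are safe to share):

    #include <cstdint>
    #include <vector>

    // Sketch: bookkeeping record for one queued task.
    struct QueuedTask {                   // hypothetical name
        uint64_t id;                      // sequence number from the serial API
        std::vector<uint64_t> inputs;     // IDs of the results this task reads
    };

    // x depends on y if x reads y's result.
    bool dependsOn(const QueuedTask& x, const QueuedTask& y) {
        for (uint64_t in : x.inputs)
            if (in == y.id) return true;
        return false;
    }

    // Two tasks with no dependency either way can run in parallel.
    bool canRunInParallel(const QueuedTask& x, const QueuedTask& y) {
        return !dependsOn(x, y) && !dependsOn(y, x);
    }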

Next are the Mutable objects. Surprisingly, these aren't that different: a mutation is just one more task appended to the same numbered list, and anything queued after it depends on it. Really.

What I've just done is outline how to parallelize a sequential application by properly engineering the backend libraries. This parallelization is completely transparent to the end-user of the API.

This basic solution might be enough to divorce an application's serial logic from the immensely parallel nature of the underlying computational hardware.

Sunday, August 1, 2010

Parallel Processing Predictions: Part I

I've been thinking about parallel processing of late. I've coded a task scheduler with a few swappable algorithms in the backend to test performance; seen my programs degrade in performance as the number of cores increases; and literally learned everything the "oops I made a mistake" way.

So, standing upon my pile of errors, I'm going to attempt to predict what will happen (some of it very long term). Some of it we already know - hardware will make a massive shift. Others - well, let's see if I'm right about any of them in the next decade.

No Change For Most: 99% of applications will remain sequential in nature and never directly deal with threads. This sounds crazy, I know, given the massive push towards multi-core - but it's an unshakable conclusion in my mind.

Rather, libraries will be parallel and asynchronous - making "DIY" algorithms an even sillier affair. As a simple example, consider comparing two strings. There is a delay between when the request to compare the strings is sent and when the result is actually used. If we use something like Grand Central, where the threads are already spawned and just need to be used, then we're talking low-overhead implicit parallelization.
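
Here's a toy version of that example, with C++'s std::async standing in for a Grand Central-style thread pool (a real library built on GCD would dispatch onto already-running worker threads rather than spelling it out like this):

    #include <future>
    #include <iostream>
    #include <string>

    int main() {
        std::string a = "hello world", b = "hello there";

        // Fire off the comparison on another thread; the caller keeps going.
        std::future<bool> equal =
            std::async(std::launch::async, [&] { return a == b; });

        // ... unrelated sequential work happens here ...

        // Only when the value is actually used do we pay for synchronization.
        std::cout << (equal.get() ? "strings match\n" : "strings differ\n");
    }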

Essentially, libraries will present a sequential interface. User-land code will not need to know about parallelism - but will take advantage of the hardware when the programmer leverages the underlying frameworks.

Right about now, some people probably want to remind me that the GUI runs in the main thread whereas heavy processing is normally put into a secondary thread, with mutexes to coordinate the two. I'll recall olde code that actually split up the processing so that the GUI event pump could run - even olde code that would, at explicit times, yield execution to the GUI. Just yielding when data is available to update the GUI is much less messy than having locks in obscure places. Normally, for such things, I have one lock that synchronizes events from the GUI thread with my work thread. The more sequential my code, the easier it is to work with.

Procedural Over Functional: Again, the Haskell fans will most likely disagree. Actually, I'd rather write parallel code in Haskell at times... But let's not get distracted - I believe procedural code will maintain its reign. The reason is simplicity. Everyone understands a TO-DO list, a sequence of instructions, manuals, etc. No manual I know asks you to trace back from the desired result to get to the start of the solution (sometimes I feel like I'm rigging code to have side-effects to get it to run in functional languages).

Even if languages like Haskell are "better", the rich and parallel underlying libraries of the system will make the entire thing moot.

Software Diets: Back-end system libraries will absorb most software complexity. That is, the OS will provide so much functionality that an application will simply have to connect the pieces. Some specialization might be needed, but very little.

This availability of features in the OS will change the software landscape. Software that does not properly integrate with the OS (which will have a global strategy for multiprocessing, with libraries optimized for that strategy) will slow down relative to software that does.

Think of it this way: the OS and its libraries will shift together. And as hardware changes, so will the OS-provided libraries. Software using its own libraries will need to adapt, but software using the OS-provided libraries won't.

This raises problems for multi-platform toolkits, as I figure that certain functionality will be exposed radically differently on different OSes, even though it accomplishes a similar thing.

Hardware Rollercoaster: See the disconnect between the user-land software and the hardware? Well, that means the underlying hardware can change more drastically. If the OS is the only layer that needs to know about the specifics of the hardware, and the user-land software just needs to focus on the algorithm, then people shouldn't even notice the insanity that is going on underneath the hood.

This implies that something akin to the LLVM project would be used to compile to a machine-independent representation and complete the compilation on the final hardware.

More Open-Source: Not the year of Linux, but the year of the open-source libraries. The added complexity of the underlying libraries will mean that they are more costly to develop. I'm not advocating using the spare time of developers to advance a corporation's heavy lifting; rather, the problems will be shared among OS vendors. Open source is just a way to save money by reducing the amount of redundant work among companies.

If I had to wager, I'd say a BSD-style license - so each vendor can advertise having an upper hand when marketing to consumers.

In the end... We aren't headed towards great doom when it comes to multi-core. Yes, people are panicking now as they realize all the caveats; however, we've known for quite a while that 10% of the code is responsible for 90% of the execution time. That 10% of the code, in most cases, should be part of the OS and its supporting libraries.

Final Ramblings: I referred to Apple's technology since I'm more familiar with it. I'm sure Microsoft is doing something similar - and that the .NET runtime might be better suited to the type of parallelism that I'm describing.

Now that I've predicted that nothing much will change for your average programmer, I'm going to do a part 2 and detail exactly how I see the underlying libraries and hardware shifting. In part 3, I'll explore how the development environment could change.

These predictions aren't final. As I learn more and read more of the results from the scientific community, my predictions may be refined over time. I won't delete these pages; instead, I'll admit my mistakes and document how I erred.

One last thing. As a reader, you must have noticed the lack of references in this document. I don't trust texts without references. Neither should you. This part essentially puts forward the theory that user applications will, in the worst case, just need a recompile and will keep on running using all the available hardware.

All this despite the move to multicore.