Monday, February 16, 2015

Super-Scalar Execution for Software

Update: Looking through the C++11 changes, there is something that I have completely missed.  std::future and std::async (or another std function that may spawn threads) which provide a bare-bones way of doing plenty of operations asynchronously within the existing standard.  This is a step towards what I describe below -- and devised long before I ranted.

Multiple processors are the new norm, and there is nothing we can do about it.  We have to bite the bullet and add threads to our software.  A few mutexes here, signals there, and before you know it we have race conditions and stalls.

There is something to be said about sequential execution of code and how legible it is.  It is a reason why we have exception handling - the common code runs through an efficient path, and the corner cases which are rare go through the stack-unwinding path.  Simple.  Easy

But what about, say, multi-threaded code.  You now have multiple instruction pointers and things dancing all over the place.  It is, for lack of a better word, a mess.

Let me elaborate.  Mutexes are terrible since they stall threads.  They act as bottle-necks preventing parallel execution.  The more code within a mutex, the less parallelism you end up with.  Task-schedulers fragment what would be linear code into a tree of potentially parallel functions.  Again, try to undersand the algorithm in that mess.

Then, it dawns upon me.  Your modern CPU does something known as super-scalar execution.  Fancy term to say it knows the dependencies among instructions and can execute, within reasonable limits, code out of order.  It can fuse common instructions into one, or it can split complex instructions into many.  For lack of a better word, it rewrites and parallelizes code among its various execution units - and that with each and every piece of linear software -- single thread that is all.

Your compiler knows that this happens, and will often enough try to re-organize code such that the various execution units are always fed with something.  It might do something that appears suboptimal, but which runs faster, since the target processor can run that instruction in parallel.

Back to the point: linear code is easy.  Is there a way to parallelize our linear code?  Doing so automatically is an exercise in frustration since the key to making something happen in parallel is the data.  Data has to be operated on, for lack of better words, atomically.  But the CPU works around this.

How?  Each operation takes up an execution unit.  An execution unit is like a thread of execution waiting to be used.  Each thread of execution will operate for a known amount of time (cycles) for each operation.  Can we replicate this pattern with regular code?

Let me propose the idea of a "Super Scalar API", where we have plenty of worker threads awaiting commands to run.  We become completely data-driven, in that the worker threads will distribute work based on data chunks and not around the series of instructions that must be followed for execution to function.

But now we have this issue: we need a way for the API to know of the data, and for the code to deal with the sequence of events.

First, we need the concept of a pipeline.  Running our sequential code should fill this pipeline with instructions and data.  Then, this pipeline should be able to look at what is being done, and assign work to the appropriate worker threads.  Yes, we do have two points of serial execution (logic + pipeline), but I expect that the brunt of the work should occur on a worker thread.

Second, we need to know what these worker threads would support.  If we know the type of operations that applications do so frequently, then would could make them happen out of order.  Think about us making the API act synchronously, but in reality it is asynchronous.  Again, not unheard of.  OpenGL, for the longest time, worked like this.    STL-like functionality would be the prime candidate; you need to do some operation on some data on some container.  Either as reduction, or as accumulation.  As you can imagine, mutability plays a massive role in this.

Third, and lastly, we need to deal with conditions asynchronously.  If (result of previous operation) needs a better solution.  And, thanks to lambdas, we can do something very nifty.  Define our own If (with a capital I) that captures the condition and feeds it rhough to the pipeline.  Sequential execution illusion is not broken, and we get parallel execution out of it.

Finally, the problem will be there is always a way to have too many interdependencies to make parallelism less possible.  Embarassingly sequential.  But I would argue that if the containers are properly built, then over time parallelization of all applications built upon the abstractions can be improved by seeing how they are used.

And there you have it.  Super-scalar execution as the hero to make massive parallelism possible.  Say, simpler.  Well, that is my thinking.

No comments:

Post a Comment