I realize full well that Amdahl's law says the serial portions of an application set a lower bound on its run time, no matter how many cores are added. Another part of me wonders if it really matters.
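To put a number on that bound, here is a minimal sketch of Amdahl's law as a function. The name `amdahl_speedup` and the 90% figure are my own assumptions for illustration, not measurements of any real application.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when `parallel_fraction` of the run time
    scales perfectly across `cores` and the rest stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical application in which 90% of the work parallelizes.
    for cores in (1, 2, 8, 64, 1024):
        print(f"{cores:5d} cores -> at most {amdahl_speedup(0.9, cores):.2f}x")
```

With 10% of the work serial, the bound approaches 10x however many cores are added, which is why the serial glue sets the limit.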
Let me rephrase my thoughts: certain portions of applications explicitly need the extra compute power to run. Image processing, audio processing, and physics simulations are heavy. But the glue between these is not.
Of course, we are heading towards the land of highly parallel architectures. There is no denying it. Compute-intensive applications will benefit. The OS will become more complex. Most programmers will not see the difference. They should not need to care about threads. Not even about mutexes. Even less about race conditions.
Previously I advocated implicit parallelization through complex objects that would run asynchronously to the main application. But once I started thinking about actual applications, I realized that only certain frameworks would need to behave like this. Applying any parallelization strategy globally is silly. Consider processor affinity: a single application using a single thread on a single core will benefit from the cache more easily than one whose work is distributed among cores.
If I were to write an application, my first reaction would not be to parallelize it, but to see how well it behaves on a single core. Test it. Debug it. Get it working. If the performance is good enough, then I don't bother going the extra step. The argument that processors will become more parallel at the expense of single-core speed is valid -- my reply is that the thinking behind IBM's Cell was well ahead of its time.
Let us now consider a simple forms-based application. The application may be serial, but the database engine is a highly parallel system capable of efficiently managing unthinkable amounts of data. The form should run on a single core. There should be no mutexes. No race conditions. The bottleneck should be accessing and processing requests to the database. To replace what is currently done with threads (I am thinking of .NET, where one thread stalls on the database query while another runs the UI), the database API should be asynchronous and trigger events on the main thread of the application. Congratulations! Many more people are taking advantage of parallel architectures without becoming parallel programmers!
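Here is a rough sketch of what such a form could look like, using Python's `asyncio` as a stand-in for the asynchronous database API. `query_database`, `Form` and the SQL string are hypothetical names I made up; a real driver would wait on the database engine where this sketch just sleeps.

```python
import asyncio


async def query_database(sql: str) -> list[str]:
    """Hypothetical stand-in for an asynchronous database call."""
    await asyncio.sleep(0.05)  # simulates waiting on the database engine
    return [f"row for: {sql}"]


class Form:
    """A form whose handlers all run on the single event-loop thread."""

    def __init__(self) -> None:
        self.rows: list[str] = []
        self.ticks = 0

    def on_results(self, rows: list[str]) -> None:
        # Runs on the main thread, so no lock is needed around self.rows.
        self.rows.extend(rows)

    async def submit(self, sql: str) -> None:
        self.on_results(await query_database(sql))

    async def keep_ui_alive(self, pending: asyncio.Task) -> None:
        # Stand-in for UI event processing while the query is in flight.
        while not pending.done():
            self.ticks += 1
            await asyncio.sleep(0.01)


async def main() -> Form:
    form = Form()
    pending = asyncio.create_task(form.submit("SELECT name FROM customers"))
    await form.keep_ui_alive(pending)
    await pending  # re-raises here if the query failed
    return form


if __name__ == "__main__":
    form = asyncio.run(main())
    print(form.rows, "UI ticks while waiting:", form.ticks)
```

The form never creates a thread or takes a lock. The UI keeps ticking while the query is outstanding, and the result arrives as an ordinary call on the same thread.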
A more complex example: an image viewer. Something that shows pictures captured from a camera. This application will ask the OS to load an image. Should loading the image be a sequential task? Not for JPEG. Should the application try to load different images on multiple threads, or should the OS work on loading images in the background and present results when needed? The latter should be the case. Ideally, the programmer never has to worry about a mutex. Not even a race condition. The word "parallel" should never come to mind. Essentially, effort should be expended on the UI rather than the technological merits.
I've come up with a counter-example for myself. What if an image is loaded and immediately drawn somewhere? I'll remind myself that OpenGL is a successful asynchronous API, and that drawing is only guaranteed to have completed once glFinish returns (glFlush and other conditions also come into play, but we don't need to explore them). So we request an image to be loaded. That happens in the background. Need to blit it? Sure! That can be added to the queue. Display to the screen? That can be queued as well! At the end of the draw call tree for the window and its controls? Good, crunch through that drawing in parallel and display it.
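The queueing idea can be sketched in a few lines. This is not how OpenGL is implemented; `CommandQueue`, `enqueue` and `finish` are hypothetical names, and the sketch assumes the queued commands are independent of one another (say, one per control), so they can safely run at the same time.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


class CommandQueue:
    """Hypothetical deferred-execution queue, loosely in the spirit of an
    asynchronous graphics API. Assumes commands do not depend on each other."""

    def __init__(self, workers: int = 4) -> None:
        self._pending = deque()
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def enqueue(self, command: Callable[[], str]) -> None:
        # Nothing runs yet; the call returns immediately.
        self._pending.append(command)

    def finish(self) -> list[str]:
        """Run everything queued so far and block until all results are in."""
        commands = list(self._pending)
        self._pending.clear()
        # map() returns results in submission order even though the
        # commands themselves may execute concurrently.
        return list(self._pool.map(lambda command: command(), commands))

    def close(self) -> None:
        self._pool.shutdown()


if __name__ == "__main__":
    queue = CommandQueue()
    for control in ("toolbar", "thumbnail strip", "main image"):
        queue.enqueue(lambda name=control: f"drew {name}")
    print(queue.finish())
    queue.close()
```

The calling code stays sequential: it queues work and asks for the finished result. The threads live entirely inside the library.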
Let us suppose that the application also does facial recognition. Something heavier. Let us suppose that the OS does not support it and it must be done from scratch. But it is never truly from scratch. At a certain point the eigenvalues will most likely need to be calculated for the light-corrected faces. That sounds like something any scientific library will provide. And these libraries will be tuned for modern parallel machines.
What I'm getting at is that parallelism is not something that everyone will have to deal with. And no-one should want to deal with it. Small pockets of people will deal with it. Most coders will work on software that stays highly sequential. A push to expose parallelism in all walks of software is not the right move. Rather, specialized libraries and software as a service should be parallel.
Software as a service? Think of an SQL server. Indexing services reporting which files have changed. Any request to the OS. Pattern recognition services in the OS. This opens the door to NUMA-style architectures which partition memory regions with processing units -- a much easier way to scale up the number of cores.
Pushing the idea further: parallel applications should, in the worst case, behave like servers independent of the UI.
What I'm getting at is that there are already many things that need to be thought of when writing code. UI design and usability. Correctness. Resource leaks (lingering connections to a database, for example).
Performance from multiple cores, when done right, requires careful management of data. Processor affinity must be respected. Cache lines should not be shared amongst cores. Cache lines should be accessed sequentially. Race conditions must be avoided. Synchronization must be kept to a minimum.
Consider that an application made to use cache correctly can run twice as fast as the same application made to use multiple threads.
Why should a controller that ran well on an i386 be forced to worry about parallelism on multiple cores? Are we that bad at creating application programming interfaces?
My conclusion is that parallel processing is not for the masses. It should be available, but used with discretion. This rush towards parallelism in all matters is part of a techno-freak's fantasized reality-distortion field. Yes - we are going parallel. No - we don't need to make software thousands of times more complicated to write in order to benefit from it, and keeping it simple saves time & money.
Who decides whether your average person must become a parallel programmer? API designers. I'm sure that if they took a step back, they would realize that for most purposes asynchrony on a single thread is the easy way forward for end-users of the API.
EDIT: I initially wrote this quite late. I cleaned up the argument (a bit).
NOTE: I really should add notes about the GPU; but that will be left for another post.