I am a researcher in parallel computing. All of these papers represent good work and are contributions to the field, but they still obey the limited model I was referring to: ship a large amount of data to the GPU, do data parallel computations on that data, and ship the results back. As I pointed out in a post below, this model does not work well when you only have small quantities of data at a time, yet there is parallelism to exploit.
I'll defer to your judgment about this stuff, then. (I'm a PhD student in computer vision, and we've recently been using a GPU SVM library that's been amazing for cutting down our processing times, so I guess I've been a little dazzled by this stuff.)
Anyways, since you're in this field, what's your feeling about the future of parallel computing, with regards to the different vendors? Which of CUDA/OpenCL/Larrabee will win out? Or none of the above? When will APIs settle down?
Honestly, I don't know, and anyone who claims to know is selling you something.
Your question is the question in parallel computing right now. And it affects all sizes and scales, from processor architecture (look at the different architectures of an Intel Quad Core, Cell, GPUs, and the upcoming Larrabee and Fusion) to supercomputers (BlueGene-style thousands of slow cores with fast interconnect, RoadRunner-style typical multicore processors with Cells as accelerators, Nvidia's giant GPU box, or just lots of SMPs). We don't know what the future will look like, which makes this an interesting time to be in the field. People at all levels are experimenting with different architectures. We don't know what will win, if any one thing will win, or when we'll know.
With that said, I don't think APIs at the processor level will settle down until the hardware does. My understanding of OpenCL is that it aims to provide a programming model that works on architectures as different as GPUs, Cell, and Larrabee, and that it will supplant CUDA. That sounds like a great idea, but plenty of great ideas have failed in practice before.
I think it's going to be at least several years of experimentation before the hardware settles down. My own belief (that is, an opinion not based on experimental data) is that we'll end up with a heterogeneous chip: lots of simple cores for parallelism, a small number of sophisticated cores for sequential computation, all sharing an integrated memory hierarchy.
The basic metric for this kind of comparison is how long you can keep a compute node working on a part of the problem without any new input data and without producing intermediate results that other parts of the code need before they can continue (rendezvous points, I believe these are called).
The longer that time the better suited the problem is for a massive parallel solution.
If that time is low relative to the I/O that needs to be done, you'll quickly find that the bus carrying data between the host CPU and the number cruncher is the bottleneck.
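A rough sketch of that comparison, with illustrative (assumed, not measured) numbers for GPU throughput and PCIe bandwidth:

```python
# Back-of-envelope check: is a kernel compute-bound or bus-bound?
# The hardware numbers (1 TFLOP/s GPU, 8 GB/s bus) are assumptions
# chosen for illustration, not measurements of any real card.

def bottleneck(flops, bytes_moved, gpu_flops_per_s=1e12, bus_bytes_per_s=8e9):
    """Compare time spent computing against time spent moving data
    over the bus; whichever is larger is the bottleneck."""
    compute_time = flops / gpu_flops_per_s
    transfer_time = bytes_moved / bus_bytes_per_s
    return "bus" if transfer_time > compute_time else "compute"

# Example: element-wise add of two N-element double vectors
# (1 flop per element; 3 * 8 bytes moved per element for
# two inputs and one output -- very low arithmetic intensity).
n = 10**7
print(bottleneck(n, 3 * 8 * n))  # → bus
```

With those assumed numbers, an element-wise add spends orders of magnitude longer on the bus than in the ALUs, which is exactly the "low compute time relative to I/O" case described above.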
The big problem is that you can work with some fairly large amounts of data and still have the bus be your bottleneck. I was doing some GPU work a few summers ago that focused on matrix multiplications. We were sending matrices with 8k numbers on a side to the GPU for multiplication and still ending up with the bus being the slowest part of the computation.
How much RAM was on the card? 64-bit numbers * 8000^2 = 512 MB per matrix. Granted, today you can have 4GB per card, but back then you were probably stuck with a fraction of that.
Still, PCIe 2.0 x16 is limited to 8 GByte/s, so I guess the real question is how many matrices were you multiplying?
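The question matters because a single large multiply has high arithmetic intensity. A quick sketch (the GPU throughput figure is an assumption for an era-appropriate card, not a spec):

```python
# An n x n dense matrix multiply costs roughly 2 * n**3 flops but only
# moves 3 * n**2 * 8 bytes over the bus (two 64-bit inputs, one output).
# gpu_flops_per_s = 100 GFLOP/s is an assumed, illustrative figure.

def matmul_times(n, gpu_flops_per_s=1e11, bus_bytes_per_s=8e9):
    """Return (compute_seconds, transfer_seconds) estimates."""
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * 8
    return flops / gpu_flops_per_s, bytes_moved / bus_bytes_per_s

compute_t, transfer_t = matmul_times(8192)
print(f"compute ~{compute_t:.1f}s, transfer ~{transfer_t:.2f}s")
```

Under these assumptions a single 8k-side multiply is heavily compute-bound (its flops-per-byte ratio grows with n), so for the bus to dominate, the workload was presumably streaming many matrices through the card rather than doing one big multiply.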