"there’s a lot more to realising good parallel speedup than just choosing the right language"
from ye olde days - if you can't split the computation into nice isolated chunks the point is moot, so relatively fine granularity of your algorithm's steps is the key here; a lack of (or strict control over) side effects means you don't get to make too many mistakes in the mechanics of actually running the computation.
> so relatively fine granularity of your algo stepping is the key here
Fine granularity requires pretty much zero-overhead synchronization, which green threads or any shared-memory multithreading implementation can't deliver; a task needs to do a few thousand nanoseconds of useful work before the synchronization costs become even just bearable.
> Fine granularity requires pretty much zero overhead synchronization
Yes, like a work-stealing scheduler. Many tiny tasks are fine as long as you keep them on one core. Other cores can steal a batch of them from the other end of your queue with minimal synchronisation every now and again.
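For what it's worth, a rough sketch of that idea in Python (a toy, not a real scheduler - production implementations like Chase-Lev deques are lock-free, and the class and method names here are my own invention):

```python
import threading
from collections import deque

class WorkStealingQueue:
    """Per-worker task queue. The owner pushes and pops at one end
    (LIFO, so hot tasks stay in cache on that core); thieves take a
    batch of the oldest tasks from the other end, so the lock is only
    contended when someone actually runs out of work."""

    def __init__(self):
        self._dq = deque()
        self._lock = threading.Lock()

    def push(self, task):
        # Owner only: add a new task at the "hot" end.
        with self._lock:
            self._dq.append(task)

    def pop(self):
        # Owner only: take the most recently pushed task, or None.
        with self._lock:
            return self._dq.pop() if self._dq else None

    def steal_batch(self, n=4):
        # Thief: grab up to n of the oldest tasks in one synchronised
        # step, amortising the cost over the whole batch.
        with self._lock:
            return [self._dq.popleft()
                    for _ in range(min(n, len(self._dq)))]

q = WorkStealingQueue()
for i in range(10):
    q.push(i)
print(q.pop())           # owner gets the newest task: 9
print(q.steal_batch(4))  # a thief walks off with [0, 1, 2, 3]
```

The two ends matter: the owner's LIFO end keeps recently spawned (cache-warm) tasks local, while thieves take the cold end, which is exactly the locality argument below.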
That's not my experience. I think work stealing is low overhead because it amortises synchronisation over a batch, only synchronises at all when a thread actually needs more work, and reduces contention by having owners and thieves operate on opposite ends of the queue. Why do you think it's high overhead?
Isn't work stealing excellent for locality? Jobs stay on the same core until there's a need to steal, and then the jobs most likely to be non-resident in cache are the ones taken. Isn't it even cache oblivious?