"there’s a lot more to realising good parallel speedup than just choosing the right language"
from ye olde days - if you can't split the computation into nice isolated chunks the point is moot, so relatively fine granularity of your algorithm's steps is the key here; a lack of (or strict control over) side effects means you don't get to make too many mistakes in the mechanics of actually running the computation.
> so relatively fine granularity of your algo stepping is the key here
Fine granularity requires pretty much zero-overhead synchronization, which green threads or any shared-memory multithreading implementation can't deliver; a task needs to do a few thousand nanoseconds of useful work before the synchronization costs become even just bearable.
> Fine granularity requires pretty much zero overhead synchronization
Yes, like a work-stealing scheduler. Many tiny tasks are fine as long as you keep them on one core. Other cores can steal a batch of them from the other end of your queue with minimal synchronisation every now and again.
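For what it's worth, a rough sketch of that idea in Python (a toy, not a real scheduler - production implementations like Chase-Lev deques are lock-free, and the class and method names here are my own invention):

```python
import threading
from collections import deque

class WorkStealingQueue:
    """Per-worker task queue. The owner pushes and pops at one end
    (LIFO, so hot tasks stay in cache on that core); thieves take a
    batch of the oldest tasks from the other end, so the lock is only
    contended when someone actually runs out of work."""

    def __init__(self):
        self._dq = deque()
        self._lock = threading.Lock()

    def push(self, task):
        # Owner only: add a new task at the "hot" end.
        with self._lock:
            self._dq.append(task)

    def pop(self):
        # Owner only: take the most recently pushed task, or None.
        with self._lock:
            return self._dq.pop() if self._dq else None

    def steal_batch(self, n=4):
        # Thief: grab up to n of the oldest tasks in one synchronised
        # step, amortising the cost over the whole batch.
        with self._lock:
            return [self._dq.popleft()
                    for _ in range(min(n, len(self._dq)))]

q = WorkStealingQueue()
for i in range(10):
    q.push(i)
print(q.pop())           # owner gets the newest task: 9
print(q.steal_batch(4))  # a thief walks off with [0, 1, 2, 3]
```

The two ends matter: the owner's LIFO end keeps recently spawned (cache-warm) tasks local, while thieves take the cold end, which is exactly the locality argument below.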
That's not my experience. I think work stealing is low overhead because it amortises synchronisation over a batch, only synchronises at all when a thread actually needs more work, and reduces contention by having owners and thieves operate on opposite ends of the queue. Why do you think it's high overhead?
Isn't work stealing excellent for locality? Jobs stay on the same core until there's a need to steal, and then the jobs most likely to be non-resident in cache are the ones taken. Isn't it even cache oblivious?