Doesn't the unrolled version skip the speculation for three out of every four entries? If so, it seems surprising that it does so much better, since it looks like a mix between the first and second versions rather than just an unrolled implementation of the second.

I'd also be interested in seeing how this performs when the data isn't allocated in one chunk (which seems to be the absolute best case scenario for this trick). Will it end up worse than the "naive" version due to all the branch mispredictions?



Hi, thank you for your comment.

> Doesn't the unrolled version skip the speculation for three out of every four entries?

It does, but what matters is breaking the data dependency often enough. By speculating only once every 4 items, we lower the average number of instructions per iteration (from 16 to 7.7). Meanwhile, thanks to value speculation, the CPU is still able to start working on the next 4 items while processing the current loop iteration.

Indeed, it's not a pure unrolling of the second function.
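
To make this concrete, here is the shape I mean, as a simplified sketch (not the exact code from the article; the Node layout is the usual value-plus-next pair):

    #include <stdint.h>

    typedef struct Node {
        uint64_t value;
        struct Node *next;
    } Node;

    /* Unrolled by 4: the first three hops follow the pointer chain as in
       the naive version, and only the fourth hop speculates that the next
       node sits right after the current one in memory. A (normally
       well-predicted) branch repairs the guess when it is wrong. */
    uint64_t sum_unrolled(Node *node) {
        uint64_t value = 0;
        Node *next = NULL;
        for (; node; node = next) {
            value += node->value;      /* item 1: plain pointer chasing */
            node = node->next;
            if (!node) break;
            value += node->value;      /* item 2 */
            node = node->next;
            if (!node) break;
            value += node->value;      /* item 3 */
            node = node->next;
            if (!node) break;
            value += node->value;      /* item 4: speculate on this hop */
            next = node + 1;           /* guess: nodes are contiguous */
            if (node->next != next)
                next = node->next;     /* bad guess: take the real pointer */
        }
        return value;
    }

The three plain hops keep the per-item instruction count down, and the single speculated hop is what lets the out-of-order engine start on the next group early.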

> Will it end up worse than the "naive" version due to all the branch mispredictions?

If the data is allocated randomly, then the next-cell prediction will often be wrong. In turn, the CPU's branch predictor will predict that the guess fails and won't continue execution with the predicted value, so the behavior is the same as the original implementation's. The overhead is therefore not due to branch misprediction, but to the additional cycles needed to compute the predicted value.
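
For reference, here is a simplified form of the per-item speculation loop (not the exact code from the article), reusing the Node layout from the sketch above. With a random layout, the correcting branch is almost always taken, so the only waste is the guess computation itself:

    uint64_t sum_speculated(Node *node) {  /* Node as in the sketch above */
        uint64_t value = 0;
        Node *next = NULL;
        for (; node; node = next) {
            value += node->value;
            next = node + 1;               /* extra add: compute the guess */
            if (node->next != next)        /* extra compare + branch */
                next = node->next;         /* random layout: nearly always taken */
        }
        return value;
    }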

If the data is allocated in chunks, the optimization remains useful, as the cost of rolling back the CPU pipeline is not that high (I tested with the original list cut into 10 chunks: it added 100ms, caused by 100k additional branch misses).
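
To reproduce the chunked case, a harness along these lines works (hypothetical test code, not my actual benchmark): it splits the list across separate allocations, so the node + 1 guess fails exactly once per chunk boundary.

    #include <stdlib.h>

    /* Build n nodes split across nchunks separate allocations; nodes are
       contiguous within a chunk, so the speculated hop only fails at the
       nchunks - 1 chunk boundaries. Uses the Node type from above. */
    Node *make_chunked_list(size_t n, size_t nchunks) {
        size_t per_chunk = (n + nchunks - 1) / nchunks;
        Node *head = NULL, *tail = NULL;
        for (size_t c = 0; c < nchunks && c * per_chunk < n; c++) {
            size_t left = n - c * per_chunk;
            size_t count = left < per_chunk ? left : per_chunk;
            Node *chunk = malloc(count * sizeof(Node));
            for (size_t i = 0; i < count; i++) {
                chunk[i].value = c * per_chunk + i;
                chunk[i].next = i + 1 < count ? &chunk[i + 1] : NULL;
            }
            if (tail) tail->next = chunk; else head = chunk;
            tail = &chunk[count - 1];
        }
        return head;
    }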



