Implementing Value Speculation in OCaml (lortex.org)
114 points by Smaug123 on May 7, 2023 | 12 comments


This is about the classic trick of speculating on values using branch prediction (`if value == expected then f(expected) else f(value)`), which is always fun to see. But be careful in OCaml: the memory model relies on the memory ordering provided by data dependencies (e.g. on Arm) to ensure memory safety, so I suspect this trick might in general be memory-unsafe in the presence of data races (more precisely, values from other threads might be seen before their initialization). (OCaml's memory-safety guarantees do not apply here anyway, because the code uses Obj.magic.)
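For readers who haven't seen it applied to pointer chasing, here is a minimal sketch in OCaml of the shape of the trick (hypothetical code, not the article's exact listing; the `+12` offset assumes 3-word cons cells on 64-bit, allocated contiguously and in order, which is exactly the kind of assumption that makes it unsafe):

```ocaml
(* Sum an int list, speculating that the next cons cell sits exactly
   one cell (3 words = 24 bytes, i.e. +12 in OCaml's tagged-int
   arithmetic) past the current one. *)
let rec sum_speculate acc l =
  match l with
  | [] -> acc
  | x :: next ->
    (* forge a pointer to where we guess the next cell lives *)
    let guess : int list = Obj.magic (Obj.magic l + 12) in
    (* [==] is physical equality: while the load of [next] is still
       in flight, a correctly predicted branch lets the CPU keep
       traversing via [guess], breaking the chain of dependent loads *)
    let next = if next == guess then guess else next in
    sum_speculate (acc + x) next
```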


Forgive the silly question, but isn't the speculative execution system responsible for making sure that speculation doesn't break your program semantics? In general, would it not be a bug in the speculative execution system to (e.g.) dereference a null pointer if your program isn't actually going to do that?

More generally, where can I read about this kind of thing to find out exactly what the speculative execution system is or isn't allowed to do? Arm documents the speculation barrier instruction, and I've found a paper that models speculative execution formally, but no actual specification of the semantics.


The question is whether the user replacing `f(value)` with `if value == expected then f(expected) else f(value)` preserves program semantics.

Look for "address dependency" in the Arm architecture reference manual, e.g. https://developer.arm.com/documentation/ddi0406/c/Appendices... .

Note that the approach in the original post is also wrong if the GC sees the predicted value, because the GC would try to access it regardless of whether the prediction is valid.


Doesn't the unrolled version skip the speculation for three out of every four entries? If so, it seems suspicious that it does so much better, given that it looks like a mix between the first and second versions rather than just an unrolled implementation of the second one.

I'd also be interested in seeing how this performs when the data isn't allocated in one chunk (which seems to be the absolute best case scenario for this trick). Will it end up worse than the "naive" version due to all the branch mispredictions?


Hi, thank you for your comment.

> Doesn't the unrolled version skip the speculation for three out of every four entries?

It does, but what matters is removing data dependencies often enough. By speculating only once every 4 items, we lower the average number of instructions per iteration (from 16 to 7.7). Meanwhile, thanks to value speculation, the CPU is still able to start working on the next 4 items while processing the current loop iteration.

Indeed, it's not a pure unrolling of the second function.
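Roughly, the shape is something like this (again a hypothetical sketch, with the same layout assumptions as above: 3-word cons cells allocated contiguously, so the cell 4 links ahead is 96 bytes, i.e. `+48` in tagged-int arithmetic, past the current one):

```ocaml
(* Process 4 cells per iteration and speculate only on the pointer
   that links one block of 4 to the next. Within a block the loads
   still depend on each other, but across blocks the CPU can run
   ahead with [guess]. *)
let rec sum4 acc l =
  match l with
  | a :: b :: c :: d :: next ->
    let guess : int list = Obj.magic (Obj.magic l + 48) in
    let next = if next == guess then guess else next in
    sum4 (acc + a + b + c + d) next
  | rest -> List.fold_left ( + ) acc rest  (* fewer than 4 items left *)
```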

> Will it end up worse than the "naive" version due to all the branch mispredictions?

If the data is allocated randomly, then the next-cell prediction will often be wrong. In turn, the CPU's branch predictor will predict that the prediction is wrong and won't continue execution with the predicted value, which gives the same behavior as the original implementation. The overhead is therefore not due to branch mispredictions, but to the additional cycles needed to compute the predicted value.

If the data is allocated in chunks, then the optimization remains useful, as the cost of rolling back the CPU pipeline is not that high (I tested with the original list cut into 10 chunks, and it added 100ms, caused by 100k additional branch misses).


It's very smart and shows a deep understanding of what is going on right down to the hardware.

That said I'd be a bit lost if I found the end result in any code that I was working with. I hope anyone using these sorts of tricks is also good at documentation and encapsulation.


Interesting, worth a read. It's doing a little more than value speculation: it's using unsafe casts to speculate on memory layout. The first thing I could think of was to unroll that loop, and they did that too. Nice :-)

I'd be curious what the machine code looks like.


Would this also be a good optimisation in Haskell? Haskell code sometimes ends up pointer chasing down lists too. I wonder if the standard libraries or maybe the compiler could do this?


I and some others added prefetch to GHC Haskell years ago. It's surprisingly tricky to get it to work, but it can be used.


Since list traversal is very common in OCaml, could the compiler emit code that pre-fetches the next element?


Can't help thinking it would be better to use an array rather than a list.


I didn't know about the value speculation trick or about Obj.magic, so this is a fantastic article.



