
A problem with this is that the inference has to be 100% deterministic. For example, if you change the tiling order of a matmul you may get slightly different results. It's easy to imagine people changing the tiling in llama.cpp or other libraries. You are very unlikely to get bit-exact results from different libraries.
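The root cause is that floating-point addition is not associative, so the order in which a matmul accumulates its partial products changes the low-order bits of the result. A minimal illustration in plain Python:

```python
# Floating-point addition is not associative, so two accumulation
# orders (two different "tilings") can disagree in the last bits.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # accumulate left to right
right = a + (b + c)  # a different accumulation order

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```

The values are numerically close, but not bit-exact, which is exactly what breaks a verification scheme that hashes or compares raw outputs.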

Interestingly (though maybe not relevant here) you can also get different results from multi-user inference systems depending on what other requests are in the batch. It's possible to avoid this but I'm pretty sure most systems don't.

The "slightly different" bit of course makes it worse - it will work 99% of the time and fail unpredictably the rest.



You could probably set the seed and temperature and store these. Running with a few variations and keeping the best would probably work even better, at the expense of more computation.
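A hedged sketch of the idea, assuming logits are already available and that the model itself is bit-exact on replay (the `sample_token` helper and the `record` layout are hypothetical, not any real API):

```python
import math
import random

# Hypothetical helper: sample one token from logits using a stored
# seed and temperature, so the draw can be replayed deterministically.
def sample_token(logits, temperature, seed):
    rng = random.Random(seed)                      # seeded RNG -> reproducible draw
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                # subtract max for numeric stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]            # softmax probabilities
    return rng.choices(range(len(logits)), weights=weights)[0]

# Store the sampling parameters alongside the output for later replay.
record = {"seed": 1234, "temperature": 0.7}
logits = [1.0, 2.5, 0.3, 2.4]

t1 = sample_token(logits, record["temperature"], record["seed"])
t2 = sample_token(logits, record["temperature"], record["seed"])
print(t1 == t2)  # True: same seed and temperature reproduce the sample
```

This only pins down the sampling step; it does not help if the logits themselves differ across libraries or tilings, which is the problem raised above.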



