Final words: A modern compiler contains a huge set of heuristics that interact with each other in strange ways. They have been tuned for 'average' code. But an interpreter is a very different beast, so you'll inevitably get disappointing results.
"Also, it's not a good idea to second-guess the region selection heuristics. I tried this many times ('This can't be the best path!') and failed miserably. The selected traces are sometimes very strange, but so far they turned out to be near optimal. It's quite embarrassing if your own creation turns out to make smarter decisions than yourself. ;-)"
What you'll ultimately end up doing is spending more time fiddling with pragmas and code layout trying to control what machine code the compiler produces than you save by not writing the assembler directly yourself.
It has to be said, though, that we really are talking about a critical loop here. There is no real justification for using assembler instead of a compiler outside of extremely hot (or extremely odd, e.g. dynamic invocation, coroutine stack switching etc.) pieces of code. An interpreter loop is the ultimate hot loop; all code goes through it.
I wonder how close you'd get to that by running the interpreter with gcc's or llvm's profiling on, and then recompiling with profile-guided optimization. If it's just a matter of compilers having heuristics tuned towards the wrong kind of workload, PGO ought to re-tune them appropriately, if enough of the heuristics in question actually look at the profiling data.
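To make the PGO idea concrete, here is a rough sketch of the experiment, assuming a toy switch-dispatched interpreter rather than LuaJIT's actual one. The opcodes, the `run` and `run_counted_loop` names, the `PGO_DEMO_MAIN` macro and the `interp.c` filename are all made up for illustration; the only real pieces are gcc's `-fprofile-generate` and `-fprofile-use` flags. Whether the profile actually changes the dispatch code is exactly the open question, so treat this as a way to try it, not as a result.

```c
/* Toy bytecode interpreter for trying out PGO on a dispatch loop.
 * Everything here is hypothetical; it is not LuaJIT's instruction set.
 *
 * One way to run the experiment with gcc (assuming this file is interp.c):
 *   gcc -O2 -fprofile-generate -DPGO_DEMO_MAIN interp.c -o interp
 *   ./interp        # the instrumented run writes profile data (.gcda)
 *   gcc -O2 -fprofile-use -DPGO_DEMO_MAIN interp.c -o interp
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Every instruction is three words wide: opcode, operand a, operand b. */
enum { OP_LOADI, OP_ADD, OP_DECJNZ, OP_HALT };
enum { NREGS = 4 };

/* The hot loop: fetch, decode, dispatch through a switch. */
int64_t run(const int64_t *code) {
    int64_t reg[NREGS] = {0};
    int64_t pc = 0;
    for (;;) {
        int64_t op = code[pc], a = code[pc + 1], b = code[pc + 2];
        assert(a >= 0 && a < NREGS);
        pc += 3;
        switch (op) {
        case OP_LOADI:  /* reg[a] = immediate b */
            reg[a] = b;
            break;
        case OP_ADD:    /* reg[a] += reg[b] */
            assert(b >= 0 && b < NREGS);
            reg[a] += reg[b];
            break;
        case OP_DECJNZ: /* decrement reg[a]; jump to word index b if nonzero */
            if (--reg[a] != 0) pc = b;
            break;
        case OP_HALT:   /* return reg[a] */
            return reg[a];
        default:
            assert(!"unknown opcode");
            return -1;
        }
    }
}

/* Builds and runs a loop that adds 3 to an accumulator n times (n >= 1). */
int64_t run_counted_loop(int64_t n) {
    assert(n >= 1);
    const int64_t prog[] = {
        OP_LOADI,  0, 0,   /* word 0:  acc  = 0 */
        OP_LOADI,  1, 3,   /* word 3:  step = 3 */
        OP_LOADI,  2, n,   /* word 6:  count = n */
        OP_ADD,    0, 1,   /* word 9:  acc += step */
        OP_DECJNZ, 2, 9,   /* word 12: loop back to word 9 while count != 0 */
        OP_HALT,   0, 0,   /* word 15: return acc */
    };
    return run(prog);
}

#ifdef PGO_DEMO_MAIN
/* Training workload for the instrumented build. */
int main(void) {
    printf("%lld\n", (long long)run_counted_loop(1000000));
    return 0;
}
#endif
```

A real test would need a training workload that resembles actual programs, since the profile only re-tunes the heuristics for whatever bytecode mix it sees. Comparing the generated assembly of `run` before and after the `-fprofile-use` build is probably the quickest way to tell whether the compiler paid attention.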
Final words: A modern compiler contains a huge set of heuristics that interact with each other in strange ways. They have been tuned for 'average' code. But an interpreter is a very different beast, so you'll inevitably get disappointing results.
More info in this reddit comment: http://www.reddit.com/r/programming/comments/badl2/luajit_2_...