You're welcome. The thanks should go to haberman, though, who not only linked to the email but apparently also asked the questions that led to such a great explanation.
Reading it really made me want to do some assembly programming.
DynASM (also written by LuaJIT's author Mike Pall), which I used to write my JIT, is also an impressive piece of work. It makes writing a JIT so pleasant that I meant to write an article walking through the process of building a simple JIT. I was going to implement a JIT for the "Universal Machine" from ICFP 2006 (http://www.boundvariable.org/) as the example, which is an absolutely delightful problem, but unfortunately I discovered that it is heavily self-modifying, and thus hard to get a big performance boost from. But it looks like someone did it anyway: http://paul.bluereboot.net/um/
> It makes writing a JIT so pleasant that I meant to write an article walking through the process of building a simple JIT.
Please do, if you have the time. DynASM really is a stroke of genius.
The way it works is that the JIT contains bits of pre-generated machine code, and then a little bit of bytecode stitches these bits together and emits them. This means the actual runtime piece of the JIT doesn't need to know all the wacky details of x86 instruction encoding. All that is done in a preprocessor written in Lua.
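Roughly, a source file for DynASM looks like the sketch below (a minimal example with made-up names, nothing from LuaJIT itself; you run the .dasc file through the dynasm.lua preprocessor before the C compiler ever sees it):

#include <stddef.h>
#include <sys/mman.h>
#include "dasm_proto.h"
#include "dasm_x86.h"

|.arch x64
|.actionlist actions

/* Build a function that returns 'value'. The |-prefixed lines are the
   assembler fragments; the preprocessor turns each one into a dasm_put()
   call that appends pre-encoded bytes from the action list. */
static void *jit_return_const(int value)
{
    dasm_State *d;
    dasm_init(&d, 1);              /* one code section */
    dasm_setup(&d, actions);
#define Dst &d                     /* dasm_put() expands against Dst */

    | mov eax, value
    | ret

    size_t size;
    dasm_link(&d, &size);
    void *buf = mmap(0, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    dasm_encode(&d, buf);
    mprotect(buf, size, PROT_READ | PROT_EXEC);
    dasm_free(&d);
    return buf;                    /* callable as int (*)(void) */
}

Note that the immediate in the mov is a plain C expression, so operands like that can be parameterized at runtime; the instruction encodings themselves are fixed at preprocessing time.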
I never got around to finding out how to do dynamic register allocation though. It seemed to me that there was no way to parameterize the pre-generated machine code, but presumably I'm missing something.
I keep wanting to try it, too. But Lua + LuaJIT is so good that I find I never really need that last one percent of speedup.
One of the things that I have found marvelous and counterintuitive about LuaJIT is Mike Pall's recurring warning to unlearn optimization idioms you might have picked up to accelerate standard Lua (which is already quite fast).
For example, the Lua language has a 'local' keyword that lets you shortcut identifier lookups in the global namespace. Using a local alias for a global function in a tight loop inside a standard Lua program can and often does result in a nice, juicy speedup.
LuaJIT, however, has highly tuned code generation for i386, x86_64 and PPC that will optimize far better if you leave your clever aliasing out and just run a naive, plain-vanilla Lua version. See [0]: "No, please don't do this."
He has also provided a new mechanism to dynamically wire C libraries and ad-hoc data structures directly into the LuaJIT runtime [1].
If there is scope for me to use assembler in this ecosystem somewhere, I can't see it yet. :)
Mike did say on the list that he was considering at some point adding the ability to natively mix assembler code into Lua; that might give you an opportunity!
I enjoyed this article; I always find the argument that "these days you can't beat the compiler" fishy. I remember more than 25 years ago I wrote an 8080 machine-code interpreter in 8088/86 assembly. My innermost interpreter comprised 3 lines of asm:
lodsb ;al = *si++
mov ah,al ;eg if al=3eh (=mvi a,dd), then ax=0x3e3e
jmp ax ;eg goto 0x3e3e. Preloader puts the emulation
; code for mvi a,dd followed by another copy
; of the inner interpreter at cs:0x3e3e.
This approach uses 64K to sparsely store the emulation code for each of the 256 opcodes at 0x0000,0x0101,0x0202 ... etc. which seemed a pretty outrageous waste of memory at the time. But it was worth it because none of the 3 inner interpreter instructions modified the flags, which meant I could use the native 8086 flags to directly emulate the 8080 flags, a huge speed win. Remember that the 8086 had LAHF and SAHF instructions to help solve this very problem, yet my approach avoiding the whole flags issue was still twice as quick as commercial equivalents.
Okay, admittedly this is all irrelevant dinosaur nostalgia from the perspective of modern technology, but it reinforces the point that there's no substitute for human creativity when it comes to real hardcore optimization tricks.
> I enjoyed this article; I always find the argument that "these days you can't beat the compiler" fishy.
I enjoyed your example; very clever. I think the "these days you can't beat the compiler" should really be understood to carry a qualifier like "for compiling things that we understand well enough and that occur often enough that it's worth teaching the compiler how to optimize them". Knuth's maxim "Science is what we understand well enough to explain to a computer. Art is everything else we do." seems particularly apt here.
Problems where humans can apply very domain-specific knowledge to come up with clever solutions are not the sort of problems that a general-purpose compiler is going to generate good code for. Knowing when you're dealing with one of those problems (e.g. writing inner loops for multi-precision arithmetic or vectorizing multimedia codecs) is part of being a good programmer. But to a first or even second approximation, you are not dealing with one of those problems, so you might as well just sit back and let the compiler do its job.
I was with you until "but to a first or even second approximation, you are not dealing with one of those problems, so ...". Perhaps there is a typo there and it should have been something like "but when to a first or even...."
A great find. And some of this argument is very simple even if you don't do a lot of assembly programming; for instance:
* that C-code interpreters force the compiler to do register allocation across the slow paths of all instructions, even though the interpreter dev can in this case definitely predict which things need to be in registers;
* and, the optimizer in the compiler makes it hard to tune the code for the 20% that does 80% of the work, whereas an assembly programmer can hoist the slow-path code out of the I-cache entirely.
Dunno. I've found that with __builtin_expect() in GCC, the code corresponding to an unlikely branch is placed at the end of the interpreter body, even after all the other instructions, which is probably as good as you can do icache-wise. I'd have to check again, but such annotations could inform register allocation as well.
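Something like this sketch, say (vm_add_slow is a made-up out-of-line slow path, not code from the article):

#include <stdint.h>

#define unlikely(x) __builtin_expect(!!(x), 0)

int64_t vm_add_slow(int64_t a, int64_t b);   /* hypothetical slow path */

/* GCC treats the annotated branch as cold and moves its block out of the
   hot straight-line sequence, typically to the end of the function. */
int64_t vm_add(int64_t a, int64_t b)
{
    int64_t r;
    if (unlikely(__builtin_add_overflow(a, b, &r)))
        return vm_add_slow(a, b);
    return r;
}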
I was thinking about this, and I think you're mostly right. There's a bit more that needs to be done, but some sort of expect builtin (name varies based on compiler) would help.
What I'm thinking is, he needs to factor all the instructions out into separate functions so that he stops hitting the optimization boundaries (or he may be able to override those with parameters to the compiler). Then, mark the hot paths with the expect builtin.
Once that's in, use profile-guided optimization (again, the name varies by compiler) to feed the compiler more frequency information on the branches. Partial inlining (not sure if GCC has this optimization yet) should take care of the rest.
As for the register assignment, well, that's hard to get right while getting the instruction scheduling right at the same time. Profile-guided optimization will help guide the global register allocator in the right direction, but it likely won't be perfect. Thankfully, on modern architectures, register-to-register moves are pretty much NOPs thanks to register files, short-circuiting, and out-of-order execution.
Final words: A modern compiler contains a huge set of heuristics that interact with each other in strange ways. They have been tuned for 'average' code. But an interpreter is a very different beast, so you'll inevitably get disappointing results.
"Also, it's not a good idea to second-guess the region selection heuristics. I tried this many times ('This can't be the best path!') and failed miserably. The selected traces are sometimes very strange, but so far they turned out to be near optimal. It's quite embarrasing if your own creation turns out to make smarter decisions than yourself. ;-)"
What you'll ultimately end up doing is spending more time fiddling with pragmas and code layout trying to control what machine code the compiler produces than you save by not writing the assembler directly yourself.
It has to be said, though, that we really are talking about a critical loop here. This is no real justification for using assembler instead of a compiler outside of extremely hot (or extremely odd, e.g. dynamic invocation, coroutine stack switching etc.) pieces of code. An interpreter loop is the ultimate hot loop; all code goes through it.
I wonder how close you'd get to that by running the interpreter with gcc's or llvm's profiling on, and then recompiling with profile-guided optimization. If it's just a matter of compilers having heuristics tuned towards the wrong kind of workload, PGO ought to re-tune them appropriately, if enough of the heuristics in question actually look at the profiling data.
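With GCC the whole experiment is just the following (file and workload names are placeholders):

gcc -O2 -fprofile-generate interp.c -o interp
./interp typical_workload          # writes .gcda edge/branch counts
gcc -O2 -fprofile-use interp.c -o interp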
A couple of years ago I wrote a direct-threaded .NET interpreter. This article made me go and have a look at the assembly that is being generated for the simple 'add' instruction case.
The .NET IL is stack-based and the interpreter runs it directly, so it will never be as efficient as LuaJIT's register-based approach, but I was somewhat disappointed to see this as the output of the Windows VS2008 C compiler in release mode (all optimizations enabled):
The assembly after 'BINARY_OP(...)' is the integer add code. pCurEvalStack points to the top of the current evaluation stack. Binary ops pop the top two items off the stack and push the result back on.
There are 3 separate loads of the same memory address, which surely can be done better.
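For context, the macro is shaped roughly like this; this is my reconstruction from the description above (assuming pCurEvalStack is a byte pointer), not the actual source:

/* Pop the right-hand operand, combine with the left one in place. Note
   that pCurEvalStack is read several times per expansion; unless the
   compiler can prove the stores through the casted pointers never alias
   it, it has to reload it from memory each time. */
#define BINARY_OP(tRet, tA, tB, op) do { \
    pCurEvalStack -= sizeof(tB); \
    *(tRet*)(pCurEvalStack - sizeof(tA)) = \
        *(tA*)(pCurEvalStack - sizeof(tA)) op *(tB*)pCurEvalStack; \
} while (0)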
I'll probably have a go at altering the definition of the BINARY_OP(...) macro to see if I can persuade it to generate better code. I'm not keen to hand-write the assembly for this, as the interpreter needs to compile for multiple processors.
...which can probably be improved, or at least specialised to give better results for binary ops where the operand and result types are all the same - e.g. integer addition.
which is still unimaginably terrible. Why isn't it re-using values that have already been loaded into registers? Why isn't it using a register for pRet? I've even told it to! Although I think it's documented that the MS compiler ignores the 'register' keyword.
And this is with all optimisations turned on. How depressing.
Most compilers completely ignore "register" these days. I also believe that aliasing the stack while performing the actual operation greatly hinders the compiler's ability to optimize.
It does look as though no optimisation is being performed at all.
I just isolated the use of the BINARY_OP macro and put it in a simple-ish test function. Now it's being optimized excellently.
The apparently unoptimisable code is in a hugely long and complex function, and I wonder if something in it is preventing all optimisation from occurring within that function. I've quickly looked through the assembly produced for the function and all of it appears unoptimised, whereas code in other functions is optimised OK.
What can prevent all optimisation from occurring in a function?
How long is long? Is this a code-generated function? I've seen some compilers that have limits where they turn off optimization due to compile-time throughput issues.
Could you break the function up into pieces and see if it still doesn't optimize?
The function starts at line 232, and the disassembly I was looking at is from line 1852.
This is the function that implements the direct-threaded interpreter. Direct threading works by using gotos (jmps) to dispatch the next instruction to be interpreted, which means that I don't think it can be broken up into multiple smaller functions.
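For anyone who hasn't seen the technique, here's a toy version using GCC's labels-as-values extension (opcodes invented for the example); the label addresses are only meaningful inside the one function, which is why it resists being split up:

/* Toy direct-threaded interpreter with GCC's computed goto.
   Opcodes: 0 = push immediate, 1 = add, 2 = halt. */
int run(const int *ip)
{
    static void *dispatch[] = { &&op_push, &&op_add, &&op_halt };
    int stack[64], *sp = stack;

#define NEXT() goto *dispatch[*ip++]
    NEXT();

op_push: *sp++ = *ip++;          NEXT();
op_add:  sp--; sp[-1] += sp[0];  NEXT();
op_halt: return sp[-1];
}

/* run((const int[]){0,2, 0,3, 1, 2}) evaluates 2+3 and returns 5 */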
It's also not immediately obvious to me that pCurEvalStack is initialized.
Having a look at the architecture optimization manual re: the redundant loads, it's not immediately obvious to me that modern processors won't handle this fine (the reloads should be satisfied from the store buffer or L1 cache): http://www.intel.com/Assets/PDF/manual/248966.pdf
And also, the nice thing about the BINARY_OP() macro is that it can take the types and the operator as arguments (e.g. BINARY_OP(I32, I32, I32, +), BINARY_OP(I64, I64, I32, <<)), meaning it can be used for many operations on many types. I would have to write separate functions for each case if using an inline function.
And pCurEvalStack is initialised in the LOAD_METHOD_STATE macro, referenced first on line 590, before interpretation begins.
That's definitely the problem; compiler optimizations are easily O(N^2) in the number of operations in a function, so rather than taking forever to compile, compilers will turn off optimization if it takes too long.
Ask yourself, do you even need a function that large? You might already be approaching the limits of the L1 cache, at which point you might as well be using separate functions since the call itself will be negligible to the cache miss.
I don't really know Microsoft's C compiler well enough to answer why your specific example fails so spectacularly, but I can make a guess: You are using inline assembly containing a computed jump instruction. In this situation the compiler no longer has a complete view of the control flow graph of your function, and it is likely that all optimizations which need a control flow graph will be disabled. Common subexpression elimination needs a CFG since it determines if two computations are the same for all paths in the program.
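As a toy illustration of what gets lost:

/* The compiler may reuse the first p->x + p->y for the second only if it
   can prove that no path between the two uses stores to *p. An opaque
   computed jump in between hides those paths, so common subexpression
   elimination has to give up. */
struct pt { int x, y; };

int twice(struct pt *p)
{
    int a = p->x + p->y;
    /* ...an opaque jump here would force the reload below... */
    int b = p->x + p->y;   /* otherwise folded to b = a */
    return a + b;
}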
Now I'm reasonably certain that the C standard does not allow inline assembly like this. However, given the decades of legacy software that Microsoft has to contend with I would not be too surprised if this was part of the reason.
By the way, GCC's computed gotos are not actually much better. In this case you get a control flow graph where you cannot split all critical edges. There is a workaround for this, but it's probably easier to just disable all optimizations that depend on it.
I've tested that inline assembler performing the computed goto in a smaller test function, and optimisation is performed.
And almost all of the C code that I would like to be optimised is in short, fairly self-contained statements that use their own local variables, so I would expect it to be optimised well regardless.
Again, I've tested this in a smaller function and good optimisation is done.
Therefore I am coming to the conclusion that the problem may just be the length of the function.
I don't have time to do it now, but next week I'll experiment with function length, although as I stated earlier, I'm fairly sure that a threaded interpreter has to be implemented entirely within a single function.
This is the assembly produced by the Microsoft C compiler for the part of my interpreter that adds two 32-bit integers together.
This code is executed whenever the .NET IL 'add' opcode is encountered and the top two evaluation stack values are 32-bit integers.
(Note that the evaluation stack type analysis is done in a pre-execution stage, so when the code shown is executing it is already known that the top two stack values are 32-bit integers)
This is a very interesting post that fits nicely with many other interesting statements Mike Pall has made (such as his reference to the LuaJIT2 interpreter being faster than the LuaJIT1 JIT compiler in some cases: http://lambda-the-ultimate.org/node/3851#comment-57646).
On a related note, a similar problem is highlighted in a paper by Anton Ertl and David Gregg ([1]), where they claim that the slowdown of an efficient interpreter relative to an optimizing native-code compiler is about 1:10, whereas the slowdown between an efficient interpreter and an inefficient one is 1:100. Consequently, there seems to be a lot of optimization potential (though I admit that, depending on the kind of interpreter, it might be unlikely to stay within a slowdown of 1:10).
While this is true, it's important to note that you need to know a lot about how a modern processor works to do this.
I once looked at the possibility of optimizing the bytecode dispatcher for a JavaScript engine (Opera's old one), and my 386 assembly skills turned out to be far outdated: removing one seemingly useless instruction of the ~10 core assembly instructions actually slowed the whole thing down. (Unfortunately, I have forgotten the details.)
It seems to me that this is more of a complaint leveled at C-ish compilers. One could envision a language that makes the sort of structures that baffle a normal compiler more visible and amenable to optimization. I'm not sure if some syntax and some clever semantics could address all these problems, but...
When I was first learning C formally, I had this idea that C was the "fastest language". My instructor said, day 1, that the problem with C was that it was not amenable to optimization. It was pretty surprising, but as time goes on and languages, compilers, and interpreters get better I think we'll see this sort of problem more and more.
// One could envision a language that makes the sort of structures that baffle a normal compiler more visible and amenable to optimization. //
Yes, the problem is that it starts to look like assembly at that point. With a whole bunch of notes on each variable and operation to let the compiler know about whether you care about specific aspects of the result of the operation. (See the 'restrict' keyword in C for an example of what I mean)
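For instance, a sketch of what restrict buys you:

/* 'restrict' promises the compiler that dst and src never overlap, so it
   can keep values in registers and vectorize the loop instead of
   re-reading memory after every store. */
void scale(float *restrict dst, const float *restrict src, float k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}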
That sounds like PL/X [1] to me. That language will let you do things that REALLY aren't advisable on most systems, like specify pointer sizes (per variable), specify address mode (arbitrary blocks, functions, nested functions, per module, whatever), and specify registers for any variable. Oh, and my favourite: when calling a function, you can specify an arbitrary global label for it to return to.
Or of course you can write as cleanly as possible, let the compiler do the lion's share of the work, and fix up what you need to.
"On a TriMedia VLIW with a load latency of three cycles and a jump latency of four cycles, the interpreter achieves a peak performance of four cycles per instruction and a sustained performance of 6.27 cycles per instruction. Experiments are described that demonstrate the compression quality of the system and the execution speed of the pipelined interpreter; these were found to be about five times more compact than native TriMedia code and a slowdown of about eight times, respectively."
They used pragmas. They used loop pipelining (http://en.wikipedia.org/wiki/Software_pipelining) and compiler optimizations. Their results aren't that bad: they achieved CISC code density for an 8x slowdown.
I've always thought that the standard wisdom is that most developers can't beat the compiler by hand-writing most of their code in ASM, but in fields like game development, things start at a higher level (Lua, C++, C) and work their way down to increase performance as bottlenecks are found.
Compilers are like Google or Wikipedia, 99% of the time they know more than 99% of developers.
Also, about an hour before reading this I put a small functional programming library for Lua on GitHub (https://github.com/davidhollander/fn). Here's an example of how to create a sum function with fold left and an addition function:
require'fn'
sum = fn.partial( fn.foldl, function(a, b) return a+b end)
x = sum(1, 2, 3, 4, 5, 6, 7, 8, 9)
print(x)
-- 45
The referenced page is more interesting than what the title promises. The title is trivially true; the referenced page makes the stronger claim that a compiler cannot _tie_ with hand-coded assembly.
It's a very old story: hand-coded assembler can commonly run several times faster than compiled code, and faster still when using special instructions on special data. But on large programs, no, because some of what compilers and runtime libraries do for large programs is too difficult to do by hand. Indeed, some of what compilers do in register allocation (and maybe cache usage) will be too difficult for hand-coded assembler.
More generally, the execution time of a small, single-threaded program is now nearly never of much concern. E.g., when you have a processor with dozens of cores, it's easy to let a small, single-threaded program have a core and then forget about it.
Execution time remains important, but now mostly just for large, multi-threaded programs.