You're welcome. The thanks should go to haberman, though, who not only linked to the email but apparently also asked the questions that led to such a great explanation.
Reading it really made me want to do some assembly programming.
DynASM (also written by LuaJIT's author Mike Pall), which I used to write my JIT, is also an impressive piece of work. It makes writing a JIT so pleasant that I meant to write an article walking through the process of building a simple JIT. I was going to implement a JIT for the "Universal Machine" from ICFP 2006 (http://www.boundvariable.org/) as the example, which is an absolutely delightful problem, but unfortunately I discovered that it is heavily self-modifying, and thus hard to get a big performance boost from. But it looks like someone did it anyway: http://paul.bluereboot.net/um/
> It makes writing a JIT so pleasant that I meant to write an article walking through the process of building a simple JIT.
Please do, if you have the time. DynASM really is a stroke of genius.
The way it works is that the JIT contains bits of pre-generated machine code, and then a little bit of bytecode stitches these bits together and emits them. This means the actual runtime piece of the JIT doesn't need to know all the wacky details of x86 instruction encoding. All that is done in a preprocessor written in Lua.
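Roughly, a source file for DynASM looks like the sketch below (a minimal example with made-up names, nothing from LuaJIT itself; you run the .dasc file through the dynasm.lua preprocessor before the C compiler ever sees it):

#include <stddef.h>
#include <sys/mman.h>
#include "dasm_proto.h"
#include "dasm_x86.h"

|.arch x64
|.actionlist actions

/* Build a function that returns 'value'. The |-prefixed lines are the
   assembler fragments; the preprocessor turns each one into a dasm_put()
   call that appends pre-encoded bytes from the action list. */
static void *jit_return_const(int value)
{
    dasm_State *d;
    dasm_init(&d, 1);              /* one code section */
    dasm_setup(&d, actions);
#define Dst &d                     /* dasm_put() expands against Dst */

    | mov eax, value
    | ret

    size_t size;
    dasm_link(&d, &size);
    void *buf = mmap(0, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    dasm_encode(&d, buf);
    mprotect(buf, size, PROT_READ | PROT_EXEC);
    dasm_free(&d);
    return buf;                    /* callable as int (*)(void) */
}

Note that the immediate in the mov is a plain C expression, so operands like that can be parameterized at runtime; the instruction encodings themselves are fixed at preprocessing time.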
I never got around to finding out how to do dynamic register allocation though. It seemed to me that there was no way to parameterize the pre-generated machine code, but presumably I'm missing something.
I keep wanting to try it, too. But Lua + LuaJIT is so good that I find I never really need that last one percent of speedup.
One of the things that I have found marvelous and counterintuitive about LuaJIT is Mike Pall's recurring warning to unlearn optimization idioms you might have picked up to accelerate standard Lua (which is already quite fast).
For example, the Lua language has a 'local' keyword that lets you shortcut identifier lookups in the global namespace. Using a local alias for a global function in a tight loop inside a standard Lua program can and often does result in a nice, juicy speedup.
LuaJIT, however, has highly tuned code generation for i386, x86_64 and PPC that will optimize far better if you leave your clever aliasing out and just run a naive, plain-vanilla Lua version. See [0]: "No, please don't do this."
He has also provided a new mechanism to dynamically wire C libraries and ad-hoc data structures directly into the LuaJIT runtime [1].
If there is scope for me to use assembler in this ecosystem somewhere, I can't see it yet. :)
Mike did say on the list that he was considering at some point adding the ability to natively mix assembler code into Lua; that might give you an opportunity!
I enjoyed this article; I always find the argument that "these days you can't beat the compiler" fishy. I remember more than 25 years ago I wrote an 8080 machine-code interpreter in 8088/86 assembly. My innermost interpreter comprised 3 lines of asm:
lodsb ;al = *si++
mov ah,al ;eg if al=3eh (=mvi a,dd), then ax=0x3e3e
jmp ax ;eg goto 0x3e3e. Preloader puts the emulation
; code for mvi a,dd followed by another copy
; of the inner interpreter at cs:0x3e3e.
This approach uses 64K to sparsely store the emulation code for each of the 256 opcodes at 0x0000,0x0101,0x0202 ... etc. which seemed a pretty outrageous waste of memory at the time. But it was worth it because none of the 3 inner interpreter instructions modified the flags, which meant I could use the native 8086 flags to directly emulate the 8080 flags, a huge speed win. Remember that the 8086 had LAHF and SAHF instructions to help solve this very problem, yet my approach avoiding the whole flags issue was still twice as quick as commercial equivalents.
Okay, admittedly this is all irrelevant dinosaur nostalgia from the perspective of modern technology, but it reinforces the point that there's no substitute for human creativity when it comes to real hardcore optimization tricks.
> I enjoyed this article; I always find the argument that "these days you can't beat the compiler" fishy.
I enjoyed your example; very clever. I think the "these days you can't beat the compiler" should really be understood to carry a qualifier like "for compiling things that we understand well enough and that occur often enough that it's worth teaching the compiler how to optimize them". Knuth's maxim "Science is what we understand well enough to explain to a computer. Art is everything else we do." seems particularly apt here.
Problems where humans can apply very domain-specific knowledge to come up with clever solutions are not the sort of problems that a general-purpose compiler is going to generate good code for. Knowing when you're dealing with one of those problems (e.g. writing inner loops for multi-precision arithmetic or vectorizing multimedia codecs) is part of being a good programmer. But to a first or even second approximation, you are not dealing with one of those problems, so you might as well just sit back and let the compiler do its job.
I was with you until "but to a first or even second approximation, you are not dealing with one of those problems, so ...". Perhaps there is a typo there and it should have been something like "but when to a first or even...."
A great find. And some of this argument is very simple even if you don't do a lot of assembly programming; for instance:
* that C-code interpreters force the compiler to do register allocation across the slow paths of all instructions, even though the interpreter dev can in this case definitely predict which things need to be in registers;
* and, the optimizer in the compiler makes it hard to tune the code for the 20% that does 80% of the work, whereas an assembly programmer can hoist the slow-path code out of the I-cache entirely.
Dunno. I've found that with __builtin_expect() in GCC, the code corresponding to an unlikely branch is placed at the end of the interpreter body, even after all the other instructions, which is probably as good as you can do icache-wise. I'd have to check again, but such annotations could inform register allocation as well.
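Something like this sketch, say (vm_add_slow is a made-up out-of-line slow path, not code from the article):

#include <stdint.h>

#define unlikely(x) __builtin_expect(!!(x), 0)

int64_t vm_add_slow(int64_t a, int64_t b);   /* hypothetical slow path */

/* GCC treats the annotated branch as cold and moves its block out of the
   hot straight-line sequence, typically to the end of the function. */
int64_t vm_add(int64_t a, int64_t b)
{
    int64_t r;
    if (unlikely(__builtin_add_overflow(a, b, &r)))
        return vm_add_slow(a, b);
    return r;
}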
I was thinking about this, and I think you're mostly right. There's a bit more that needs to be done, but some sort of expect builtin (name varies based on compiler) would help.
What I'm thinking is, he needs to factor all the instructions out into separate functions so that he stops hitting the optimization boundaries (or he may be able to override those with parameters to the compiler). Then, mark the hot paths with the expect builtin.
Once that's in, use profile-guided optimization (again, the name varies by compiler) to feed the compiler more frequency information on the branches. Partial inlining (not sure if GCC has this optimization yet) should take care of the rest.
As for the register assignment, well, that's hard to get right while getting the instruction scheduling right at the same time. Profile-guided optimization will help guide the global register allocator in the right direction, but it likely won't be perfect. Thankfully, on modern architectures, register-to-register moves are pretty much NOPs thanks to register files, short-circuiting, and out-of-order execution.
Final words: A modern compiler contains a huge set of heuristics that interact with each other in strange ways. They have been tuned for 'average' code. But an interpreter is a very different beast, so you'll inevitably get disappointing results.
"Also, it's not a good idea to second-guess the region selection heuristics. I tried this many times ('This can't be the best path!') and failed miserably. The selected traces are sometimes very strange, but so far they turned out to be near optimal. It's quite embarrasing if your own creation turns out to make smarter decisions than yourself. ;-)"
What you'll ultimately end up doing is spending more time fiddling with pragmas and code layout trying to control what machine code the compiler produces than you save by not writing the assembler directly yourself.
It has to be said, though, that we really are talking about a critical loop here. This is no real justification for using assembler instead of a compiler outside of extremely hot (or extremely odd, e.g. dynamic invocation, coroutine stack switching etc.) pieces of code. An interpreter loop is the ultimate hot loop; all code goes through it.
I wonder how close you'd get to that by running the interpreter with gcc's or llvm's profiling on, and then recompiling with profile-guided optimization. If it's just a matter of compilers having heuristics tuned towards the wrong kind of workload, PGO ought to re-tune them appropriately, if enough of the heuristics in question actually look at the profiling data.
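With GCC the whole experiment is just the following (file and workload names are placeholders):

gcc -O2 -fprofile-generate interp.c -o interp
./interp typical_workload          # writes .gcda edge/branch counts
gcc -O2 -fprofile-use interp.c -o interp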
A couple of years ago I wrote a direct-threaded .NET interpreter. This article made me go and have a look at the assembly that is being generated for the simple 'add' instruction case.
The .NET IL is stack-based and the interpreter runs it directly, so it will never be as efficient as LuaJIT's register-based approach, but I was somewhat disappointed to see this as the output of the Windows VS2008 C compiler in release mode (all optimizations enabled):
The assembly after 'BINARY_OP(...)' is the integer add code. pCurEvalStack points to the top of the current evaluation stack. Binary ops pop the top two items off the stack and push the result back on.
There are 3 separate loads of the same memory address, which surely can be done better.
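For context, the macro is shaped roughly like this; this is my reconstruction from the description above (assuming pCurEvalStack is a byte pointer), not the actual source:

/* Pop the right-hand operand, combine with the left one in place. Note
   that pCurEvalStack is read several times per expansion; unless the
   compiler can prove the stores through the casted pointers never alias
   it, it has to reload it from memory each time. */
#define BINARY_OP(tRet, tA, tB, op) do { \
    pCurEvalStack -= sizeof(tB); \
    *(tRet*)(pCurEvalStack - sizeof(tA)) = \
        *(tA*)(pCurEvalStack - sizeof(tA)) op *(tB*)pCurEvalStack; \
} while (0)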
I'll probably have a go at altering the definition of the BINARY_OP(...) macro to see if I can persuade it to generate better code. I'm not keen to hand-write the assembly for this, as the interpreter needs to compile for multiple processors.
...which can probably be improved, or at least specialised to give better results for binary ops where the operand and result types are all the same - e.g. integer addition.
which is still unimaginably terrible. Why isn't it re-using values that have already been loaded into registers? Why isn't it using a register for pRet? I've even told it to! Although I think it's documented that the MS compiler ignores the 'register' keyword.
And this is with all optimisations turned on. How depressing.
Most compilers completely ignore "register" these days. I also believe that aliasing the stack while performing the actual operation greatly hinders the compiler's ability to optimize.
It does look as though no optimisation is being performed at all.
I just isolated the use of the BINARY_OP macro and put it in a simple-ish test function. Now it's being optimized excellently.
The apparently unoptimisable code is in a hugely long and complex function, and I wonder if something in it is preventing all optimisation from occurring within that function. I've quickly looked through the assembly produced for the function and all of it appears unoptimised, whereas code in other functions is optimised OK.
What can prevent all optimisation from occurring in a function?
How long is long? Is this a code-generated function? I've seen some compilers that have limits where they turn off optimization due to compile-time throughput issues.
Could you break the function up into pieces and see if it still doesn't optimize?
The function starts at line 232, and the disassembly I was looking at is from line 1852.
This is the function that implements the direct-threaded interpreter. Direct threading works by using gotos (jmps) to dispatch the next instruction to be interpreted, which means that I don't think it can be broken up into multiple smaller functions.
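For anyone who hasn't seen the technique, here's a toy version using GCC's labels-as-values extension (opcodes invented for the example); the label addresses are only meaningful inside the one function, which is why it resists being split up:

/* Toy direct-threaded interpreter with GCC's computed goto.
   Opcodes: 0 = push immediate, 1 = add, 2 = halt. */
int run(const int *ip)
{
    static void *dispatch[] = { &&op_push, &&op_add, &&op_halt };
    int stack[64], *sp = stack;

#define NEXT() goto *dispatch[*ip++]
    NEXT();

op_push: *sp++ = *ip++;          NEXT();
op_add:  sp--; sp[-1] += sp[0];  NEXT();
op_halt: return sp[-1];
}

/* run((const int[]){0,2, 0,3, 1, 2}) evaluates 2+3 and returns 5 */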
It's also not immediately obvious to me that pCurEvalStack is initialized.
Having a look at the architecture optimization manual re: the redundant loads, it's not immediately obvious to me that modern processors won't handle this fine (the reloads should be satisfied from the store buffer or L1 cache): http://www.intel.com/Assets/PDF/manual/248966.pdf
And also, the nice thing about the BINARY_OP() macro is that it can take the types and the operator as arguments (e.g. BINARY_OP(I32, I32, I32, +), BINARY_OP(I64, I64, I32, <<)), meaning it can be used for many operations on many types. I would have to write separate functions for each case if using an inline function.
And pCurEvalStack is initialised in the LOAD_METHOD_STATE macro, referenced first on line 590, before interpretation begins.
That's definitely the problem; compiler optimizations are easily O(N^2) in the number of operations in a function, so rather than taking forever to compile, compilers will turn off optimization if it takes too long.
Ask yourself, do you even need a function that large? You might already be approaching the limits of the L1 cache, at which point you might as well be using separate functions since the call itself will be negligible to the cache miss.
I don't really know Microsoft's C compiler well enough to answer why your specific example fails so spectacularly, but I can make a guess: You are using inline assembly containing a computed jump instruction. In this situation the compiler no longer has a complete view of the control flow graph of your function, and it is likely that all optimizations which need a control flow graph will be disabled. Common subexpression elimination needs a CFG since it determines if two computations are the same for all paths in the program.
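As a toy illustration of what gets lost:

/* The compiler may reuse the first p->x + p->y for the second only if it
   can prove that no path between the two uses stores to *p. An opaque
   computed jump in between hides those paths, so common subexpression
   elimination has to give up. */
struct pt { int x, y; };

int twice(struct pt *p)
{
    int a = p->x + p->y;
    /* ...an opaque jump here would force the reload below... */
    int b = p->x + p->y;   /* otherwise folded to b = a */
    return a + b;
}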
Now I'm reasonably certain that the C standard does not allow inline assembly like this. However, given the decades of legacy software that Microsoft has to contend with I would not be too surprised if this was part of the reason.
By the way, GCC's computed gotos are not actually much better. In this case you get a control flow graph where you cannot split all critical edges. There is a workaround for this, but it's probably easier to just disable all optimizations that depend on it.
I've tested that inline assembler performing the computed goto in a smaller test function, and optimisation is performed.
And almost all of the C code that I would like to be optimised is in short, fairly self-contained statements that use their own local variables, so I would expect it to be optimised well regardless.
Again, I've tested this in a smaller function and good optimisation is done.
Therefore I am coming to the conclusion that the problem may just be the length of the function.
I don't have time to do it now, but next week I'll experiment with function length, although as I stated earlier, I'm fairly sure that a threaded interpreter has to be implemented entirely within a single function.
This is the assembly produced by the Microsoft C compiler for the part of my interpreter that adds two 32-bit integers together.
This code is executed whenever the .NET IL 'add' opcode is encountered and the top two evaluation stack values are 32-bit integers.
(Note that the evaluation stack type analysis is done in a pre-execution stage, so when the code shown is executing it is already known that the top two stack values are 32-bit integers)
This is a very interesting post that fits nicely with many other interesting statements Mike Pall has made (such as his reference to the LuaJIT2 interpreter being faster than the LuaJIT1 JIT compiler in some cases: http://lambda-the-ultimate.org/node/3851#comment-57646).
On a related note, a similar problem is highlighted in a paper by Anton Ertl and David Gregg ([1]), where they claim that the slowdown of an efficient interpreter relative to an optimizing native-code compiler is about 1:10, whereas the slowdown between an efficient interpreter and an inefficient one is 1:100. Consequently, there seems to be a lot of optimization potential (though I admit that, depending on the kind of interpreter, it might be unlikely to stay within a slowdown of 1:10).
While this is true, it's important to note that you need to know a lot about how a modern processor works to do this.
I once looked at the possibility of optimizing the bytecode dispatcher for a JavaScript engine (Opera's old one), and my 386 assembly skills turned out to be far outdated: removing one seemingly useless instruction of the ~10 core assembly instructions actually slowed the whole thing down. (Unfortunately, I have forgotten the details.)
It seems to me that this is more of a complaint leveled at C-ish compilers. One could envision a language that makes the sort of structures that baffle a normal compiler more visible and amenable to optimization. I'm not sure if some syntax and some clever semantics could address all these problems, but...
When I was first learning C formally, I had this idea that C was the "fastest language". My instructor said, day 1, that the problem with C was that it was not amenable to optimization. It was pretty surprising, but as time goes on and languages, compilers, and interpreters get better I think we'll see this sort of problem more and more.
// One could envision a language that makes the sort of structures that baffle a normal compiler more visible and amenable to optimization. //
Yes, the problem is that it starts to look like assembly at that point. With a whole bunch of notes on each variable and operation to let the compiler know about whether you care about specific aspects of the result of the operation. (See the 'restrict' keyword in C for an example of what I mean)
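For instance, a sketch of what restrict buys you:

/* 'restrict' promises the compiler that dst and src never overlap, so it
   can keep values in registers and vectorize the loop instead of
   re-reading memory after every store. */
void scale(float *restrict dst, const float *restrict src, float k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}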
That sounds like PL/X [1] to me. That language will let you do things that REALLY aren't advisable on most systems, like specify pointer sizes (per variable), specify address mode (arbitrary blocks, functions, nested functions, per module, whatever), and specify registers for any variable. Oh, and my favourite: when calling a function, you can specify an arbitrary global label for it to return to.
Or of course you can write as cleanly as possible, let the compiler do the lion's share of the work, and fix up what you need to.
"On a TriMedia VLIW with a load latency of three cycles and a jump latency of four cycles, the interpreter achieves a peak performance of four cycles per instruction and a sustained performance of 6.27 cycles per instruction. Experiments are described that demonstrate the compression quality of the system and the execution speed of the pipelined interpreter; these were found to be about five times more compact than native TriMedia code and a slowdown of about eight times, respectively."
They used pragmas. They used loop pipelining (http://en.wikipedia.org/wiki/Software_pipelining) and compiler optimizations. Their results aren't that bad: they achieved CISC code density for an 8x slowdown.
I've always thought that the standard wisdom is that most developers can't beat the compiler by hand-writing most of their code in ASM, but in fields like game development, things start at a higher level (Lua, C++, C) and work their way down to increase performance as bottlenecks are found.
Compilers are like Google or Wikipedia, 99% of the time they know more than 99% of developers.
Also, about an hour before reading this I put a small functional programming library for Lua on GitHub (https://github.com/davidhollander/fn). Here's an example of how to create a sum function with fold left and an addition function:
require'fn'
sum = fn.partial( fn.foldl, function(a, b) return a+b end)
x = sum(1, 2, 3, 4, 5, 6, 7, 8, 9)
print(x)
-- 45
The referenced page is more interesting than what the title promises. The title is trivially true; the referenced page makes the stronger claim that a compiler cannot _tie_ with hand-coded assembly.
It's a very old story: hand-coded assembler can commonly run several times faster than compiled code, and faster still when using special instructions on special data. But on large programs, no, because some of what compilers and runtime libraries do for large programs is too difficult to do by hand. Indeed, some of what compilers do in register allocation (and maybe cache usage) will be too difficult for hand-coded assembler.
More generally, the execution time of a small, single-threaded program is now nearly never of much concern. E.g., when you have a processor with dozens of cores, it's easy to let a small, single-threaded program have a core and then forget about it.
Execution time remains important, but now mostly just for large, multi-threaded programs.