which is still unimaginably terrible. Why isn't it re-using values that have already been loaded into registers? Why isn't it using a register for pRet? I've even told it to! Although I think it's documented that the MS compiler ignores the 'register' keyword.
And this is with all optimisations turned on. How depressing.
Most compilers completely ignore "register" these days. I believe also that aliasing the stack when you're performing the actual operation is greatly hindering the compiler's ability to optimize.
It does look as though no optimisation is being performed at all.
I just isolated the use of the BINARY_OP macro and put it in a simple-ish test function. Now it's being optimized excellently.
The apparently unoptimisable code is in a hugely long and complex function, and I wonder if something in it is preventing any optimisation from occurring within that function. I've quickly looked through the assembly produced for the function and all of it appears unoptimised, whereas code in other functions is optimised OK.
What can prevent all optimisation from occurring in a function?
How long is long? Is this a code-generated function? I've seen that some compilers have limits where they turn off optimization due to throughput issues.
You could try breaking the function up into pieces and seeing if it still doesn't optimize.
The function starts at line 232, and the disassembly I was looking at is from line 1852.
This is the function that implements the direct-threaded interpreter. Direct-threading works by using gotos (jmps) to dispatch the next instruction to be interpreted, which means that I don't think it can be broken up into multiple smaller functions.
It's also not immediately obvious to me that pCurEvalStack is initialized.
Having had a look at the architecture optimization manual re: the redundant loads, it's not immediately obvious to me that modern processors won't handle this fine (http://www.intel.com/Assets/PDF/manual/248966.pdf).
And also the nice thing about the BINARY_OP() macro is that it can take types and the operator as arguments (e.g. BINARY_OP(I32, I32, I32, +), BINARY_OP(I64, I64, I32, <<)), meaning it can be used for many operations on many types. I would have to write separate functions for each case if I used an inline function.
And pCurEvalStack is initialised in the LOAD_METHOD_STATE macro, referenced first on line 590, before interpretation begins.
That's definitely the problem: compiler optimizations are easily O(N^2) on the # of operations in a function, so rather than taking forever to compile, they'll turn off optimization if it takes too long.
Ask yourself, do you even need a function that large? You might already be approaching the limits of the L1 cache, at which point you might as well be using separate functions, since the cost of the call itself will be negligible compared to the cache misses.
I don't really know Microsoft's C compiler well enough to answer why your specific example fails so spectacularly, but I can make a guess: You are using inline assembly containing a computed jump instruction. In this situation the compiler no longer has a complete view of the control flow graph of your function, and it is likely that all optimizations which need a control flow graph will be disabled. Common subexpression elimination needs a CFG since it determines if two computations are the same for all paths in the program.
Now I'm reasonably certain that the C standard does not allow inline assembly like this. However, given the decades of legacy software that Microsoft has to contend with I would not be too surprised if this was part of the reason.
By the way, GCC's computed gotos are not actually much better. In this case you get a control flow graph where you cannot split all critical edges. There is a workaround for this, but it's probably easier to just disable all optimizations that depend on it.
I've tested that inline assembler performing the computed goto in a smaller test function, and optimisation is performed.
And almost all of the C code that I would like to be optimised is in short, fairly self-contained statements that use their own local variables, so I would expect each of them to be optimisable in isolation. Again, I've tested this in a smaller function and good optimisation is done.
Therefore I am coming to the conclusion that the problem may just be the length of the function.
I don't have time to do it now, but next week I'll experiment with function length, although as I stated earlier - I'm fairly sure that a threaded interpreter has to be all implemented within a single function.