which is still unimaginably terrible. Why isn't it re-using values that have already been loaded into registers? Why isn't it using a register for pRet? I've even told it to! Although I think it's documented that the MS compiler ignores the 'register' keyword.
And this is with all optimisations turned on. How depressing.
Most compilers completely ignore "register" these days. I believe also that aliasing the stack when you're performing the actual operation is greatly hindering the compiler's ability to optimize.
It does look as though no optimisation is being performed at all.
I just isolated the use of the BINARY_OP macro and put it in a simple-ish test function. Now it's being optimized excellently.
The apparently unoptimisable code is in a hugely long and complex function, and I wonder if something in it is preventing any optimisation from occurring within that function. I've quickly looked through the assembly produced for the function and all of it appears unoptimised, whereas code in other functions is optimised OK.
What can prevent all optimisation from occurring in a function?
How long is long? Is this a code-generated function? I've seen that some compilers have limits where they turn off optimization due to throughput issues.
You could try breaking the function up into pieces and seeing if it still doesn't optimize.
The function starts at line 232, and the disassembly I was looking at is from line 1852.
This is the function that implements the direct-threaded interpreter. Direct-threading works by using gotos (jmps) to dispatch the next instruction to be interpreted, which means that I don't think it can be broken up into multiple smaller functions.
It's also not immediately obvious to me that pCurEvalStack is initialized.
Having had a look at the architecture optimization manual re: the redundant loads, it's not immediately obvious to me that modern processors won't handle this fine (http://www.intel.com/Assets/PDF/manual/248966.pdf).
And also the nice thing about the BINARY_OP() macro is that it can take types and the operator as arguments (e.g. BINARY_OP(I32, I32, I32, +), BINARY_OP(I64, I64, I32, <<)), meaning it can be used for many operations on many types. I would have to write separate functions for each case if I used an inline function.
And pCurEvalStack is initialised in the LOAD_METHOD_STATE macro, referenced first on line 590, before interpretation begins.
That's definitely the problem: compiler optimizations are easily O(N^2) on the # of operations in a function, so rather than taking forever to compile, they'll turn off optimization if it takes too long.
Ask yourself, do you even need a function that large? You might already be approaching the limits of the L1 cache, at which point you might as well be using separate functions, since the cost of the call itself will be negligible compared to the cache misses.
I don't really know Microsoft's C compiler well enough to answer why your specific example fails so spectacularly, but I can make a guess: You are using inline assembly containing a computed jump instruction. In this situation the compiler no longer has a complete view of the control flow graph of your function, and it is likely that all optimizations which need a control flow graph will be disabled. Common subexpression elimination needs a CFG since it determines if two computations are the same for all paths in the program.
Now I'm reasonably certain that the C standard does not allow inline assembly like this. However, given the decades of legacy software that Microsoft has to contend with I would not be too surprised if this was part of the reason.
By the way, GCC's computed gotos are not actually much better. In this case you get a control flow graph where you cannot split all critical edges. There is a workaround for this, but it's probably easier to just disable all optimizations that depend on it.
I've tested that inline assembler performing the computed goto in a smaller test function, and optimisation is performed.
And almost all of the C code that I would like to be optimised is in short, fairly self-contained statements that use their own local variables, so I would expect each of them to be optimisable in isolation. Again, I've tested this in a smaller function and good optimisation is done.
Therefore I am coming to the conclusion that the problem may just be the length of the function.
I don't have time to do it now, but next week I'll experiment with function length, although as I stated earlier - I'm fairly sure that a threaded interpreter has to be all implemented within a single function.