
Frustratingly, I just changed the macro to this:

  #define BINARY_OP(returnType, type1, type2, op) \
  { \
  	register PTR pRet = pCurEvalStack - sizeof(type1) - sizeof(type2); \
  	pCurEvalStack = pRet + sizeof(returnType); \
  	*(returnType*)pRet = *(type1*)pRet op *(type2*)(pRet + sizeof(type1)); \
  }
And the assembly generated is this:

  	BINARY_OP(I32, I32, I32, +);
  004117BA  mov         eax,dword ptr [pCurEvalStack] 
  004117BD  sub         eax,8 
  004117C0  mov         dword ptr [pRet],eax 
  004117C6  mov         ecx,dword ptr [pRet] 
  004117CC  add         ecx,4 
  004117CF  mov         dword ptr [pCurEvalStack],ecx 
  004117D2  mov         edx,dword ptr [pRet] 
  004117D8  mov         eax,dword ptr [edx] 
  004117DA  mov         ecx,dword ptr [pRet] 
  004117E0  add         eax,dword ptr [ecx+4] 
  004117E3  mov         edx,dword ptr [pRet] 
  004117E9  mov         dword ptr [edx],eax 
which is still unimaginably terrible. Why isn't it re-using values that have already been loaded into registers? Why isn't it using a register for pRet? I've even told it to! Although I think it's documented that the MS compiler ignores the 'register' keyword.

And this is with all optimisations turned on. How depressing.



Most compilers completely ignore "register" these days. I believe also that aliasing the stack when you're performing the actual operation is greatly hindering the compiler's ability to optimize.


It does look as though no optimisation is being performed at all.

I just isolated the use of the BINARY_OP macro and put it in a simple-ish test function. Now it's being optimised excellently.

The apparently unoptimisable code is in a hugely long and complex function, and I wonder if something in it is preventing all optimisation from occurring within that function. I've quickly looked through the assembly produced in the function and all of it appears unoptimised, whereas code in other functions is optimised ok.

What can prevent all optimisation from occurring in a function?


How long is long? Is this a code-generated function? I've seen that some compilers have limits where they turn off optimization due to compile-time throughput concerns.

You could try breaking the function up into pieces and seeing whether it still doesn't optimize.


The function is just over 2900 lines long. Every line crafted lovingly by hand.

The whole source file is here: http://pastebin.com/9L8N3AVF

The function starts at line 232, and the disassembly I was looking at is from line 1852.

This is the function that implements the direct-threaded interpreter. Direct threading works by using gotos (jmps) to dispatch the next instruction to be interpreted, which means that I don't think it can be broken up into multiple smaller functions.

Please let me know if you think I'm wrong :)


What if you made BINARY_OP a __forceinline (http://msdn.microsoft.com/en-us/library/z8y1yy88.aspx) function instead of a macro?

Things that the function contains that might disable optimization: a lot of gotos, inline assembly in GO_NEXT (this does affect optimization: http://msdn.microsoft.com/en-us/library/5hd5ywk0.aspx)...

It's also not immediately obvious to me that pCurEvalStack is initialized.

Having had a look at the architecture optimization manual regarding the redundant loads, it's not immediately obvious to me that modern processors won't handle this fine (http://www.intel.com/Assets/PDF/manual/248966.pdf).


Thanks for the idea. Unfortunately it doesn't improve things:

  	pCurEvalStack = TestAdd(pCurEvalStack);
  0040BFFA  mov         eax,dword ptr [pCurEvalStack] 
  0040BFFD  mov         dword ptr [ebp-694h],eax 
  0040C003  mov         ecx,dword ptr [ebp-694h] 
  0040C009  sub         ecx,8 
  0040C00C  mov         dword ptr [ebp-690h],ecx 
  0040C012  mov         edx,dword ptr [ebp-694h] 
  0040C018  sub         edx,4 
  0040C01B  mov         dword ptr [ebp-694h],edx 
  0040C021  mov         eax,dword ptr [ebp-690h] 
  0040C027  mov         ecx,dword ptr [eax] 
  0040C029  mov         edx,dword ptr [ebp-694h] 
  0040C02F  add         ecx,dword ptr [edx] 
  0040C031  mov         eax,dword ptr [ebp-690h] 
  0040C037  mov         dword ptr [eax],ecx 
  0040C039  mov         ecx,dword ptr [ebp-694h] 
  0040C03F  mov         dword ptr [pCurEvalStack],ecx 
And also the nice thing about the BINARY_OP() macro is that it can take types and the operator as arguments (e.g. BINARY_OP(I32, I32, I32, +), BINARY_OP(I64, I64, I32, <<)), meaning it can be used for many operations on many types. I would have to write separate functions for each case if using an inline function.

And pCurEvalStack is initialised in the LOAD_METHOD_STATE macro, referenced first on line 590, before interpretation begins.


Looks like I was wrong about the redundant loads. They should probably go away (and indeed you say they do if you put BINARY_OP in its own function).


That's the longest handwritten function I think I've ever seen :-)

I don't know what limits the compiler imposes, but I wouldn't be surprised if you hit them.


That's definitely the problem: many compiler optimizations are easily O(N^2) in the number of operations in a function, so rather than taking forever to compile, compilers will turn off optimization when it takes too long.

Ask yourself, do you even need a function that large? You might already be approaching the limits of the L1 cache, at which point you might as well be using separate functions, since the cost of the call itself will be negligible compared to the cache misses.


I don't really know Microsoft's C compiler well enough to answer why your specific example fails so spectacularly, but I can make a guess: You are using inline assembly containing a computed jump instruction. In this situation the compiler no longer has a complete view of the control flow graph of your function, and it is likely that all optimizations which need a control flow graph will be disabled. Common subexpression elimination needs a CFG since it determines if two computations are the same for all paths in the program.

Now, inline assembly like this is not part of the C standard. However, given the decades of legacy software that Microsoft has to contend with, I would not be too surprised if this was part of the reason.

By the way, GCC's computed gotos are not actually much better. In this case you get a control flow graph where you cannot split all critical edges. There is a workaround for this, but it's probably easier to just disable all optimizations that depend on it.


I've tested the inline assembler that performs the computed goto in a smaller test function, and optimisation is performed.

And almost all of the C code that I would like to be optimised is in short, fairly self-contained statements that use their own local variables, so I would expect it to be optimisable. Again, I've tested this in a smaller function and good optimisation is done.

Therefore I am coming to the conclusion that the problem may just be the length of the function.

I don't have time to do it now, but next week I'll experiment with function length, although as I stated earlier, I'm fairly sure that a threaded interpreter has to be implemented entirely within a single function.


Some of those loads are unexpected. Can you verify that /O2 optimization is on (since release builds can, in theory, have optimizations off)?


Yes - that was one of the first things I checked. I've tried it with both /O2 and /Ox.

Same result both times.


If you have the time to play with it some more ... I wonder, does gcc or clang or icc do a better job?


Unfortunately I don't have any of these compilers ready to use at the moment, but I may try some of them in a week or two.


ICC is pretty spectacular (and sometimes awesome!) at optimization.



