Pike and Winterbottom (http://www.vitanuova.com/inferno/papers/hotchips.html) used a register-based VM for Plan9/Inferno because it's much faster and easier to write a high-quality JIT from a register-based VM to a register-based CPU (and all real CPUs are register-based) than from a stack-based byte-code VM to a register-based CPU.
This is probably the same reason Android chose register-based VM design for their Java-based platform.
Jitting is not free - an increase in JIT complexity is paid for in lower performance while the code is being jitted. Java's HotSpot does wonders, but it incurs significant latency and memory use to get the really high-quality results.
It makes sense to do as much work as possible where you have time and CPU cycles to spare (during compilation to VM instruction set) and not on the device when every millisecond and every byte matters.
Lower complexity of stack-based VM is a moot point - when you're jitting, you're doing the hard stuff anyway.
I don't follow the reasoning. Tracking the stack depth to convert into the equivalent of Pike and Winterbottom's VM registers takes at most one instruction at JIT time per stack-VM instruction (to increment or decrement the depth). If you want to distinguish different types at the same stack offset, that might be one more instruction. Pike and Winterbottom seem to be saying their VM has an infinite number of registers, so the frontend is simple, and then the JIT backend has a choice of how hard to work at register allocation. But the same goes for a stack machine, with a very small overhead at JIT time, nothing like the expense of HotSpot.
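To make the depth-tracking argument concrete, here is a minimal sketch (with hypothetical opcodes, not any real VM's instruction set) of how a JIT frontend can map stack slots onto virtual registers: the slot at depth d simply becomes virtual register r{d}, at the cost of one counter update per instruction.

```python
def stack_to_registers(code):
    """Translate (op, arg) stack-VM instructions into three-address
    virtual-register instructions by tracking the stack depth."""
    depth = 0
    out = []
    for op, arg in code:
        if op == "PUSH":               # push a constant
            out.append(("MOV", f"r{depth}", arg))
            depth += 1
        elif op in ("ADD", "MUL"):     # binary op: pops two, pushes one
            depth -= 1
            out.append((op, f"r{depth-1}", f"r{depth-1}", f"r{depth}"))
        elif op == "POP":
            depth -= 1
    return out

# PUSH 2; PUSH 3; ADD  becomes  MOV r0,2; MOV r1,3; ADD r0,r0,r1
print(stack_to_registers([("PUSH", 2), ("PUSH", 3), ("ADD", None)]))
```

After this pass the backend is in the same position as with a register VM: an unbounded virtual register set, and a free choice of how hard to work at allocating it to machine registers.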
The argument that does make sense to me is, if you want a somewhat faster interpreter at some expense in code size, then a register VM wins. (Perhaps by not as much as commonly thought, because of optimizations like stack caching and superinstructions: http://savannah.gnu.org/projects/vmgen/ But it does seem to win still.) This has the disadvantage of having to implement spilling or some other such complication in the frontend when the number of virtual registers needed exceeds 256 or whatever.
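For readers unfamiliar with superinstructions: the idea is to fuse common opcode pairs into a single handler so the interpreter pays dispatch overhead once instead of twice. A rough illustration (opcodes hypothetical, not vmgen's actual API) as a peephole pass over bytecode:

```python
def peephole_fuse(code):
    """Fuse adjacent (LOAD x, ADD) pairs into one LOAD_ADD superinstruction,
    halving dispatch cost for that common sequence."""
    out, i = [], 0
    while i < len(code):
        if i + 1 < len(code) and code[i][0] == "LOAD" and code[i + 1][0] == "ADD":
            out.append(("LOAD_ADD", code[i][1]))  # fused superinstruction
            i += 2
        else:
            out.append(code[i])
            i += 1
    return out

print(peephole_fuse([("LOAD", 0), ("ADD",), ("RET",)]))
```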
Well, we are both just appealing to experts here (unless you are in fact one. I am certainly not - just an enthusiast).
What did you think of the last part of Eric Lippert's post?
The code is much smaller and much easier to understand. A stack machine is a very simple way to describe a complex computation; by being able to write code for such a simple machine, it lowers the cost of making a compiler. And not only is it easier to write compilers and jitters that target simple stack machines, it is easier to write other code analysis tools as well. The IL verifier, for example, can quickly determine when there is a code path through a method that, say, misaligns the stack, or passes the wrong types of arguments to a method.
The stack machine may be conceptually simple, but here's the thing:
Either you have a way to address an offset into the stack, or you don't. If you do, the stack is conceptually equivalent to a register file where you can optionally swap out the entire content in one go (by changing the stack pointer).
If you don't, then registers will in most designs offer you a strict superset of operations: I've yet to see any register-based machines where you can't load things into registers, do the operation and push them back on the stack. In fact, on most machines that are not purist RISC designs, you will tend to be able to do many/most operations with one of the operands in memory.
That superset comes at a complexity cost in some areas, sure, but being able to address more different data items without having to manipulate the stack also turns out to be very handy. If it wasn't, we'd just stick to mostly manipulating the stack even in register-based designs.
EDIT: In fact, you'll often see "naive" first code generators for register-based machines heavily rely on stack manipulation for expression evaluation because it massively reduces the need for register allocation complexity.
Also, it's worth noting that computation stack and execution stack need not be the same thing (and, indeed, often aren't).
In CLR, for example, local variables are referenced by index on the execution stack, but you cannot dynamically push new variables onto it - you have to predeclare the locals at the beginning of your execution frame, and the only way to push a new execution frame is by calling another function. On the other hand, opcodes like ADD operate on the computation stack, which is an abstraction completely separate from, and not affected by, the execution stack (every execution frame has its own independent computation stack; there's no way to peek across the boundary).
So in reality it's a bit more complicated - those locals are also kinda like named (or indexed) registers (at least until you take their address). Then the stack adds more temp registers on top of that. In this model, you could always take any register-based code and translate it to stack-based with a very blunt transform, where something like "ADD x, y, z" becomes "LOAD x; LOAD y; ADD; STORE z". But in addition to that, you also have the flexibility of doing a bunch of operations directly on the temp stack, without the intervening stores and loads that are unavoidable in the register representation. So I would argue that it's the stack representation that is a superset of the register one, not the other way around.
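The "blunt transform" above can be sketched in a few lines - each three-address instruction (here taking the two sources first, then the destination, purely for illustration) expands into two loads, the operation, and a store:

```python
def register_to_stack(instrs):
    """Expand (op, src1, src2, dst) register instructions into the
    equivalent stack-machine sequence: LOAD; LOAD; op; STORE."""
    out = []
    for op, a, b, dst in instrs:
        out += [("LOAD", a), ("LOAD", b), (op,), ("STORE", dst)]
    return out

# "ADD x, y, z" becomes LOAD x; LOAD y; ADD; STORE z
print(register_to_stack([("ADD", "x", "y", "z")]))
```

The reverse direction is where the claimed superset shows up: sequences that keep intermediate results on the stack have no named destination at all, so a register encoding is forced to invent temporaries for them.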
An even shorter way to say this is that stack-based VMs are more high-level. With a register-based VM, you shift some important questions (like register allocation, and handling temporaries) to the compilers. With a stack-based VM, the compiler can be simpler, but the JIT will need to be correspondingly more complicated.
Which one to choose depends on how it is all used. For .NET, multi-language targeting was seen as an important design goal, and so making the bytecode easier to target was prioritized, with all the complexity relegated to JIT.
Long-term, I think it was the right solution in general, because most platforms seem to be moving from JIT to AOT; and with AOT, having the complexity in the bytecode-to-native compiler still has the upside of DRY, and slower compile times stop being a downside.
However, if you leave register allocation to the source->VM compiler and don't do it in the VM->machine compiler (JIT), you cannot perform machine-specific register allocation (8 registers on IA-32, 16 on AMD64 and ARM, 32 on Aarch64). So register-based VMs typically work with large register sets, and JITs allocate them to the smaller register sets of real machines.
I have not read the present paper yet, but earlier papers I have read were about reducing interpretation overhead by performing fewer VM instructions (if VM instruction dispatch is expensive; dynamic superinstructions make it cheap, but not everyone implements them).
Note that the JIT mechanism mentioned in the paper basically meant compiling from program source code to the bytecode executed in the VM, and then directly executing that (very much like PHP), not necessarily compiling all the way to machine code.
Interesting expressions usually use local variables. JVM, CLR etc. despite being nominally stack-oriented, use local variable slots by index rather than shuffling the stack for anything much more complicated than eager expression evaluation.
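A toy sketch of that JVM/CLR-style model - locals are indexed slots declared up front, and arithmetic happens on a separate computation stack that never needs to be "shuffled" to reach a variable (names and opcodes are illustrative, not real JVM/CLR bytecode):

```python
def run(code, nlocals):
    """Interpret a tiny locals-plus-computation-stack bytecode."""
    slots = [0] * nlocals   # predeclared local variable slots, by index
    stack = []              # computation stack for expression evaluation
    for op, *args in code:
        if op == "CONST":
            stack.append(args[0])
        elif op == "LOAD":              # read local slot onto the stack
            stack.append(slots[args[0]])
        elif op == "STORE":             # pop the stack into a local slot
            slots[args[0]] = stack.pop()
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return slots

# local0 = 2; local1 = 3; local2 = local0 + local1
prog = [("CONST", 2), ("STORE", 0), ("CONST", 3), ("STORE", 1),
        ("LOAD", 0), ("LOAD", 1), ("ADD",), ("STORE", 2)]
print(run(prog, 3))
```

Note how the stack is only ever touched at its top for eager expression evaluation, while anything longer-lived goes through the indexed slots - which is exactly the "locals as registers" point made above.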