Update: I changed it to a switch statement. The euler1 program went from 0.081s to 0.091s with gcc on a core i3.
The cascading if statements are indeed faster than a switch.
EOU - original comment follows…
Without the benefit of profiling, I can suggest it may not be as bad as it looks.
The relative frequency of the opcodes could make this faster than the switch. See MOV, PUSH, POP in the front? These likely have tiny, inline implementations and are also probably a large percentage of the opcodes executed. They may all fit in the same cache line and give the pipelines on deep-pipeline, branch-predicting machines lots of good stuff to work on.
Likewise, the infrequently occurring opcode comparisons in the back part of the statement, by definition, are almost always false, which should tell the branch predictors which way to go to again keep the pipelines full.
A switch on the other hand is pretty much a guaranteed pipeline flush.
But… if you asked me to code this control structure without being able to profile and to get it fast the first time… I'd use a switch. (well, maybe with a couple if statements out front if I knew I had a heavily lopsided distribution.)
Then I'd go reread that article that came by HN a couple weeks ago, turn the control structure inside out, and see how much better it was.
It's satisfying that you were able to say "There's a good chance you're wrong, here's why" and then be confirmed by testing. You obviously have a great deal of experience with low-level programming; relatively rare, nowadays. Mine comes from graphics programming.
Also, which article are you referring to? Any keywords I can search for?
But after looking at tinyvm it may be a special animal. The native implementations of its virtual opcodes are only one or two instructions. Locality, caching, and pipelining are working at their finest here, and it might be best to just let them do their thing.
I was quite curious about this, so I looked into the code (using GCC 4.4.5 on a 32-bit VM). Switch statements can be optimized in 2 ways: a binary search (log n checks/jumps) or a jump table (no checks and 2 jumps). Usually the jump table is used for dense cases and the binary search for sparse cases.
I will post the disassembly of the code as-is first. Then the same exact code but using a switch statement (change all the if/else into a 'case' and nothing else).
// loop header (break out of loop if we hit END)
1a: 8b 04 99 mov eax,DWORD PTR [ecx+ebx*4]
1d: 83 f8 ff cmp eax,0xffffffff
20: 0f 84 62 01 00 00 je 188 <run_vm+0x188>
26: 89 fa mov edx,edi
// inside of the loop begins here (there's a jmp 28 instruction at the bottom that I omitted here)
28: 8b 52 0c mov edx,DWORD PTR [edx+0xc]
// now edx == &vm->pProgram->args
// FAIL: vm->pProgram->args is loop invariant and shouldn't be reloaded every iteration
// this is either a failure to prove the load was redundant or a failure to give it its own register without spilling
2b: 83 f8 01 cmp eax,0x1
// why is this here? instruction scheduling magic probably.
2e: 8b 14 32 mov edx,DWORD PTR [edx+esi*1]
// esi is the instr_idx
// new edx is vm->pProgram->args[instr_idx]
31: 8b 32 mov esi,DWORD PTR [edx]
// esi is arg0 == vm->pProgram->args[instr_idx][0]
33: 8b 52 04 mov edx,DWORD PTR [edx+0x4]
// edx is arg1 == vm->pProgram->args[instr_idx][1]
36: 89 55 d4 mov DWORD PTR [ebp-0x2c],edx
// FAIL: spill vm->pProgram->args[instr_idx] to stack.
// why is this a fail? it's already on the stack, so we can reconstruct it slightly slower without spilling
39: 0f 84 01 01 00 00 je 140 <run_vm+0x140>
// finally the je for the cmp
// CMP/JE pairs for ELSE IFs follow until end
3f: 83 f8 02 cmp eax,0x2
42: 0f 84 48 01 00 00 je 190 <run_vm+0x190>
48: 83 f8 03 cmp eax,0x3
4b: 0f 84 5f 01 00 00 je 1b0 <run_vm+0x1b0>
51: 83 f8 04 cmp eax,0x4
54: 0f 84 0e 01 00 00 je 168 <run_vm+0x168>
5a: 83 f8 05 cmp eax,0x5
5d: 8d 76 00 lea esi,[esi+0x0]
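// (lea esi,[esi+0x0] is a multi-byte nop, presumably padding for alignment/scheduling)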
60: 0f 84 12 01 00 00 je 178 <run_vm+0x178>
66: 83 f8 06 cmp eax,0x6
69: 0f 84 61 01 00 00 je 1d0 <run_vm+0x1d0>
6f: 83 f8 07 cmp eax,0x7
72: 0f 84 70 01 00 00 je 1e8 <run_vm+0x1e8>
78: 83 f8 08 cmp eax,0x8
eax is the bytecode for the current instruction (vm->pProgram->instr[instr_idx]). Notice that for the first opcode it inlines the code (the MOV instruction). The rest of the time it's a cmp/jcc pair (I omitted the rest of the asm for clarity, but it's cmp/jcc all the way down).
So that alone means a MOV bytecode gets dispatched almost right away. Then for everything else (PUSH, POP, etc.) you have to go through a linearly increasing number of cmp/jcc pairs. So in this case, if the probability distribution of your interpreted program meant that most opcodes were MOV, PUSH, or POP, you could get pretty good performance. But once it goes through a few cmp/jcc pairs you have lost the performance benefit of using if/else, and it will get slower than using a switch.
There are also a few potential fails here (look at my FAIL comments). The value for &vm->pProgram->args gets recalculated every loop iteration, this is probably a register allocation fail (not enough registers on x86). And vm->pProgram->args[instr_idx] gets spilled to stack. That's really annoying and probably unnecessary.
---------
Now let's take a look at the switch statement! I was curious to see if it would generate a binary search table or a jump table. In theory it should be the latter since the opcodes are 0-25 inclusive (with -1 for END, which is actually 0xFFFFFFFF and is not easily a candidate to become a jump table member).
//loop header
1a: 8b 04 99 mov eax,DWORD PTR [ecx+ebx*4]
1d: 83 f8 ff cmp eax,0xffffffff
20: 74 48 je 6a <run_vm+0x6a>
22: 8d b6 00 00 00 00 lea esi,[esi+0x0]
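// (multi-byte nop, presumably aligning the loop entry)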
// loop beginning (there's an omitted jmp 28 lower on)
28: 8b 7e 0c mov edi,DWORD PTR [esi+0xc]
// now edi == &vm->pProgram->args
// FAIL: vm->pProgram->args is loop invariant and shouldn't be reloaded every iteration
// this is either a failure to prove the load was redundant or a failure to give it its own register without spilling
2b: 83 f8 19 cmp eax,0x19
// compare eax to 25 (JLE)
// compiler was unable to prove that the instruction is in the range [-1, 25]. this will always cause a >25 check every loop iteration (bad)
// i wanted to try adding a __builtin_unreachable() as a default switch case to fix this, unfortunately i don't have gcc 4.5
// also maybe doing a switch on eax & 0x20 so it knows its less than 32, 32 might be small enough to generate 7 blank jump table entries
2e: 8b 14 17 mov edx,DWORD PTR [edi+edx*1]
// new edx is vm->pProgram->args[instr_idx]
31: 8b 3a mov edi,DWORD PTR [edx]
// arg0
33: 8b 52 04 mov edx,DWORD PTR [edx+0x4]
// arg1
36: 89 55 d4 mov DWORD PTR [ebp-0x2c],edx
// FAIL: spill vm->pProgram->args[instr_idx] to stack.
// why is this a fail? we can always get it from [esi+0xc] like at the beginning of the loop
39: 77 1d ja 58 <run_vm+0x58>
// aha, that's what the cmp eax, 0x19 was for previously... refer to previous comment
3b: ff 24 85 00 00 00 00 jmp DWORD PTR [eax*4+0x0]
// jmp table badassery. but why did it take so long to get here?
// this is a JMP r/m32 (jump near, absolute indirect)
// the 0x0 gets set to a jump table offset by the linker later (i decompiled the .o file only)
// the jmp table itself will look like JMP address-back-into-run_vm-as-a-constant
This is basically the same code as before, except that there's an additional cmp/ja instruction pair to check that the instruction code is in the [0,25] range (and of course the jmp to the jump table instead of the if/else pairs). That's a per-iteration cost for a branch that will never be taken, and it ruins the branch prediction of the first few iterations.
Also the indirect jmp does make it a bit harder to do instruction prefetching, but this should be fairly straightforward since the value of 'eax' is known from the beginning of the loop, a good 8 instructions before the jump (and the prefetching is trivial for the direct jumps). Another consideration is the L1 i-cache, which is 32K on a Nehalem. A quick google search for the micro-ops cache reveals it to be 12K. Since the loop is fairly tiny, both of these easily fit into those caches with probably at least an order of magnitude to spare.
Supposing that the redundant cmp/ja is removed, we are now on a more even field. It should be faster for pretty much anything except the first case (instruction is a MOV).
jmp [indirect abs]
jmp direct abs
vs
cmp eax, 0x0 // is EAX a mov?
jg somewhere_else // skip the following section if it's not a MOV
The second sequence will always be faster if we're always hitting MOVs; don't ask me to prove this, but it's a gut feeling that there are fewer micro-ops involved in the second case. The break-even point will probably be when the instructions are always PUSH (1) or POP (2). But after that the jump table will always be faster. So I think overall, unless all the instructions are MOV/PUSH/POP (haha, unlikely), the jmp table solution should be faster with the cmp/ja removed.
That being said, I think I've spent too much time on this as-is and will not be trying to remove the cmp/ja.
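For reference, that __builtin_unreachable() hint would look something like this (untested, since as I said I don't have GCC 4.5):

switch(vm->pProgram->instr[instr_idx]) {
case MOV: arg0 = arg1; break;
/* ... the other 25 opcodes ... */
default:
    __builtin_unreachable(); /* promises the opcode always hits a case, so GCC can drop the cmp/ja bounds check */
}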
-----------
Summary:
GCC doesn't generate the best code, and there are some things that are hard for even the compiler to optimize. Sometimes you have to massage the compiler into doing what you want, and that doesn't mean the theoretical case for preferring switches over chained if/elses is flawed.
It's my gut feeling that if we can get the optimizer in this case to work slightly better, the switch statement should always be faster than the if/elses.
It also would've been interesting to see the x86_64 disassembly, but of course then we'd have the downside of 64-bit operands everywhere, which leads to larger instruction encodings, and the upside of fewer register spills.
For the benefit of those who don't understand this, have a look at http://en.wikipedia.org/wiki/Threaded_code where it explains the different ways one can process byte codes.
The key issue here is that the efficiency of the dispatch loop determines the performance of a VM.
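One simple variant from that page is call threading, i.e. a function-pointer dispatch table. A minimal sketch (the names here are illustrative, not tinyvm's):

typedef struct vm vm_t;               /* opaque VM state */
typedef void (*op_fn)(vm_t *vm);

extern int fetch_opcode(vm_t *vm);    /* assumed: returns the next opcode and advances the pc */

static void op_mov(vm_t *vm)  { (void)vm; /* arg0 = arg1 */ }
static void op_push(vm_t *vm) { (void)vm; /* stack_push(...) */ }
/* ... one handler per opcode ... */

static const op_fn dispatch[] = { op_mov, op_push /* , ... */ };

void run(vm_t *vm)
{
    for (;;)                          /* a real END handler would break out via a flag */
        dispatch[fetch_opcode(vm)](vm);
}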
That is surprisingly bad. It reminds me of when I first heard that you could get fast sine/cosine approximations with table lookups -- not fully understanding this idea I wrote a program that generated the following (I was in high school, young and naive):
double sin(double x) {
    if (x >= 0.00 && x < 0.01) return 0;
    if (x >= 0.01 && x < 0.02) return 0.01;
    // ...
}
At least you wrote the program to generate it rather than typing all that by hand. I think you got the gist of the idea. In any case, would all those branches be faster than the comparatively expensive floating-point arithmetic?
No. At best you could do a binary search using ifs to simulate a lookup table, but you are going to trash the pipeline on a modern CPU. If you want a reasonably complex lookup table, you are much better off with something like:
float lookup_Sine(float x) {
    int i = (int)(100.0f * fmodf(x, 360.0f));  /* needs <math.h>; assumes x >= 0 */
    return sine_table[i];
}
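where sine_table would be filled once at startup, something like this (sizes chosen to match the 0.01-degree steps above):

#include <math.h>

#define TABLE_SIZE 36000                 /* 360 degrees at 0.01-degree resolution */
static float sine_table[TABLE_SIZE];

void init_sine_table(void)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        sine_table[i] = sinf((float)i * (float)M_PI / 18000.0f);  /* i/100 degrees, converted to radians */
}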
Anyway, for a modern CPU hitting main memory takes a lot of cycles. So you need something a lot more complex than a simple trig function to make it worth it.
Yes, that's very odd. At least they could have used a case statement, which a C compiler will be able to optimize into an efficient jump table.
I remember reading about a VM (I remember it as being Google's V8, but that one compiles directly to machine code so probably I'm misremembering; maybe it was Strongtalk?) which basically ditched the classic bytecode loop (grab next instruction, check what it is, evaluate it, repeat) and instead implemented each bytecode type as a function and somehow munged the instruction pointer or modified the VM's machine code itself so that the CPU would move directly to the next instruction's function without jumping. Damn, I wish I could remember the idea. Anyone know what I'm talking about?
You are probably talking about threaded code which is described in "The Structure and Performance of Efficient Interpreters"
http://www.jilp.org/vol5/v5paper12.pdf
Thanks, that's an interesting paper, but I hadn't heard of it; I'm thinking specifically of a VM that implemented the technique. Not sure if we are talking about the same algorithm, either, although it sounds similar.
That just makes the case for switch expressions even stronger. If you had those, this would be a damn simple switch statement and it'd provide lots of hints to the compiler for speedups. (All but the jump instructions that modify the program counter could go in a straight switch statement.)
I'm very tempted to add type tagging into this and write a Burroughs B5000 emulator. That'd be a neat side project.
Theoretically, it could do, at least in this case, I think? Very risky to rely on that though! I doubt compiler writers have people who haven't heard of the switch statement in mind when they decide which optimizations to put in.
(The "fast" branch uses a switch statement - I guess the author is going for minimum number of lines in this one? I'm not sure saving 2-3 lines compared to a squeezed-together switch statement is a great tradeoff, though. You could get better results by just putting everything in one source file. Easier deployment, too. I bet it wouldn't affect readability significantly.)
It can. However, this is a very uncommon pattern, so it's not worth trying very hard to detect. On top of that, I'd expect the compiler to bail after the pattern gets to a certain size.
This made me think back to a tiny virtual machine that took less than 300 lines -- in fact it took less than 500 bytes of code (thanks to the Woz): http://en.wikipedia.org/wiki/SWEET16
As a C programmer I find it quite odd/unnerving actually. It may be simple to understand its structure and it may look clean, but that isn't much to say of a project this size, especially if you know what it does.
I found the file division & include pattern quite strange actually, and don't get me started on the cascading "if/else if" for the instruction dispatch: not only is it almost impossible for the compiler to optimize, it's also more difficult to read than the obvious switch. Also, if the instructions were enums instead of defines, the compiler could even generate warnings when there are unimplemented opcodes (see the sketch below). I don't use github but I'm tempted to make a branch and a pull request because it's driving me crazy!
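For instance (a sketch, not tinyvm's actual definitions; gcc warns via -Wswitch, which -Wall enables):

enum opcode { MOV, PUSH, POP /* , ... */ };

void dispatch_one(enum opcode op)
{
    switch (op) {                 /* deliberately no default: */
    case MOV:  /* arg0 = arg1 */  break;
    case PUSH: /* stack_push() */ break;
    /* gcc: "warning: enumeration value 'POP' not handled in switch" */
    }
}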
EDIT: I was about to fork it when I noticed that in the other two branches (fast & simple, default is master) the dispatch is implemented with a switch. Please somebody explain to me why this wasn't moved to the default branch...
Really? How would dereferencing a NULL pointer cause "some really nasty crashes"?
Yes, it will crash. That is a Good Thing(tm).
1) It is extremely unlikely that malloc() will ever return NULL (on a PC). This is due to virtual memory. If you manage to allocate more than 1.5GB of memory, then yes. Otherwise no.
2) Even if malloc() does return NULL, then NULL checking it is pointless because program operation cannot realistically continue. You need that memory. The best and cleanest solution is to crash. You will get a call trace leading to the exact point of failure (which you won't get if you do NULL checks + continue program execution). Also, as far as I know, it's almost impossible for a dereferenced NULL pointer to be a security vulnerability, unlike e.g. a buffer overrun.
In summary, just let it crash. There isn't any reason not to, unless you're developing a "third-party library" (e.g. you're a Lua developer, etc) and you want to leave the decision of whether to crash to the programmers using your code.
Are you kidding? Most real software does something like allocating some emergency memory at startup time, and then handles a failing malloc call by using that reserve pool to save work, perhaps do a GC, and print a decent error message before quitting.
Losing all of the user's work and printing "Segmentation fault (core dumped)" is never the right answer.
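A rough sketch of that pattern (the names and pool size are illustrative; save_work_and_quit stands in for whatever your app's graceful-exit path is):

#include <stdlib.h>

static void *reserve;                 /* emergency pool, grabbed at startup */

extern void save_work_and_quit(void); /* hypothetical: GC, save, print a decent message */

void reserve_init(void) { reserve = malloc(1 << 20); }  /* hold back e.g. 1MB */

void *robust_malloc(size_t n)
{
    void *p = malloc(n);
    if (p == NULL) {
        free(reserve);                /* release the pool so the cleanup path can allocate */
        reserve = NULL;
        save_work_and_quit();
    }
    return p;
}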
The point is, the crash you describe will only occur on 32-bit operating systems, and only after you've allocated more than 1.5GB of memory. For most projects, that is extremely unlikely.
Your solution would work. But the extra code complexity, development time, and maintenance isn't worth it except in very, very specific scenarios, like if you're writing a third-party library such as Lua or FMOD or FreeType or... etc.
Here's an alternative solution that doesn't require NULL checking malloc, and provides all of the same benefits:
1) create a function (I call mine "Sys_Exit()") which allows you to cleanly exit your application from anywhere in your codebase. For example in my game engine, my main() function looks like:
int main()
{
    App_Startup();
    while ( App_Frame() ) {}
    App_Shutdown();
    return 0;
}
2) write a "my_malloc" function which forwards to malloc. Use it for all memory allocation. When it detects that more than 1.2GB of memory has been allocated, then it prints an error message and calls Sys_Exit().
This allows you to save the user's work, etc, and your shutdown sequence has a fairly large amount of remaining memory to work with (so that you don't truly run out of memory after you've fake-run-out-of-memory).
This is a very simple solution if you really care about the extremely unlikely out-of-memory crash.
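In code, the wrapper is just a few lines (a sketch; the counter and limit are the ones described above, and a real version would also hook free() to decrement the count):

#include <stdio.h>
#include <stdlib.h>

extern void Sys_Exit(void);                 /* runs the normal shutdown sequence */

#define SOFT_LIMIT (1200u * 1024u * 1024u)  /* the 1.2GB threshold */

static size_t total_allocated;

void *my_malloc(size_t n)
{
    total_allocated += n;
    if (total_allocated > SOFT_LIMIT) {
        fprintf(stderr, "out of memory\n");
        Sys_Exit();                         /* plenty of headroom left for saving the user's work */
    }
    return malloc(n);
}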
----
tl;dr: It is almost always a bad idea to NULL check malloc(). It tends to destroy readability, is error prone, and is hard to maintain, for very little practical benefit.
Even ignoring the possibility of out-of-memory crashes, null pointer bugs can lead to memory corruption and code execution and have been popular targets in recent years: http://www.google.com/search?sourceid=chrome&ie=UTF-8...
Edit: http://flashmypassion.blogspot.com/2008/04/this-new-vulnerab... is a copy of a blog post (which seems to be missing from the Matasano chargen; odd) about Mark Dowd's crazy Flash null-pointer vuln. The first comment from Dino Dai Zovi (another crazy good security researcher) is very relevant:
> Oh, the sweetsweet vindication. I remember an argument that I have had twice in the last several years about whether one should check the return value of malloc. I argued that it should always be done for two reasons: Reading address zero might crash, but not necessarily zero-plus-offset and because a compare register to zero is so free performance-wise that it isn’t even funny.
> Guess who was arguing the contrary
Edit #2: tptacek's responses on the thread are quite interesting, and I agree. These are your friends:
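(Presumably the classic x-prefixed allocation wrappers; a minimal xmalloc, for anyone who hasn't seen one:)

#include <stdio.h>
#include <stdlib.h>

/* like malloc, but never returns NULL */
void *xmalloc(size_t n)
{
    void *p = malloc(n);
    if (p == NULL) {
        fprintf(stderr, "out of memory (%zu bytes)\n", n);
        abort();                    /* fail loudly at the point of failure */
    }
    return p;
}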
Argh, this is so often overlooked. NULL POINTER DEREFERENCE IS A POSSIBLE SECURITY HOLE. Sorry about the shouting, I've gone through this so much I feel like my brain is going to explode. Here, read this: http://lwn.net/Articles/360328/
Don't trust NULL pointer dereference. Even in user code -- it puts you in bad habits.
Writing and using these functions is not "error prone" or "hard to maintain", and does not "destroy readability". It adds a single letter to each of these function calls and avoids writing any more error-handling code.
That's a completely reasonable solution, but it's not the one you took. It's not the path of least resistance, but when you're programming in C, writing robust code is very rarely the path of least resistance.
With regard to your nonsense about "will only ever happen on 32-bit OSes", I run programs under VM size ulimits every day. This prevents them from forcing the system into swap and becoming unresponsive, by causing their allocations to fail. Some of them (e.g. less) usually handle running out of memory quite well. I happen to be using 32-bit OSes, but that's irrelevant; I would do the same thing on 64-bit OSes.
A friend asked me if this was a good example to learn C from. I had to say no; in half an hour of looking at the code, I found seven bugs in a single 60-line routine. Details at:
Try opening it up in an editor/IDE with ctrl+click to step into a function (or similar), like gvim or Eclipse, and hopefully with a way to jump back. Eclipse also has a nice expand-macro feature which lets you look at macros fully expanded and at different expansion stages.
darn forgot to mention ctags.
If you are up for it, supposedly cscope is better, but I couldn't get it set up. I don't know of any vim plugin for Eclipse-like macro expansion, but if you see hairy macros you can always do gcc -E | vim - (gcc -E does full includes, so macro expansion + importing of all headers) and search for your code or jump to the bottom.
Looking at the instruction set it doesn't even support subroutines. It would only need a way to get and set the current program counter (like ARM) and you could implement subroutine calls using push, pop, and jmp.
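Baked directly into the dispatch, hypothetical CALL/RET opcodes would only be a few lines (a sketch in the style of the existing if/else chain; I'm assuming stack_pop can write through a pointer):

else if(vm->pProgram->instr[instr_idx] == CALL) {
    stack_push(vm->pStack, instr_idx + 1);  /* save the return address */
    instr_idx = arg0 - 1;                   /* jump to the subroutine (-1 because the loop increments) */
}
else if(vm->pProgram->instr[instr_idx] == RET) {
    int ret_addr;
    stack_pop(vm->pStack, &ret_addr);
    instr_idx = ret_addr - 1;               /* resume right after the CALL */
}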
This is probably more appropriate for the Github support forum than some random HN posting about a project on Github. Your post adds nothing to this discussion.
Ah, I love virtual machines/cpus :] Some weeks ago I found out that you can create pointers to goto labels in C (gcc and clang, that is; it's a non-standard extension afaik).
I then implemented a super small subset of the 6502 to test how messy the code would look:
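For anyone who hasn't seen the extension, here's a toy two-opcode dispatcher (not my 6502 code, just the bare mechanism):

#include <stdio.h>

int run(const int *code)
{
    static void *dispatch[] = { &&op_halt, &&op_print };  /* &&label: GCC's address-of-label */
    int pc = 0;

    goto *dispatch[code[pc]];

op_print:
    puts("hello from bytecode");
    ++pc;
    goto *dispatch[code[pc]];

op_halt:
    return 0;
}

int main(void)
{
    int program[] = { 1, 1, 0 };  /* print, print, halt */
    return run(program);
}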
That's neat. I like how your code is so concise and readable. Now all you need to do is to emulate a display, a sound chip and a tape cassette drive. :-)
While academically interesting, I'll wager a guess that QEMU wins in every performance comparison, right? While writing a TCG-style thunking generator is not "simple", it is extremely fast.
I'm also not really sure what instruction set you're targeting here. This is a very small subset of IA-32(e) if I'm not mistaken.
This is, as I understand it, not an emulator but a VM core for running bytecode. A better comparison would be the JVM, or LLVM, or any of the VMs used by popular languages such as Ruby and Python.
Emulators like QEMU are designed to emulate an existing instruction set such as x86, and they also need to emulate a specific memory model, interrupt model, I/O subsystem etc. VMs such as in this post are designed without such constraints.
Both are designed around the core principle of a "virtual machine" — decode some kind of stream of machine code instructions, execute each instruction, and maintain the stack and memory state along the way — but they target different use cases with different nuances in the implementation.
if(vm->pProgram->instr[instr_idx] == MOV) arg0 = arg1;
else if(vm->pProgram->instr[instr_idx] == PUSH) stack_push(vm->pStack, arg0);
else if(vm->pProgram->instr[instr_idx] == POP) stack_pop(vm->pStack, arg0);
else if(vm->pProgram->instr[instr_idx] == INC) ++(arg0);
else if(vm->pProgram->instr[instr_idx] == DEC) --(arg0);
else if(vm->pProgram->instr[instr_idx] == ADD) arg0 += arg1;
else if(vm->pProgram->instr[instr_idx] == SUB) arg0 -= arg1;
else if(vm->pProgram->instr[instr_idx] == MUL) arg0 *= arg1;
else if(vm->pProgram->instr[instr_idx] == DIV) arg0 /= arg1;
else if(vm->pProgram->instr[instr_idx] == MOD) vm->pMemory->remainder = arg0 % arg1;
else if(vm->pProgram->instr[instr_idx] == REM) arg0 = vm->pMemory->remainder;
else if(vm->pProgram->instr[instr_idx] == NOT) arg0 = ~(arg0);
else if(vm->pProgram->instr[instr_idx] == XOR) arg0 ^= arg1;
else if(vm->pProgram->instr[instr_idx] == OR) arg0 |= arg1;
else if(vm->pProgram->instr[instr_idx] == AND) arg0 &= arg1;
else if(vm->pProgram->instr[instr_idx] == SHL) arg0 <<= arg1;
else if(vm->pProgram->instr[instr_idx] == SHR) arg0 >>= arg1;
else if(vm->pProgram->instr[instr_idx] == CMP) vm->pMemory->FLAGS = ((arg0 == arg1) | (arg0 > arg1) << 1);
else if(vm->pProgram->instr[instr_idx] == JMP) instr_idx = arg0 - 1;
else if(vm->pProgram->instr[instr_idx] == JE && (vm->pMemory->FLAGS & 0x1)) instr_idx = arg0 - 1;
else if(vm->pProgram->instr[instr_idx] == JNE && !(vm->pMemory->FLAGS & 0x1)) instr_idx = arg0 - 1;
else if(vm->pProgram->instr[instr_idx] == JG && (vm->pMemory->FLAGS & 0x2)) instr_idx = arg0 - 1;
else if(vm->pProgram->instr[instr_idx] == JGE && (vm->pMemory->FLAGS & 0x3)) instr_idx = arg0 - 1;
else if(vm->pProgram->instr[instr_idx] == JL && !(vm->pMemory->FLAGS & 0x3)) instr_idx = arg0 - 1;
else if(vm->pProgram->instr[instr_idx] == JLE && !(vm->pMemory->FLAGS & 0x2)) instr_idx = arg0 - 1;