Update: I changed it to a switch statement. The euler1 program went from 0.081s to 0.091s with gcc on a core i3.
The cascading if statements are indeed faster than a switch.
EOU - original comment follows…
Without the benefit of profiling, I can suggest it may not be as bad as it looks.
The relative frequency of the opcodes could make this faster than the switch. See MOV, PUSH, POP in the front? These likely have tiny, inline implementations and are also probably a large percentage of the opcodes executed. They may all fit in the same cache line and give the pipelines on deep-pipeline, branch-predicting machines lots of good stuff to work on.
Likewise, the infrequently occurring opcode comparisons in the back part of the statement, by definition, are almost always false, which should tell the branch predictors which way to go to again keep the pipelines full.
A switch on the other hand is pretty much a guaranteed pipeline flush.
But… if you asked me to code this control structure without being able to profile and to get it fast the first time… I'd use a switch. (well, maybe with a couple if statements out front if I knew I had a heavily lopsided distribution.)
Then I'd go reread that article that came by HN a couple weeks ago, turn the control structure inside out, and see how much better it was.
It's satisfying that you were able to say "There's a good chance you're wrong, here's why" and then be confirmed by testing. You obviously have a great deal of experience with low-level programming; relatively rare, nowadays. Mine comes from graphics programming.
Also, which article are you referring to? Any keywords I can search for?
But after looking at tinyvm it may be a special animal. The native implementations of its virtual opcodes are only one or two instructions. Locality, caching, and pipelining are working at their finest here, and it might be best to just let them do their thing.
I was quite curious about this, so I looked into the code (using GCC 4.4.5 on a 32-bit VM). Switch statements can be optimized in 2 ways: a binary search (log n checks/jumps) or a jump table (no checks and 2 jumps). Usually the jump table is used for dense cases and the binary search for sparse cases.
I will post the disassembly of the code as-is first. Then the same exact code but using a switch statement (change all the if/else into a 'case' and nothing else).
// loop header (break out of loop if we hit END)
1a: 8b 04 99 mov eax,DWORD PTR [ecx+ebx*4]
1d: 83 f8 ff cmp eax,0xffffffff
20: 0f 84 62 01 00 00 je 188 <run_vm+0x188>
26: 89 fa mov edx,edi
// inside of the loop begins here (there's a jmp 28 instruction at the bottom that I omitted here)
28: 8b 52 0c mov edx,DWORD PTR [edx+0xc]
// now edx == &vm->pProgram->args
// FAIL: vm->pProgram->args is loop invariant and shouldn't be reloaded every iteration
// this is either a failure to prove the load was redundant or a failure to give it its own register without spilling
2b: 83 f8 01 cmp eax,0x1
// why is this here? instruction scheduling magic probably.
2e: 8b 14 32 mov edx,DWORD PTR [edx+esi*1]
// esi is the instr_idx
// new edx is vm->pProgram->args[instr_idx]
31: 8b 32 mov esi,DWORD PTR [edx]
// esi is arg0 == vm->pProgram->args[instr_idx][0]
33: 8b 52 04 mov edx,DWORD PTR [edx+0x4]
// edx is arg1 == vm->pProgram->args[instr_idx][1]
36: 89 55 d4 mov DWORD PTR [ebp-0x2c],edx
// FAIL: spill vm->pProgram->args[instr_idx] to stack.
// why is this a fail? it's already on the stack, so we can reconstruct it slightly slower without spilling
39: 0f 84 01 01 00 00 je 140 <run_vm+0x140>
// finally the je for the cmp
// CMP/JE pairs for ELSE IFs follow until end
3f: 83 f8 02 cmp eax,0x2
42: 0f 84 48 01 00 00 je 190 <run_vm+0x190>
48: 83 f8 03 cmp eax,0x3
4b: 0f 84 5f 01 00 00 je 1b0 <run_vm+0x1b0>
51: 83 f8 04 cmp eax,0x4
54: 0f 84 0e 01 00 00 je 168 <run_vm+0x168>
5a: 83 f8 05 cmp eax,0x5
5d: 8d 76 00 lea esi,[esi+0x0]
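// (lea esi,[esi+0x0] is a multi-byte nop, presumably padding for alignment/scheduling)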
60: 0f 84 12 01 00 00 je 178 <run_vm+0x178>
66: 83 f8 06 cmp eax,0x6
69: 0f 84 61 01 00 00 je 1d0 <run_vm+0x1d0>
6f: 83 f8 07 cmp eax,0x7
72: 0f 84 70 01 00 00 je 1e8 <run_vm+0x1e8>
78: 83 f8 08 cmp eax,0x8
eax is the bytecode for the current instruction (vm->pProgram->instr[instr_idx]). Notice that for the first opcode it inlines the code (the MOV instruction). The rest of the time it's a cmp/jcc pair (I omitted the rest of the asm for clarity, but it's cmp/jcc all the way down).
So that alone means a MOV bytecode gets dispatched almost right away. Then for everything else (PUSH, POP, etc.) you have to go through a linearly increasing number of cmp/jcc pairs. So in this case, if the probability distribution of your interpreted program meant that most opcodes were MOV, PUSH, or POP, you could get pretty good performance. But once it goes through a few cmp/jcc pairs you have lost the performance benefit of using if/else, and it will get slower than using a switch.
There are also a few potential fails here (look at my FAIL comments). The value for &vm->pProgram->args gets recalculated every loop iteration, this is probably a register allocation fail (not enough registers on x86). And vm->pProgram->args[instr_idx] gets spilled to stack. That's really annoying and probably unnecessary.
---------
Now let's take a look at the switch statement! I was curious to see if it would generate a binary search table or a jump table. In theory it should be the latter since the opcodes are 0-25 inclusive (with -1 for END, which is actually 0xFFFFFFFF and is not easily a candidate to become a jump table member).
//loop header
1a: 8b 04 99 mov eax,DWORD PTR [ecx+ebx*4]
1d: 83 f8 ff cmp eax,0xffffffff
20: 74 48 je 6a <run_vm+0x6a>
22: 8d b6 00 00 00 00 lea esi,[esi+0x0]
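// (multi-byte nop, presumably aligning the loop entry)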
// loop beginning (there's an omitted jmp 28 lower on)
28: 8b 7e 0c mov edi,DWORD PTR [esi+0xc]
// now edi == &vm->pProgram->args
// FAIL: vm->pProgram->args is loop invariant and shouldn't be reloaded every iteration
// this is either a failure to prove the load was redundant or a failure to give it its own register without spilling
2b: 83 f8 19 cmp eax,0x19
// compare eax to 25 (JLE)
// compiler was unable to prove that the instruction is in the range [-1, 25]. this will always cause a >25 check every loop iteration (bad)
// i wanted to try adding a __builtin_unreachable() as a default switch case to fix this, unfortunately i don't have gcc 4.5
// also maybe doing a switch on eax & 0x20 so it knows its less than 32, 32 might be small enough to generate 7 blank jump table entries
2e: 8b 14 17 mov edx,DWORD PTR [edi+edx*1]
// new edx is vm->pProgram->args[instr_idx]
31: 8b 3a mov edi,DWORD PTR [edx]
// arg0
33: 8b 52 04 mov edx,DWORD PTR [edx+0x4]
// arg1
36: 89 55 d4 mov DWORD PTR [ebp-0x2c],edx
// FAIL: spill vm->pProgram->args[instr_idx] to stack.
// why is this a fail? we can always get it from [esi+0xc] like at the beginning of the loop
39: 77 1d ja 58 <run_vm+0x58>
// aha, that's what the cmp eax, 0x19 was for previously... refer to previous comment
3b: ff 24 85 00 00 00 00 jmp DWORD PTR [eax*4+0x0]
// jmp table badassery. but why did it take so long to get here?
// this is a JMP r/m32 (jump near, absolute indirect)
// the 0x0 gets set to a jump table offset by the linker later (i decompiled the .o file only)
// the jmp table itself will look like JMP address-back-into-run_vm-as-a-constant
This is basically the same code as before, except that there's an additional cmp/ja instruction pair to check that the instruction code is in the [0,25] range (and of course the jmp to the jump table instead of the if/else pairs). That's a per-iteration cost for a branch that will never be taken, and it ruins the branch prediction of the first few iterations.
Also the indirect jmp does make it a bit harder to do instruction prefetching, but this should be fairly straightforward since the value of 'eax' is known from the beginning of the loop, a good 8 instructions before the jump (and the prefetching is trivial for the direct jumps). Another consideration is the L1 i-cache, which is 32K on a Nehalem. A quick google search for the micro-ops cache reveals it to be 12K. Since the loop is fairly tiny, both of these easily fit into those caches with probably at least an order of magnitude to spare.
Supposing that the redundant cmp/ja is removed, we are now on a more even field. It should be faster for pretty much anything except the first case (instruction is a MOV).
jmp [indirect abs]
jmp direct abs
vs
cmp eax, 0x0 // is EAX a mov?
jg somewhere_else // skip the following section if it's not a MOV
The second sequence will always be faster if we're always hitting MOVs; don't ask me to prove this, but it's a gut feeling that there are fewer micro-ops involved in the second case. The break-even point will probably be when the instructions are always PUSH (1) or POP (2). But after that the jump table will always be faster. So I think overall, unless all the instructions are MOV/PUSH/POP (haha, unlikely), the jmp table solution should be faster with the cmp/ja removed.
That being said, I think I've spent too much time on this as-is and will not be trying to remove the cmp/ja.
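For reference, that __builtin_unreachable() hint would look something like this (untested, since as I said I don't have GCC 4.5):

switch(vm->pProgram->instr[instr_idx]) {
case MOV: arg0 = arg1; break;
/* ... the other 25 opcodes ... */
default:
    __builtin_unreachable(); /* promises the opcode always hits a case, so GCC can drop the cmp/ja bounds check */
}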
-----------
Summary:
GCC doesn't generate the best code, and there are some things that are hard for even the compiler to optimize. Sometimes you have to massage the compiler into doing what you want, and that doesn't mean the theoretical case for preferring switches over chained if/elses is flawed.
It's my gut feeling that if we can get the optimizer in this case to work slightly better, the switch statement should always be faster than the if/elses.
It also would've been interesting to see the x86_64 disassembly, but of course then we'd have the downside of 64-bit operands everywhere, which leads to larger instruction encodings, and the upside of fewer register spills.
For the benefit of those who don't understand this, have a look at http://en.wikipedia.org/wiki/Threaded_code where it explains the different ways one can process byte codes.
The key issue here is that the efficiency of the dispatch loop determines the performance of a VM.
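One simple variant from that page is call threading, i.e. a function-pointer dispatch table. A minimal sketch (the names here are illustrative, not tinyvm's):

typedef struct vm vm_t;               /* opaque VM state */
typedef void (*op_fn)(vm_t *vm);

extern int fetch_opcode(vm_t *vm);    /* assumed: returns the next opcode and advances the pc */

static void op_mov(vm_t *vm)  { (void)vm; /* arg0 = arg1 */ }
static void op_push(vm_t *vm) { (void)vm; /* stack_push(...) */ }
/* ... one handler per opcode ... */

static const op_fn dispatch[] = { op_mov, op_push /* , ... */ };

void run(vm_t *vm)
{
    for (;;)                          /* a real END handler would break out via a flag */
        dispatch[fetch_opcode(vm)](vm);
}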
That is surprisingly bad. It reminds me of when I first heard that you could get fast sine/cosine approximations with table lookups -- not fully understanding this idea I wrote a program that generated the following (I was in high school, young and naive):
double sin(double x) {
    if (x >= 0.00 && x < 0.01) return 0;
    if (x >= 0.01 && x < 0.02) return 0.01;
    // ...
}
At least you wrote the program to generate it rather than typing all that by hand. I think you got the gist of the idea. In any case, would all those branches be faster than the comparatively expensive floating-point arithmetic?
No. At best you could do a binary search using ifs to simulate a lookup table, but you are going to trash the pipeline on a modern CPU. If you want a reasonably complex lookup table, you are much better off with something like:
float lookup_Sine(float x) {
    int i = (int)(100.0f * fmodf(x, 360.0f));  /* needs <math.h>; assumes x >= 0 */
    return sine_table[i];
}
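where sine_table would be filled once at startup, something like this (sizes chosen to match the 0.01-degree steps above):

#include <math.h>

#define TABLE_SIZE 36000                 /* 360 degrees at 0.01-degree resolution */
static float sine_table[TABLE_SIZE];

void init_sine_table(void)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        sine_table[i] = sinf((float)i * (float)M_PI / 18000.0f);  /* i/100 degrees, converted to radians */
}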
Anyway, for a modern CPU hitting main memory takes a lot of cycles. So you need something a lot more complex than a simple trig function to make it worth it.
Yes, that's very odd. At least they could have used a case statement, which a C compiler will be able to optimize into an efficient jump table.
I remember reading about a VM (I remember it as being Google's V8, but that one compiles directly to machine code so probably I'm misremembering; maybe it was Strongtalk?) which basically ditched the classic bytecode loop (grab next instruction, check what it is, evaluate it, repeat) and instead implemented each bytecode type as a function and somehow munged the instruction pointer or modified the VM's machine code itself so that the CPU would move directly to the next instruction's function without jumping. Damn, I wish I could remember the idea. Anyone know what I'm talking about?
You are probably talking about threaded code which is described in "The Structure and Performance of Efficient Interpreters"
http://www.jilp.org/vol5/v5paper12.pdf
Thanks, that's an interesting paper, but I hadn't heard of it; I'm thinking specifically of a VM that implemented the technique. Not sure if we are talking about the same algorithm, either, although it sounds similar.
That just makes the case for switch expressions even stronger. If you had those, this would be a damn simple switch statement and it'd provide lots of hints to the compiler for speedups. (All but the jump instructions that modify the program counter could go in a straight switch statement.)
I'm very tempted to add type tagging into this and write a Burroughs B5000 emulator. That'd be a neat side project.
Theoretically, it could do, at least in this case, I think? Very risky to rely on that though! I doubt compiler writers have people who haven't heard of the switch statement in mind when they decide which optimizations to put in.
(The "fast" branch uses a switch statement - I guess the author is going for minimum number of lines in this one? I'm not sure saving 2-3 lines compared to a squeezed-together switch statement is a great tradeoff, though. You could get better results by just putting everything in one source file. Easier deployment, too. I bet it wouldn't affect readability significantly.)
It can. However, this is a very uncommon pattern, so it's not worth trying very hard to detect. On top of that, I'd expect the compiler to bail after the pattern gets to a certain size.
This made me think back to a tiny virtual machine that took less than 300 lines -- in fact it took less than 500 bytes of code (thanks to the Woz): http://en.wikipedia.org/wiki/SWEET16
As a C programmer I find it quite odd/unnerving actually. It may be simple to understand its structure and it may look clean, but that isn't much to say of a project this size, especially if you know what it does.
I found the file division & include pattern quite strange actually, and don't get me started on the cascading "if/else if" for the instruction dispatch: not only is it almost impossible for the compiler to optimize, it's also more difficult to read than the obvious switch. Also, if the instructions were enums instead of defines, the compiler could even generate warnings when there are unimplemented opcodes (see the sketch below). I don't use github but I'm tempted to make a branch and a pull request because it's driving me crazy!
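For instance (a sketch, not tinyvm's actual definitions; gcc warns via -Wswitch, which -Wall enables):

enum opcode { MOV, PUSH, POP /* , ... */ };

void dispatch_one(enum opcode op)
{
    switch (op) {                 /* deliberately no default: */
    case MOV:  /* arg0 = arg1 */  break;
    case PUSH: /* stack_push() */ break;
    /* gcc: "warning: enumeration value 'POP' not handled in switch" */
    }
}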
EDIT: I was about to fork it when I noticed that in the other two branches (fast & simple, default is master) the dispatch is implemented with a switch. Please somebody explain to me why this wasn't moved to the default branch...
Really? How would dereferencing a NULL pointer cause "some really nasty crashes"?
Yes, it will crash. That is a Good Thing(tm).
1) It is extremely unlikely that malloc() will ever return NULL (on a PC). This is due to virtual memory. If you manage to allocate more than 1.5GB of memory, then yes. Otherwise no.
2) Even if malloc() does return NULL, then NULL checking it is pointless because program operation cannot realistically continue. You need that memory. The best and cleanest solution is to crash. You will get a call trace leading to the exact point of failure (which you won't get if you do NULL checks + continue program execution). Also, as far as I know, it's almost impossible for a dereferenced NULL pointer to be a security vulnerability, unlike e.g. a buffer overrun.
In summary, just let it crash. There isn't any reason not to, unless you're developing a "third-party library" (e.g. you're a Lua developer, etc) and you want to leave the decision of whether to crash to the programmers using your code.
Are you kidding? Most real software does something like allocating some emergency memory at startup time, and then handles a failing malloc call by using that reserve pool to save work, perhaps do a GC, and print a decent error message before quitting.
Losing all of the user's work and printing "Segmentation fault (core dumped)" is never the right answer.
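A rough sketch of that pattern (the names and pool size are illustrative; save_work_and_quit stands in for whatever your app's graceful-exit path is):

#include <stdlib.h>

static void *reserve;                 /* emergency pool, grabbed at startup */

extern void save_work_and_quit(void); /* hypothetical: GC, save, print a decent message */

void reserve_init(void) { reserve = malloc(1 << 20); }  /* hold back e.g. 1MB */

void *robust_malloc(size_t n)
{
    void *p = malloc(n);
    if (p == NULL) {
        free(reserve);                /* release the pool so the cleanup path can allocate */
        reserve = NULL;
        save_work_and_quit();
    }
    return p;
}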
The point is, the crash you describe will only occur on 32-bit operating systems, and only after you've allocated more than 1.5GB of memory. For most projects, that is extremely unlikely.
Your solution would work. But the extra code complexity, development time, and maintenance isn't worth it except in very, very specific scenarios, like if you're writing a third-party library such as Lua or FMOD or FreeType or... etc.
Here's an alternative solution that doesn't require NULL checking malloc, and provides all of the same benefits:
1) create a function (I call mine "Sys_Exit()") which allows you to cleanly exit your application from anywhere in your codebase. For example in my game engine, my main() function looks like:
int main()
{
    App_Startup();
    while ( App_Frame() ) {}
    App_Shutdown();
    return 0;
}
2) write a "my_malloc" function which forwards to malloc. Use it for all memory allocation. When it detects that more than 1.2GB of memory has been allocated, then it prints an error message and calls Sys_Exit().
This allows you to save the user's work, etc, and your shutdown sequence has a fairly large amount of remaining memory to work with (so that you don't truly run out of memory after you've fake-run-out-of-memory).
This is a very simple solution if you really care about the extremely unlikely out-of-memory crash.
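In code, the wrapper is just a few lines (a sketch; the counter and limit are the ones described above, and a real version would also hook free() to decrement the count):

#include <stdio.h>
#include <stdlib.h>

extern void Sys_Exit(void);                 /* runs the normal shutdown sequence */

#define SOFT_LIMIT (1200u * 1024u * 1024u)  /* the 1.2GB threshold */

static size_t total_allocated;

void *my_malloc(size_t n)
{
    total_allocated += n;
    if (total_allocated > SOFT_LIMIT) {
        fprintf(stderr, "out of memory\n");
        Sys_Exit();                         /* plenty of headroom left for saving the user's work */
    }
    return malloc(n);
}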
----
tl;dr: It is almost always a bad idea to NULL check malloc(). It tends to destroy readability, is error prone, and is hard to maintain, for very little practical benefit.
Even ignoring the possibility of out-of-memory crashes, null pointer bugs can lead to memory corruption and code execution and have been popular targets in recent years: http://www.google.com/search?sourceid=chrome&ie=UTF-8...
Edit: http://flashmypassion.blogspot.com/2008/04/this-new-vulnerab... is a copy of a blog post (which seems to be missing from the Matasano chargen; odd) about Mark Dowd's crazy Flash null-pointer vuln. The first comment from Dino Dai Zovi (another crazy good security researcher) is very relevant:
> Oh, the sweetsweet vindication. I remember an argument that I have had twice in the last several years about whether one should check the return value of malloc. I argued that it should always be done for two reasons: Reading address zero might crash, but not necessarily zero-plus-offset and because a compare register to zero is so free performance-wise that it isn’t even funny.
> Guess who was arguing the contrary
Edit #2: tptacek's responses on the thread are quite interesting, and I agree. These are your friends:
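(Presumably the classic x-prefixed allocation wrappers; a minimal xmalloc, for anyone who hasn't seen one:)

#include <stdio.h>
#include <stdlib.h>

/* like malloc, but never returns NULL */
void *xmalloc(size_t n)
{
    void *p = malloc(n);
    if (p == NULL) {
        fprintf(stderr, "out of memory (%zu bytes)\n", n);
        abort();                    /* fail loudly at the point of failure */
    }
    return p;
}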
Argh, this is so often overlooked. NULL POINTER DEREFERENCE IS A POSSIBLE SECURITY HOLE. Sorry about the shouting, I've gone through this so much I feel like my brain is going to explode. Here, read this: http://lwn.net/Articles/360328/
Don't trust NULL pointer dereference. Even in user code -- it puts you in bad habits.
Writing and using these functions is not "error prone" or "hard to maintain", and does not "destroy readability". It adds a single letter to each of these function calls and avoids writing any more error-handling code.
That's a completely reasonable solution, but it's not the one you took. It's not the path of least resistance, but when you're programming in C, writing robust code is very rarely the path of least resistance.
With regard to your nonsense about "will only ever happen on 32-bit OSes", I run programs under VM size ulimits every day. This prevents them from forcing the system into swap and becoming unresponsive, by causing their allocations to fail. Some of them (e.g. less) usually handle running out of memory quite well. I happen to be using 32-bit OSes, but that's irrelevant; I would do the same thing on 64-bit OSes.
A friend asked me if this was a good example to learn C from. I had to say no; in half an hour of looking at the code, I found seven bugs in a single 60-line routine. Details at:
Try opening it up in an editor/IDE with ctrl+click to step into a function (or similar), like gvim or Eclipse, and hopefully with a way to jump back. Eclipse also has a nice expand-macro feature which lets you look at macros fully expanded and at different expansion stages.
darn forgot to mention ctags.
If you are up for it, supposedly cscope is better, but I couldn't get it set up. I don't know of any vim plugin for Eclipse-like macro expansion, but if you see hairy macros you can always do gcc -E | vim - (gcc -E does full includes, so macro expansion + importing of all headers) and search for your code or jump to the bottom.
Looking at the instruction set it doesn't even support subroutines. It would only need a way to get and set the current program counter (like ARM) and you could implement subroutine calls using push, pop, and jmp.
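Baked directly into the dispatch, hypothetical CALL/RET opcodes would only be a few lines (a sketch in the style of the existing if/else chain; I'm assuming stack_pop can write through a pointer):

else if(vm->pProgram->instr[instr_idx] == CALL) {
    stack_push(vm->pStack, instr_idx + 1);  /* save the return address */
    instr_idx = arg0 - 1;                   /* jump to the subroutine (-1 because the loop increments) */
}
else if(vm->pProgram->instr[instr_idx] == RET) {
    int ret_addr;
    stack_pop(vm->pStack, &ret_addr);
    instr_idx = ret_addr - 1;               /* resume right after the CALL */
}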
This is probably more appropriate for the Github support forum than some random HN posting about a project on Github. Your post adds nothing to this discussion.
Ah, I love virtual machines/cpus :] Some weeks ago I found out that you can create pointers to goto labels in C (gcc and clang, that is; it's a non-standard extension afaik).
I then implemented a super small subset of the 6502 to test how messy the code would look:
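For anyone who hasn't seen the extension, here's a toy two-opcode dispatcher (not my 6502 code, just the bare mechanism):

#include <stdio.h>

int run(const int *code)
{
    static void *dispatch[] = { &&op_halt, &&op_print };  /* &&label: GCC's address-of-label */
    int pc = 0;

    goto *dispatch[code[pc]];

op_print:
    puts("hello from bytecode");
    ++pc;
    goto *dispatch[code[pc]];

op_halt:
    return 0;
}

int main(void)
{
    int program[] = { 1, 1, 0 };  /* print, print, halt */
    return run(program);
}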
That's neat. I like how your code is so concise and readable. Now all you need to do is to emulate a display, a sound chip and a tape cassette drive. :-)
While academically interesting, I'll wager a guess that QEMU wins in every performance comparison, right? While writing a TCG-style thunking generator is not "simple", it is extremely fast.
I'm also not really sure what instruction set you're targeting here. This is a very small subset of IA-32(e) if I'm not mistaken.
This is, as I understand it, not an emulator but a VM core for running bytecode. A better comparison would be the JVM, or LLVM, or any of the VMs used by popular languages such as Ruby and Python.
Emulators like QEMU are designed to emulate an existing instruction set such as x86, and they also need to emulate a specific memory model, interrupt model, I/O subsystem etc. VMs such as in this post are designed without such constraints.
Both are designed around the core principle of a "virtual machine" — decode some kind of stream of machine code instructions, execute each instruction, and maintain the stack and memory state along the way — but they target different use cases with different nuances in the implementation.
if(vm->pProgram->instr[instr_idx] == MOV) arg0 = arg1;
else if(vm->pProgram->instr[instr_idx] == PUSH) stack_push(vm->pStack, arg0);
else if(vm->pProgram->instr[instr_idx] == POP) stack_pop(vm->pStack, arg0);
else if(vm->pProgram->instr[instr_idx] == INC) ++(arg0);
else if(vm->pProgram->instr[instr_idx] == DEC) --(arg0);
else if(vm->pProgram->instr[instr_idx] == ADD) arg0 += arg1;
else if(vm->pProgram->instr[instr_idx] == SUB) arg0 -= arg1;
else if(vm->pProgram->instr[instr_idx] == MUL) arg0 *= arg1;
else if(vm->pProgram->instr[instr_idx] == DIV) arg0 /= arg1;
else if(vm->pProgram->instr[instr_idx] == MOD) vm->pMemory->remainder = arg0 % arg1;
else if(vm->pProgram->instr[instr_idx] == REM) arg0 = vm->pMemory->remainder;
else if(vm->pProgram->instr[instr_idx] == NOT) arg0 = ~(arg0);
else if(vm->pProgram->instr[instr_idx] == XOR) arg0 ^= arg1;
else if(vm->pProgram->instr[instr_idx] == OR) arg0 |= arg1;
else if(vm->pProgram->instr[instr_idx] == AND) arg0 &= arg1;
else if(vm->pProgram->instr[instr_idx] == SHL) arg0 <<= arg1;
else if(vm->pProgram->instr[instr_idx] == SHR) arg0 >>= arg1;
else if(vm->pProgram->instr[instr_idx] == CMP) vm->pMemory->FLAGS = ((arg0 == arg1) | (arg0 > arg1) << 1);
else if(vm->pProgram->instr[instr_idx] == JMP) instr_idx = arg0 - 1;
else if(vm->pProgram->instr[instr_idx] == JE && (vm->pMemory->FLAGS & 0x1)) instr_idx = arg0 - 1;
else if(vm->pProgram->instr[instr_idx] == JNE && !(vm->pMemory->FLAGS & 0x1)) instr_idx = arg0 - 1;
else if(vm->pProgram->instr[instr_idx] == JG && (vm->pMemory->FLAGS & 0x2)) instr_idx = arg0 - 1;
else if(vm->pProgram->instr[instr_idx] == JGE && (vm->pMemory->FLAGS & 0x3)) instr_idx = arg0 - 1;
else if(vm->pProgram->instr[instr_idx] == JL && !(vm->pMemory->FLAGS & 0x3)) instr_idx = arg0 - 1;
else if(vm->pProgram->instr[instr_idx] == JLE && !(vm->pMemory->FLAGS & 0x2)) instr_idx = arg0 - 1;